Consul is a distributed, highly available, and data center aware solution to connect and configure applications across dynamic, distributed infrastructure.

Consul provides several key features:

  • Multi-Datacenter - Consul is built to be datacenter aware, and can support any number of regions without complex configuration.

  • Service Mesh/Service Segmentation - Consul Connect enables secure service-to-service communication with automatic TLS encryption and identity-based authorization. Applications can use sidecar proxies in a service mesh configuration to establish TLS connections for inbound and outbound connections without being aware of Connect at all.

  • Service Discovery - Consul makes it simple for services to register themselves and to discover other services via a DNS or HTTP interface. External services such as SaaS providers can be registered as well.

  • Health Checking - Health Checking enables Consul to quickly alert operators about any issues in a cluster. The integration with service discovery prevents routing traffic to unhealthy hosts and enables service level circuit breakers.

  • Key/Value Storage - A flexible key/value store enables storing dynamic configuration, feature flagging, coordination, leader election and more. The simple HTTP API makes it easy to use anywhere.

Consul runs on Linux, macOS, FreeBSD, Solaris, and Windows and includes an optional browser based UI. A commercial version called Consul Enterprise is also available.

Please note: We take Consul's security and our users' trust very seriously. If you believe you have found a security issue in Consul, please responsibly disclose by contacting us at [email protected].

Quick Start

A few quick start guides are available on the Consul website:

Documentation

Full, comprehensive documentation is available on the Consul website:

https://www.consul.io/docs

Contributing

Thank you for your interest in contributing! Please refer to CONTRIBUTING.md for guidance. For contributions specifically to the browser based UI, please refer to the UI's README.md for guidance.

Owner
HashiCorp
Consistent workflows to provision, secure, connect, and run any infrastructure for any application.
Comments
  • Node health flapping - EC2

    Node health flapping - EC2

    We have a five-node Consul cluster handling roughly 30 nodes across 4 different AWS accounts in a shared VPC, spread across different availability zones. For the most part, everything works great. However, quite frequently a random node will flap from healthy to critical. The flapping happens on completely random nodes, with no consistency whatsoever.

    Every time a node "flaps", it causes consul-template, which populates our NGINX reverse-proxy config, to reload. This causes things like our Apache benchmark tests to fail.

    We are looking to use Consul in production, but this issue has caused a lot of people to worry about its consistency.

    We also have all of the required TCP/UDP ports open between all the nodes.

    We believe the issue is just a latency problem with serf's polling. Is there a way to modify the serf health-check interval to adjust for geographical latency? (See the note after the log below.)

    Here's the log from one of the Consul servers:

        2015/09/01 17:46:13 [INFO] serf: EventMemberJoin: ip-10-170-15-71 10.170.15.71
        2015/09/01 17:46:13 [INFO] memberlist: Marking ip-10-190-71-44 as failed, suspect timeout reached
        2015/09/01 17:46:13 [INFO] serf: EventMemberFailed: ip-10-190-71-44 10.190.71.44
        2015/09/01 17:46:15 [INFO] memberlist: Marking ip-10-190-82-4 as failed, suspect timeout reached
        2015/09/01 17:46:15 [INFO] serf: EventMemberFailed: ip-10-190-82-4 10.190.82.4
        2015/09/01 17:46:16 [INFO] serf: EventMemberJoin: ip-10-190-82-4 10.190.82.4
        2015/09/01 17:46:26 [INFO] serf: EventMemberFailed: ip-10-170-15-71 10.170.15.71
        2015/09/01 17:46:32 [INFO] serf: EventMemberJoin: ip-10-190-71-44 10.190.71.44
        2015/09/01 17:47:05 [INFO] serf: EventMemberFailed: ip-10-170-138-228 10.170.138.228
        2015/09/01 17:47:19 [INFO] memberlist: Marking ip-10-170-155-168 as failed, suspect timeout reached
        2015/09/01 17:47:19 [INFO] serf: EventMemberFailed: ip-10-170-155-168 10.170.155.168
        2015/09/01 17:47:23 [INFO] serf: EventMemberJoin: ip-10-170-15-71 10.170.15.71
        2015/09/01 17:47:44 [INFO] memberlist: Marking ip-10-190-71-44 as failed, suspect timeout reached
        2015/09/01 17:47:44 [INFO] serf: EventMemberFailed: ip-10-190-71-44 10.190.71.44
        2015/09/01 17:47:45 [INFO] serf: EventMemberJoin: ip-10-170-155-168 10.170.155.168
        2015/09/01 17:47:45 [INFO] serf: EventMemberJoin: ip-10-190-71-44 10.190.71.44
        2015/09/01 17:47:49 [INFO] memberlist: Marking ip-10-170-76-170 as failed, suspect timeout reached
        2015/09/01 17:47:49 [INFO] serf: EventMemberFailed: ip-10-170-76-170 10.170.76.170
        2015/09/01 17:47:50 [INFO] serf: EventMemberJoin: ip-10-170-76-170 10.170.76.170
        2015/09/01 17:47:50 [INFO] serf: EventMemberJoin: ip-10-170-138-228 10.170.138.228
        2015/09/01 17:48:00 [INFO] memberlist: Marking ip-10-170-155-168 as failed, suspect timeout reached
        2015/09/01 17:48:00 [INFO] serf: EventMemberFailed: ip-10-170-155-168 10.170.155.168
        2015/09/01 17:48:02 [INFO] serf: EventMemberFailed: ip-10-185-23-211 10.185.23.211
        2015/09/01 17:48:16 [INFO] serf: EventMemberJoin: ip-10-185-23-211 10.185.23.211
        2015/09/01 17:48:32 [INFO] memberlist: Marking ip-10-170-15-71 as failed, suspect timeout reached
        2015/09/01 17:48:32 [INFO] serf: EventMemberFailed: ip-10-170-15-71 10.170.15.71
        2015/09/01 17:48:33 [INFO] serf: EventMemberJoin: ip-10-170-15-71 10.170.15.71
        2015/09/01 17:48:45 [INFO] serf: EventMemberFailed: ip-10-185-23-210 10.185.23.210
        2015/09/01 17:48:46 [INFO] serf: EventMemberJoin: ip-10-185-23-210 10.185.23.210
        2015/09/01 17:48:55 [INFO] serf: EventMemberJoin: ip-10-170-155-168 10.170.155.168
        2015/09/01 17:49:00 [INFO] memberlist: Marking ip-10-190-82-4 as failed, suspect timeout reached
        2015/09/01 17:49:00 [INFO] serf: EventMemberFailed: ip-10-190-82-4 10.190.82.4
        2015/09/01 17:49:20 [INFO] serf: EventMemberJoin: ip-10-190-82-4 10.190.82.4
        2015/09/01 17:49:32 [INFO] memberlist: Marking ip-10-190-82-4 as failed, suspect timeout reached
        2015/09/01 17:49:32 [INFO] serf: EventMemberFailed: ip-10-190-82-4 10.190.82.4
        2015/09/01 17:49:38 [INFO] serf: EventMemberFailed: ip-10-170-155-168 10.170.155.168
        2015/09/01 17:49:38 [INFO] serf: EventMemberJoin: ip-10-170-155-168 10.170.155.168
        2015/09/01 17:49:40 [INFO] serf: EventMemberJoin: ip-10-190-82-4 10.190.82.4
        2015/09/01 17:49:51 [INFO] memberlist: Marking ip-10-190-82-4 as failed, suspect timeout reached
        2015/09/01 17:49:51 [INFO] serf: EventMemberFailed: ip-10-190-82-4 10.190.82.4
        2015/09/01 17:49:52 [INFO] serf: EventMemberJoin: ip-10-190-82-4 10.190.82.4
        2015/09/01 17:49:56 [INFO] serf: EventMemberFailed: ip-10-185-15-217 10.185.15.217
        2015/09/01 17:49:56 [INFO] serf: EventMemberJoin: ip-10-185-15-217 10.185.15.217
        2015/09/01 17:50:04 [INFO] memberlist: Marking ip-10-190-13-188 as failed, suspect timeout reached
        2015/09/01 17:50:04 [INFO] serf: EventMemberFailed: ip-10-190-13-188 10.190.13.188
        2015/09/01 17:50:05 [INFO] serf: EventMemberJoin: ip-10-190-13-188 10.190.13.188
        2015/09/01 17:50:20 [INFO] serf: EventMemberFailed: ip-10-185-77-94 10.185.77.94
        2015/09/01 17:50:24 [INFO] memberlist: Marking ip-10-185-65-7 as failed, suspect timeout reached
        2015/09/01 17:50:24 [INFO] serf: EventMemberFailed: ip-10-185-65-7 10.185.65.7
        2015/09/01 17:50:31 [INFO] serf: EventMemberJoin: ip-10-185-77-94 10.185.77.94
        2015/09/01 17:50:47 [INFO] serf: EventMemberJoin: ip-10-185-65-7 10.185.65.7
        2015/09/01 17:51:01 [INFO] serf: EventMemberFailed: ip-10-170-15-71 10.170.15.71
        2015/09/01 17:51:02 [INFO] serf: EventMemberJoin: ip-10-170-15-71 10.170.15.71
        2015/09/01 17:51:09 [INFO] memberlist: Marking ip-10-190-82-4 as failed, suspect timeout reached
        2015/09/01 17:51:09 [INFO] serf: EventMemberFailed: ip-10-190-82-4 10.190.82.4
        2015/09/01 17:51:43 [INFO] serf: EventMemberJoin: ip-10-190-82-4 10.190.82.4
        2015/09/01 17:51:45 [INFO] memberlist: Marking ip-10-170-15-71 as failed, suspect timeout reached
        2015/09/01 17:51:45 [INFO] serf: EventMemberFailed: ip-10-170-15-71 10.170.15.71
        2015/09/01 17:51:45 [INFO] serf: EventMemberJoin: ip-10-170-15-71 10.170.15.71
        2015/09/01 17:52:22 [INFO] memberlist: Marking ip-10-190-82-4 as failed, suspect timeout reached
        2015/09/01 17:52:22 [INFO] serf: EventMemberFailed: ip-10-190-82-4 10.190.82.4
        2015/09/01 17:52:30 [INFO] serf: EventMemberJoin: ip-10-190-82-4 10.190.82.4
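
    A note on the tuning question above: the serf/memberlist timings were not user-configurable in agents of this era, but newer Consul releases expose them in the agent configuration. As a rough sketch (assuming a version that supports the gossip_lan block; the key names follow the agent configuration docs, and the values here are illustrative), failure detection can be made more forgiving of network and CPU latency spikes by raising the probe timeout and suspicion multiplier:

        {
          "gossip_lan": {
            "probe_interval": "2s",
            "probe_timeout": "1s",
            "suspicion_mult": 6,
            "retransmit_mult": 4
          }
        }

    Higher values trade slower failure detection for fewer false positives; flapping that correlates with saturated or CPU-starved instances is usually better fixed at the host level.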
    
  • Consul servers won't elect a leader

    Consul servers won't elect a leader

    I have 3 consul servers running (+ a handful of other nodes), and they can all speak to each other - or so I think; at least they're sending UDP messages between themselves. The logs still show [ERR] agent: failed to sync remote state: No cluster leader, so even if the servers know about each other, it looks like they fail to perform an actual leader election... Is there a way to trigger a leader election manually?

    I'm running consul 0.5.2 on all nodes.
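
    A hedged sketch of how this is usually diagnosed on 0.5.x (paths and addresses below are placeholders): check that every server actually sees the expected raft peer set, and if the peer set is wrong after an outage, the documented recovery for pre-0.7 servers is to stop all of them and write the full peer list into raft/peers.json before starting them again.

        # On each server: confirm the raft peer count and state.
        consul info | grep -E 'num_peers|state ='

        # Outage recovery (all servers stopped; data_dir and IPs are placeholders):
        echo '["10.0.0.1:8300","10.0.0.2:8300","10.0.0.3:8300"]' > /var/consul/raft/peers.json

    There is no API to trigger an election directly; elections happen automatically once a quorum of servers agree on the peer set.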

  • Frequent membership loss for 50-100 node cluster in AWS

    Frequent membership loss for 50-100 node cluster in AWS

    The main problem we're seeing is that we have KV data associated with a lock, and hence a session, and the agents associated with those sessions "frequently" lose membership briefly, invalidating the associated KV data. By "frequently", I mean that in a cluster consisting of 3 consul servers and 50-100 other nodes running consul agents, we see isolated incidents of an agent losing membership roughly every 3-4 hours (sometimes ~30 mins apart, sometimes 7-8 hours apart, but it generally seems to be randomly distributed with an "average" of 3-4 hours).

    We have a heterogeneous cluster configured as a LAN, deployed across 3 AZs in EC2. There appears to be no correlation between the type of node and the loss of membership (e.g. if we have 10 instances of service A and 50 of service B, we will see roughly 2 membership losses among A instances and 10 among B instances over a period of a few days). It also appears to be independent of stress on the system, i.e. we see roughly the same distribution of membership loss when the system is idle as we do when it's under heavy load.

    We use consul for service discovery/DNS as well as a KV store. We use the KV store specifically for the purpose of:

    • maintaining presence -- certain nodes maintain the presence of a key with a TTL on it to indicate they are up and running, and when those disappear another node can react and re-schedule the workload that was being performed by that node, and
    • locks -- for multiple instances of the same service to determine which one actually provides the service, in the case where the service needs to be a singleton

    In addition to the occasional member loss, we also have the issue that when we roll the consul server cluster, which triggers a leader election, the KV store is temporarily unavailable. This potentially prevents a presence maintainer from bumping its lock before the TTL expires.

    Questions:

    1. Are there ways to configure things so that membership is more robust/more forgiving? We've seen that WAN has more conservative parameters, e.g. a larger SuspicionMult(iplicationFactor), but switching to WAN feels like it would just be papering over the problem and would only reduce its frequency.
    2. Is there any advice for a better way to implement "presence maintenance"? Because we use locks, and hence sessions, we're sensitive to the lossiness of UDP and the probabilistic nature of gossip, even though the application maintaining its presence in the KV store is running fine. For example, is there a way to establish a lock but side-step the coupling to its agent's membership? (See the sketch at the end of this comment.)
    3. What information would help diagnose this problem? We have plenty of server and agent configuration, consul server logs, consul agent logs, and application logs from services running alongside consul agents, all spanning several days.

    Thanks, Amit + @matt-royal
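
    On question 2, one pattern worth sketching against the documented session API (the name, TTL, and key path below are illustrative): sessions are only coupled to gossip health through the default serfHealth check, so a session created with an empty Checks list and a TTL is invalidated when the TTL lapses or the session is destroyed, not when a brief membership flap marks the agent critical.

        # Create a TTL-only session (no serfHealth check attached).
        curl -s -X PUT -d '{"Name": "presence-example", "TTL": "30s", "Checks": []}' \
            http://localhost:8500/v1/session/create

        # Acquire the lock key with that session, then renew well inside the TTL.
        curl -s -X PUT -d 'payload' "http://localhost:8500/v1/kv/service/example/leader?acquire=<session-id>"
        curl -s -X PUT "http://localhost:8500/v1/session/renew/<session-id>"

    The trade-off is that a crashed holder now keeps the lock until the TTL expires, so the TTL has to be chosen against how quickly failover needs to happen.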

  • Unable to deregister a service

    Unable to deregister a service

    I brought this to the attention of the mailing list here. @slackpad asked me to go ahead and file a bug. Below is a summary of the issue from the discussion thread.

    We have services that are being orphaned and we cannot deregister them. The orphans show up under one or more of the master nodes. In our configuration the master nodes are dev-consul, dev-consul-s1, and dev-broker.

    The health check of the orphaned node looks something like the following:

    {
        "Node": "dev-consul",
        "CheckID": "service:discussion_8080",
        "Name": "Service 'discussion' check",
        "ServiceName": "discussion",
        "Notes": "",
        "Status": "critical",
        "ServiceID": "discussion_8080",
        "Output": ""
    }
    

    I attempted to deregister via:

    user@dev-consul $ curl -X PUT -d '{"CheckID": "service:discussion_8080", "ServiceID": "discussion_8080", "Node": "dev-consul", "Datacenter": "dev"}' http://localhost:8500/v1/catalog/deregister
    

    The node was removed but then reappeared within 30-60s. As @slackpad recommended, I tried deregistering with:

    user@dev-consul $ curl -v http://localhost:8500/v1/agent/service/deregister/discussion_8080
    user@dev-consul $ curl -v -X PUT -d'{"CheckID": "service:discussion_8080", "ServiceID": "discussion_8080", "Node": "dev-consul", "Datacenter": "dev"}' http://localhost:8500/v1/catalog/deregister
    

    Both commands returned status 200 OK, but that also failed. You can see the output in this gist, as well as the debug logs from consul.

    From the debug logs in consul we see:

    Aug 20 16:57:45 dev-broker consul[2221]: agent: Deregistered service 'discussion_8080'
    Aug 20 16:57:45 dev-broker consul[2221]: agent: Check 'service:discussion_8080' in sync
    Aug 20 16:57:45 dev-broker consul[2221]: agent: Deregistered check 'service:discussion_8080'
    Aug 20 16:57:45 dev-broker consul[2221]: http: Request /v1/agent/service/deregister/discussion_8080 (19.73968ms)
    Aug 20 16:57:46 dev-broker consul[2221]: http: Request /v1/agent/check/pass/service:discussion_8080, error: CheckID does not have associated TTL
    Aug 20 16:57:46 dev-broker consul[2221]: http: Request /v1/agent/check/pass/service:discussion_8080 (246.298µs)
    Aug 20 16:57:47 dev-broker consul[2221]: agent: Synced service 'discussion_8080' <--- SHADY!
    

    The annotation is from @slackpad.

    It's also noteworthy that the orphans are always associated with one of the master nodes (e.g. dev-consul) and not the node (dev-mesos) that's running the service that was registered. I should also mention (it could be a coincidence) that the service (discussion) is also flapping, though from what I can tell from the consul debug logs on dev-mesos, everything is fine.

    Our consul version:

    $ consul version
    Consul v0.5.2
    Consul Protocol: 2 (Understands back to: 1)
    

    Thanks!

  • Consul/raft consumes a lot of disk space

    Consul/raft consumes a lot of disk space

    Hello.

    We are using consul 0.5.0, and on one server (we have 3 servers) the consul data folder is really heavy:

    # du -sh *
    4.0K    checkpoint-signature
    3.2G    raft
    24K serf
    1.8G    tmp
    0   ui
    # du -sh raft/*
    3.2G    raft/mdb
    4.0K    raft/peers.json
    552K    raft/snapshots
    

    If I run strings on the db file, I always see this pattern:

    Token
    Index
    Term
    Type
    Data
    Datacenter
    DirEnt
    CreateIndex
    Flags
    5consul-alerts/checks/wax-prod.worker-1/_/memory_usage
    LockIndex
    ModifyIndex
    Session
    Value
    o{"Current":"passing","CurrentTimestamp":"2015-03-19T07:16:19.750627946Z","Pending":"","PendingTimestamp":"0001-01-01T00:00:00Z","HealthCheck":{"Node":"wax-prod.worker-1","CheckID":"memory_usage","Name":"memory_usage","Status":"passing","Notes":"","Output":"MEM OK - usage system memory: 17% (free: 1405 MB)\n","ServiceID":"","ServiceName":""},"ForNotification":false}
    

    It may be useful for you to know that we are using consul-alerts too.

  • Consul should handle nodes changing IP addresses

    Consul should handle nodes changing IP addresses

    Thought this was captured but couldn't find an existing issue for this. Here's a discussion - https://groups.google.com/d/msgid/consul-tool/623398ba-1dee-4851-85a2-221ff539c355%40googlegroups.com?utm_medium=email&utm_source=footer. For servers we'd also need to address https://github.com/hashicorp/consul/issues/457.

    We are going to close other IP-related issues against this one to keep everything together. The Raft side should support this once you get to Raft protocol version 3, but we need to do testing and will likely have to burn down some small issues to complete this.

  • Agent will not start on machines without a private IP since it won't bind to any available IP.

    Agent will not start on machines without a private IP since it won't bind to any available IP.

    The Consul agent won't start on our machines (which by default only have a public IP assigned; they are firewalled), since it won't bind to non-private IPs by default.

    A command-line option to override the "only bind to private IPs by default" behaviour would help a lot. This option should change Consul's current IP-related filters to allow any assigned IP to be used automatically.
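
    For reference, and hedged in that it reflects current agent documentation rather than the release this report was filed against: the agent accepts an explicit bind address, and later releases also accept go-sockaddr templates, so a machine with only a public address can be pointed at it directly instead of relying on private-IP auto-detection (the IP below is a documentation placeholder):

        # Explicit bind address:
        consul agent -config-dir /etc/consul -bind 203.0.113.10

        # Or, on releases with sockaddr template support, select the public address automatically:
        consul agent -config-dir /etc/consul -bind '{{ GetPublicIP }}'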

  • Cluster becomes unresponsive and does not elect new leader after disk latency spike on leader

    Cluster becomes unresponsive and does not elect new leader after disk latency spike on leader

    Description of the Issue (and unexpected/desired result)

    Twice in one week we've had a situation where the leader node's VM experienced high CPU iowait levels for a few (~3) minutes, and disk latencies of 800+ milliseconds. This seems to lead to writes to the log failing and getting retried indefinitely, even after disk access times are back to normal. During this time, it spews lines to the log like consul.kvs: Apply failed: timed out enqueuing operation (and also for consul.session). For some reason, this does not trigger a leader election.

    It appears incoming connections are enqueued until all file descriptors are consumed. Restarting the Consul service seems to be the only way to recover. The non-leader servers also run out of file descriptors at about the same time.

    So, some questions here: Why does this not trigger a leader election already when the timeouts start happening? And why can Consul not recover after a few minutes of high disk latency?

    (Regarding the reason for the iowait: we're hosted in a public cloud, and according to the provider there is a possibility for other tenants to consume high amounts of IO when booting new VMs. They are working on throttling this in a good way, but regardless, Consul should handle this better.)
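
    One tuning surface that exists in this version range, sketched with the caveat that it changes election sensitivity rather than fixing stalled disk writes: performance.raft_multiplier scales the raft election and heartbeat timeouts, and lowering it from the default of 5 makes followers call an election sooner when the leader stops responding.

        {
          "performance": {
            "raft_multiplier": 1
          }
        }

    That said, it would not help if the leader keeps answering heartbeats while its apply path is blocked on disk, which may be what is happening here.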

    consul version

    Server: v0.8.2

    consul info

    Server:

    agent:
            check_monitors = 2
            check_ttls = 1
            checks = 13
            services = 14
    build:
            prerelease =
            revision = 6017484
            version = 0.8.2
    consul:
            bootstrap = false
            known_datacenters = 2
            leader = false
            leader_addr = 192.168.123.116:8300
            server = true
    raft:
            applied_index = 249572007
            commit_index = 249572007
            fsm_pending = 0
            last_contact = 39.594µs
            last_log_index = 249572008
            last_log_term = 6062
            last_snapshot_index = 249567365
            last_snapshot_term = 6062
            latest_configuration = [{Suffrage:Voter ID:192.168.123.118:8300 Address:192.168.123.118:8300} {Suffrage:Voter ID:192.168.123.116:8300 Address:192.168.123.116:8300} {Suffrage:Voter ID:192.168.123.154:8300 Address:192.168.123.154:8300}]
            latest_configuration_index = 45140126
            num_peers = 2
            protocol_version = 2
            protocol_version_max = 3
            protocol_version_min = 0
            snapshot_version_max = 1
            snapshot_version_min = 0
            state = Follower
            term = 6062
    runtime:
            arch = amd64
            cpu_count = 4
            goroutines = 446
            max_procs = 4
            os = linux
            version = go1.8.1
    serf_lan:
            encrypted = true
            event_queue = 0
            event_time = 3818
            failed = 0
            health_score = 0
            intent_queue = 0
            left = 0
            member_time = 2318
            members = 54
            query_queue = 0
            query_time = 69
    serf_wan:
            encrypted = true
            event_queue = 0
            event_time = 1
            failed = 0
            health_score = 0
            intent_queue = 0
            left = 0
            member_time = 232
            members = 6
            query_queue = 0
            query_time = 28
    

    (This is from after the server restart, so I'm not sure it is of much use.)

    Operating system and Environment details

    Centos 7.2, Openstack VM

  • Duplicate Node IDs after upgrading to 1.2.3

    Duplicate Node IDs after upgrading to 1.2.3

    After upgrading to 1.2.3, hosts begin reporting the following when registering with the catalog. The node name of the host making the new registration and the node name in the conflict always match.

    [ERR] consul: "Catalog.Register" RPC failed to server xxx.xxx.xxx.xxx:8300: rpc error making call: failed inserting node: Error while renaming Node ID: "4833aa15-8428-1d1a-46d8-9dba157dbc60": Node name xxx.xxx.xxx is reserved by node c5dc5b48-f105-79f0-7910-3de6629fddd0 with name xxx.xxx.xxx

    Restarting consul on the host seems to resolve the issue, at least temporarily. I was able to reproduce this in a mixed environment consisting of CentOS hosts ranging from major release 5 through 7, with most hosts on CentOS 7.

  • Servers can't agree on cluster leader after restart when gossiping on WAN

    Servers can't agree on cluster leader after restart when gossiping on WAN

    Hi,

    I'm running consul in an all-"WAN" environment, one DC. All my boxes are in the same rack, but they do not have a private LAN to gossip over.

    The first time they join each other, with an empty /opt/consul directory, they manage to join and agree on a leader.

    If I restart the cluster, they still connect and find each other, but they never seem to agree on a leader:

    Node            Address              Status  Type    Build  Protocol
    consul01        195.1xx.35.xx1:8301  alive   server  0.4.1  2
    consul02        195.1xx.35.xx2:8301  alive   server  0.4.1  2
    consul03        195.1xx.35.xx3:8301  alive   server  0.4.1  2
    

    They just keep repeating 2014/11/05 13:09:41 [ERR] agent: failed to sync remote state: No cluster leader in the consul monitor output.

    All nodes are started with /usr/local/bin/consul agent -config-dir /etc/consul

    server 1

    {
      "advertise_addr": "195.1xx.35.xx1",
      "bind_addr": "195.1xx.35.xx1",
      "bootstrap_expect": 3,
      "client_addr": "0.0.0.0",
      "data_dir": "/opt/consul",
      "datacenter": "online",
      "domain": "consul",
      "log_level": "INFO",
      "ports": {
        "dns": 53
      },
      "recursor": "8.8.8.8",
      "rejoin_after_leave": true,
      "retry_join": [
        "195.1xx.35.xx1",
        "195.1xx.35.xx2",
        "195.1xx.35.xx3"
      ],
      "server": true,
      "start_join": [
        "195.1xx.35.xx1",
        "195.1xx.35.xx2",
        "195.1xx.35.xx3"
      ],
      "ui_dir": "/opt/consul/ui"
    }
    

    server 2

    {
      "advertise_addr": "195.1xx.35.xx2",
      "bind_addr": "195.1xx.35.xx2",
      "bootstrap_expect": 3,
      "client_addr": "0.0.0.0",
      "data_dir": "/opt/consul",
      "datacenter": "online",
      "domain": "consul",
      "log_level": "INFO",
      "ports": {
        "dns": 53
      },
      "recursor": "8.8.8.8",
      "rejoin_after_leave": true,
      "retry_join": [
        "195.1xx.35.xx1",
        "195.1xx.35.xx2",
        "195.1xx.35.xx3"
      ],
      "server": true,
      "start_join": [
        "195.1xx.35.xx1",
        "195.1xx.35.xx2",
        "195.1xx.35.xx3"
      ],
      "ui_dir": "/opt/consul/ui"
    }
    

    server 3

    {
      "advertise_addr": "195.1xx.35.xx3",
      "bind_addr": "195.1xx.35.xx3",
      "bootstrap_expect": 3,
      "client_addr": "0.0.0.0",
      "data_dir": "/opt/consul",
      "datacenter": "online",
      "domain": "consul",
      "log_level": "INFO",
      "ports": {
        "dns": 53
      },
      "recursor": "8.8.8.8",
      "rejoin_after_leave": true,
      "retry_join": [
        "195.1xx.35.xx1",
        "195.1xx.35.xx2",
        "195.1xx.35.xx3"
      ],
      "server": true,
      "start_join": [
        "195.1xx.35.xx1",
        "195.1xx.35.xx2",
        "195.1xx.35.xx3"
      ],
      "ui_dir": "/opt/consul/ui"
    }
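
    A hedged observation on the restart behaviour described above, assuming the servers were stopped in a way that triggered a graceful leave: a server that leaves gracefully removes itself from the raft peer set, and once all three have done so there may be no quorum left to elect from on the next start. Configuration keys exist (in releases that support them) to keep servers from leaving the peer set on shutdown:

        {
          "skip_leave_on_interrupt": true,
          "leave_on_terminate": false
        }

    If the peer set has already been emptied, the recovery route for this era is the outage-recovery procedure: stop all servers, write the full list of server addresses to raft/peers.json, and start them again.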
    
  • [Performance on large clusters] Performance degrades on health blocking queries to more than 682 instances

    [Performance on large clusters] Performance degrades on health blocking queries to more than 682 instances

    Overview of the Issue

    We had two incidents of high load on Consul, causing high write and read latency as well as a full Consul outage. After investigation we noticed that blocking queries against our two biggest services were hitting the hardcoded watchLimit value in https://github.com/hashicorp/consul/blob/master/agent/consul/state/state_store.go#L62. watchLimit is currently set to 2048, and since 3 channels are added per instance in this loop: https://github.com/hashicorp/consul/blob/master/agent/consul/state/catalog.go#L1722, a watched service can only have roughly 2048 / 3 ≈ 682 instances before hitting the limit.

    /v1/health/service/:service_name adds 3 watches per instance so the limit is ~682. /v1/catalog/service/:service_name only adds 1 watch per instance so its limit is 2048.

    We confirmed this theory by cutting down one of these big services to 681 instances. This greatly reduced the load on the Consul servers and restored the DC back to normal:

    image

    We later rolled out a version of Consul with this limit raised to 8192. This showed a big improvement in both latency and load, while roughly halving memory usage. We then scaled the service we had cut down back to its original 711 instances with minimal impact on the cluster:

    image

    Here is a profile taken during this incident:

    image

    The hot path is in blocking query resolution. With the less fine-grained watch, Store.ServiceNodes() is called on virtually every change happening in the cluster, generating the load seen above.

    Based on this information, crossing this limit even slightly greatly degrades Consul server performance (our biggest service has 1780 instances).

    We will soon provide a PR that allows configuring this limit.

    Reproduction Steps

    Register a service with more than 682 instances. Generate some load and run /health blocking queries against this service with many clients, in our case around 2K.
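
    For anyone reproducing this, a minimal sketch of the blocking-query pattern involved (the service name, index, and wait values are illustrative): each watcher long-polls the health endpoint with the index returned by its previous response, so every instance-level change fans out to all watchers at once.

        # The first call returns the current X-Consul-Index in the response headers.
        curl -s -o /dev/null -D - "http://localhost:8500/v1/health/service/big-service?passing" | grep -i x-consul-index

        # Subsequent calls block until that index changes (or the wait time elapses).
        curl -s "http://localhost:8500/v1/health/service/big-service?passing&index=12345&wait=5m" > /dev/null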

    Consul info for both Client and Server

    Server and client v1.2.3 with patches.

    Operating system and Environment details

    Windows and Linux.

  • grpc: switch servers and retry on error

    grpc: switch servers and retry on error

    Description

    This is the OSS portion of enterprise PR 3822; it has been reviewed thoroughly there.

    It adds a custom gRPC balancer that replicates the router's server cycling behavior. It also enables automatic retries for RESOURCE_EXHAUSTED errors, which we now get for free.

    Testing & Reproduction steps

    The balancer package has unit tests that spin up real gRPC servers and clients.

    Also manually tested in enterprise by:

    • Hacking the partition read endpoint to return RESOURCE_EXHAUSTED if req.Name != NodeName
    • Running two servers and a client agent
    • Running consul partition read server1 and consul partition read server2
    • Watching the agent logs and observing:
    2022-12-09T12:15:45.850Z [TRACE] agent.grpc.balancer: witnessed RPC error: target=consul://dc1.7105466e-b7d7-8240-a500-88a14461aa49/server.dc1 server=dc1-127.0.0.1:9102 error="rpc error: code = ResourceExhausted desc = you got rate limited, my dude"
    2022-12-09T12:15:45.850Z [DEBUG] agent.grpc.balancer: switching server: target=consul://dc1.7105466e-b7d7-8240-a500-88a14461aa49/server.dc1 from=dc1-127.0.0.1:9102 to=dc1-127.0.0.1:9101
    2022-12-09T12:15:45.850Z [TRACE] agent.grpc.balancer: sub-connection state changed: target=consul://dc1.7105466e-b7d7-8240-a500-88a14461aa49/server.dc1 server=dc1-127.0.0.1:9101 state=CONNECTING
    2022-12-09T12:15:45.851Z [TRACE] agent.grpc.balancer: sub-connection state changed: target=consul://dc1.7105466e-b7d7-8240-a500-88a14461aa49/server.dc1 server=dc1-127.0.0.1:9101 state=READY
    2022-12-09T12:16:12.049Z [ERROR] agent.http: Request error: method=GET url=/v1/partition/foo from=127.0.0.1:62096 error="Partition not found for \"server2\""
    

    Note that the balancer automatically switched connections and retried against the other server 🙌🏻

  • emit metrics for global rate limiting

    emit metrics for global rate limiting

    Description

    Emits metrics for the global rate limiting implementation.

    Testing & Reproduction steps

    • TODO: unit tests
    • Manually exceeded the rate limit for /v1/catalog/nodes via curl and then queried the /v1/agent/metrics endpoint (see the sketch after the sample output below):
            {
                "Name": "consul.consul.rate_limit",
                "Count": 27,
                "Rate": 2.7,
                "Sum": 27,
                "Min": 1,
                "Max": 1,
                "Mean": 1,
                "Stddev": 0,
                "Labels": {
                    "limit_type": "global/read",
                    "mode": "enforcing",
                    "op": "Catalog.ListNodes"
                }
            },
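
    A small sketch of that manual check (the endpoints are from the HTTP API; the loop count and jq filter are illustrative, and jq is assumed to be available):

        # Hammer a read endpoint until the global read limit kicks in...
        for i in $(seq 1 500); do curl -s -o /dev/null http://localhost:8500/v1/catalog/nodes; done

        # ...then look for the rate_limit counter in the agent's telemetry output.
        curl -s http://localhost:8500/v1/agent/metrics | jq '.Counters[] | select(.Name | test("rate_limit"))'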
    

    PR Checklist

    • [ ] updated test coverage
    • [ ] external facing docs updated
    • [ ] not a security concern
  • docs: Consul at scale guide

    docs: Consul at scale guide

    Description

    A guide to deploying Consul at scale was drafted for inclusion in our docs. This PR stages the guide in the "Architecture" section, as the guide is primarily concerned with how deployments at scale have performance impacts as a result of the architectural design, and provides architectural recommendations.

    Please note two unresolved questions from the drafting process. @jkirschner-hashicorp: could you weigh in on my two comments with unresolved questions before we merge?

    Links

    PR Checklist

    • [ ] updated test coverage
    • [X] external facing docs updated
    • [X] not a security concern
  • chore(deps-dev): bump husky from 4.3.8 to 8.0.3 in /website

    chore(deps-dev): bump husky from 4.3.8 to 8.0.3 in /website

    Bumps husky from 4.3.8 to 8.0.3.

    Release notes

    Sourced from husky's releases.

    v8.0.3

    • fix: add git not installed message #1208

    v8.0.2

    • docs: remove deprecated npm set-script

    v8.0.1

    • fix: use POSIX equality operator

    v8.0.0

    What's Changed

    Feats

    • feat: add husky - prefix to logged global error messages by @joshbalfour in typicode/husky#1092
    • feat: show PATH when command not found to improve debuggability
    • feat: drop Node 12 support
    • feat: skip install if $HUSKY=0

    Fixes

    Docs

    Chore

    v7.0.4

    No changes. Husky v7.0.3 was reverted, this version is the same as v7.0.2.

    v7.0.2

    Fix pre-commit hook in WebStorm (#1023)

    v7.0.1

    • Fix gracefully fail if Git command is not found #1003 (same as in v6)

    v7.0.0

    • Improve .husky/ directory structure. .husky/.gitignore is now unnecessary and can be removed.
    • Improve error output (shorter)
    • Update husky-init CLI
    • Update husky-4-to-7 CLI
    • Drop Node 10 support

    ... (truncated)

    Commits


    Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.


    Dependabot commands and options

    You can trigger Dependabot actions by commenting on this PR:

    • @dependabot rebase will rebase this PR
    • @dependabot recreate will recreate this PR, overwriting any edits that have been made to it
    • @dependabot merge will merge this PR after your CI passes on it
    • @dependabot squash and merge will squash and merge this PR after your CI passes on it
    • @dependabot cancel merge will cancel a previously requested merge and block automerging
    • @dependabot reopen will reopen this PR if it is closed
    • @dependabot close will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
    • @dependabot ignore this major version will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this minor version will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this dependency will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)
  • docs: fix broken links

    docs: fix broken links

    Change broken Consul tutorial links

    Description

    Fix broken Consul tutorial links.

    PR Checklist

    • [ ] updated test coverage
    • [x] external facing docs updated
    • [x] not a security concern
  • Refactoring the peering integ test to accommodate coming changes of o…

    Refactoring the peering integ test to accommodate coming changes of o…

    Description

    Refactoring the peering test in the container tests to accommodate coming changes for other upgrade scenarios.

    • Add a utils package under consul-containers/test that contains methods to set up various test scenarios. For example, BasicPeeringTwoClustersSetup is used by the peering test.
    • Deduplication: have a single CreatingPeeringClusterAndSetup replace CreatingAcceptingClusterAndSetup and CreateDialingClusterAndSetup.
    • Separate peering cluster creation from service registration. Previously, CreatingAcceptingClusterAndSetup and CreateDialingClusterAndSetup each created a cluster and registered a service. The updated code runs these steps sequentially in BasicPeeringTwoClustersSetup.

    Testing & Reproduction steps

    Links

    PR Checklist

    • [ ] updated test coverage
    • [ ] external facing docs updated
    • [x] not a security concern
Distributed reliable key-value store for the most critical data of a distributed system

etcd Note: The master branch may be in an unstable or even broken state during development. Please use releases instead of the master branch in order

Dec 28, 2022
HA LDAP based key/value solution for projects configuration storing with multi master replication support

Recon is the simple solution for storing configs of your application. There are no specified instruments, no specified data protocols. For the full power of Recon you only need curl.

Jun 15, 2022
Distributed cache and in-memory key/value data store. It can be used both as an embedded Go library and as a language-independent service.

Olric Distributed cache and in-memory key/value data store. It can be used both as an embedded Go library and as a language-independent service. With

Jan 4, 2023
GhostDB is a distributed, in-memory, general purpose key-value data store that delivers microsecond performance at any scale.
GhostDB is a distributed, in-memory, general purpose key-value data store that delivers microsecond performance at any scale.

GhostDB is designed to speed up dynamic database or API driven websites by storing data in RAM in order to reduce the number of times an external data source such as a database or API must be read. GhostDB provides a very large hash table that is distributed across multiple machines and stores large numbers of key-value pairs within the hash table.

Jan 6, 2023
An in-memory key:value store/cache (similar to Memcached) library for Go, suitable for single-machine applications.

go-cache go-cache is an in-memory key:value store/cache similar to memcached that is suitable for applications running on a single machine. Its major

Jan 3, 2023
Distributed disk storage database based on Raft and Redis protocol.
Distributed disk storage database based on Raft and Redis protocol.

IceFireDB Distributed disk storage system based on Raft and RESP protocol. High performance Distributed consistency Reliable LSM disk storage Cold and

Dec 27, 2022
Distributed, fault-tolerant key-value storage written in go.
Distributed, fault-tolerant key-value storage written in go.

A simple, distributed, fault-tolerant key-value storage inspired by Redis. It uses the Raft protocol as its consensus algorithm. It supports the following data structures: String, Bitmap, Map, List.

Jan 3, 2023
Distributed key-value store
Distributed key-value store

Keva Distributed key-value store General Demo Start the server docker-compose up --build Insert data curl -XPOST http://localhost:5555/storage/test1

Nov 15, 2021
rgkv is a distributed kv storage service using raft consensus algorithm.
rgkv is a distributed kv storage service using raft consensus algorithm.

rgkv rgkv is a distributed kv storage service using raft consensus algorithm. Get/put/append operation High Availability Sharding Linearizability Tabl

Jan 15, 2022
A simple distributed kv system from scratch

SimpleKV A simple distributed key-value storage system based on bitcask from scratch. Target Here are some basic requirements: LRU Cache. An index sys

Apr 21, 2022
Simple Distributed key-value database (in-memory/disk) written with Golang.

Kallbaz DB Simple Distributed key-value store (in-memory/disk) written with Golang. Installation go get github.com/msam1r/kallbaz-db Usage API // Get

Jan 18, 2022
Implementation of distributed key-value system based on TiKV

Distributed_key-value_system A naive implementation of distributed key-value system based on TiKV Features Features of this work are listed below: Dis

Mar 7, 2022
Kdmq - Tool to query KDM data for a given Rancher version

kdmq (kdm query) Tool to query KDM data for a given Rancher version, think of: W

Feb 1, 2022
A key-value db api with multiple storage engines and key generation
A key-value db api with multiple storage engines and key generation

Jet is a deadly-simple key-value api. The main goals of this project are : Making a simple KV tool for our other projects. Learn tests writing and git

Apr 5, 2022
CrankDB is an ultra fast and very lightweight Key Value based Document Store.

CrankDB is an ultra fast, extremely lightweight Key Value based Document Store.

Apr 12, 2022
NutsDB a simple, fast, embeddable and persistent key/value store written in pure Go.
NutsDB a simple, fast, embeddable and persistent key/value store written in pure Go.

A simple, fast, embeddable, persistent key/value store written in pure Go. It supports fully serializable transactions and many data structures such as list, set, sorted set.

Jan 9, 2023
rosedb is a fast, stable and embedded key-value (k-v) storage engine based on bitcask.
rosedb is a fast, stable and embedded key-value (k-v) storage engine based on bitcask.

rosedb is a fast, stable and embedded key-value (k-v) storage engine based on bitcask. Its on-disk files are organized as WAL(Write Ahead Log) in LSM trees, optimizing for write throughput.

Dec 28, 2022
KV - a toy in-memory key value store built primarily in an effort to write more go and check out grpc

KV KV is a toy in-memory key value store built primarily in an effort to write more go and check out grpc. This is still a work in progress. // downlo

Dec 30, 2021
The Consul API Gateway is a dedicated ingress solution for intelligently routing traffic to applications running on a Consul Service Mesh.

The Consul API Gateway is a dedicated ingress solution for intelligently routing traffic to applications running on a Consul Service Mesh.

Dec 14, 2022
BlobStore is a highly reliable, highly available and ultra-large scale distributed storage system

BlobStore Overview Documents Build BlobStore Deploy BlobStore Manage BlobStore License Overview BlobStore is a highly reliable,highly available and ul

Oct 10, 2022