Consul is a distributed, highly available, and data center aware solution to connect and configure applications across dynamic, distributed infrastructure.

Consul provides several key features:

  • Multi-Datacenter - Consul is built to be datacenter aware, and can support any number of regions without complex configuration.

  • Service Mesh/Service Segmentation - Consul Connect enables secure service-to-service communication with automatic TLS encryption and identity-based authorization. Applications can use sidecar proxies in a service mesh configuration to establish TLS connections for inbound and outbound connections without being aware of Connect at all.

  • Service Discovery - Consul makes it simple for services to register themselves and to discover other services via a DNS or HTTP interface. External services such as SaaS providers can be registered as well.

  • Health Checking - Health Checking enables Consul to quickly alert operators about any issues in a cluster. The integration with service discovery prevents routing traffic to unhealthy hosts and enables service level circuit breakers.

  • Key/Value Storage - A flexible key/value store enables storing dynamic configuration, feature flagging, coordination, leader election and more. The simple HTTP API makes it easy to use anywhere.

Consul runs on Linux, macOS, FreeBSD, Solaris, and Windows and includes an optional browser based UI. A commercial version called Consul Enterprise is also available.

Please note: We take Consul's security and our users' trust very seriously. If you believe you have found a security issue in Consul, please responsibly disclose by contacting us at [email protected].

Quick Start

A few quick start guides are available on the Consul website:

Documentation

Full, comprehensive documentation is available on the Consul website:

https://www.consul.io/docs

Contributing

Thank you for your interest in contributing! Please refer to CONTRIBUTING.md for guidance. For contributions specifically to the browser based UI, please refer to the UI's README.md for guidance.

Owner
HashiCorp
Consistent workflows to provision, secure, connect, and run any infrastructure for any application.
Comments
  • Node health flapping - EC2

    Node health flapping - EC2

    We have a five-node Consul cluster handling roughly 30 nodes across 4 different AWS accounts in a shared VPC, spread across different availability zones. For the most part, everything works great. However, quite frequently a random node will flap from healthy to critical. The flapping happens on completely random nodes, with no consistency whatsoever.

    Every time a node "flaps", it causes consul-template, which populates our NGINX reverse-proxy config, to reload. This causes things like our Apache benchmark tests to fail.

    We are looking to use Consul in production, but this issue has caused a lot of people to worry about its consistency.

    We also have all of the required TCP/UDP ports open between all the nodes.

    We believe the issue is just a latency problem with serf's polling. Is there a way to modify the serf health-check interval to adjust for geographical latency? (See the note after the log below.)

    Here's the log from one of the Consul servers:

        2015/09/01 17:46:13 [INFO] serf: EventMemberJoin: ip-10-170-15-71 10.170.15.71
        2015/09/01 17:46:13 [INFO] memberlist: Marking ip-10-190-71-44 as failed, suspect timeout reached
        2015/09/01 17:46:13 [INFO] serf: EventMemberFailed: ip-10-190-71-44 10.190.71.44
        2015/09/01 17:46:15 [INFO] memberlist: Marking ip-10-190-82-4 as failed, suspect timeout reached
        2015/09/01 17:46:15 [INFO] serf: EventMemberFailed: ip-10-190-82-4 10.190.82.4
        2015/09/01 17:46:16 [INFO] serf: EventMemberJoin: ip-10-190-82-4 10.190.82.4
        2015/09/01 17:46:26 [INFO] serf: EventMemberFailed: ip-10-170-15-71 10.170.15.71
        2015/09/01 17:46:32 [INFO] serf: EventMemberJoin: ip-10-190-71-44 10.190.71.44
        2015/09/01 17:47:05 [INFO] serf: EventMemberFailed: ip-10-170-138-228 10.170.138.228
        2015/09/01 17:47:19 [INFO] memberlist: Marking ip-10-170-155-168 as failed, suspect timeout reached
        2015/09/01 17:47:19 [INFO] serf: EventMemberFailed: ip-10-170-155-168 10.170.155.168
        2015/09/01 17:47:23 [INFO] serf: EventMemberJoin: ip-10-170-15-71 10.170.15.71
        2015/09/01 17:47:44 [INFO] memberlist: Marking ip-10-190-71-44 as failed, suspect timeout reached
        2015/09/01 17:47:44 [INFO] serf: EventMemberFailed: ip-10-190-71-44 10.190.71.44
        2015/09/01 17:47:45 [INFO] serf: EventMemberJoin: ip-10-170-155-168 10.170.155.168
        2015/09/01 17:47:45 [INFO] serf: EventMemberJoin: ip-10-190-71-44 10.190.71.44
        2015/09/01 17:47:49 [INFO] memberlist: Marking ip-10-170-76-170 as failed, suspect timeout reached
        2015/09/01 17:47:49 [INFO] serf: EventMemberFailed: ip-10-170-76-170 10.170.76.170
        2015/09/01 17:47:50 [INFO] serf: EventMemberJoin: ip-10-170-76-170 10.170.76.170
        2015/09/01 17:47:50 [INFO] serf: EventMemberJoin: ip-10-170-138-228 10.170.138.228
        2015/09/01 17:48:00 [INFO] memberlist: Marking ip-10-170-155-168 as failed, suspect timeout reached
        2015/09/01 17:48:00 [INFO] serf: EventMemberFailed: ip-10-170-155-168 10.170.155.168
        2015/09/01 17:48:02 [INFO] serf: EventMemberFailed: ip-10-185-23-211 10.185.23.211
        2015/09/01 17:48:16 [INFO] serf: EventMemberJoin: ip-10-185-23-211 10.185.23.211
        2015/09/01 17:48:32 [INFO] memberlist: Marking ip-10-170-15-71 as failed, suspect timeout reached
        2015/09/01 17:48:32 [INFO] serf: EventMemberFailed: ip-10-170-15-71 10.170.15.71
        2015/09/01 17:48:33 [INFO] serf: EventMemberJoin: ip-10-170-15-71 10.170.15.71
        2015/09/01 17:48:45 [INFO] serf: EventMemberFailed: ip-10-185-23-210 10.185.23.210
        2015/09/01 17:48:46 [INFO] serf: EventMemberJoin: ip-10-185-23-210 10.185.23.210
        2015/09/01 17:48:55 [INFO] serf: EventMemberJoin: ip-10-170-155-168 10.170.155.168
        2015/09/01 17:49:00 [INFO] memberlist: Marking ip-10-190-82-4 as failed, suspect timeout reached
        2015/09/01 17:49:00 [INFO] serf: EventMemberFailed: ip-10-190-82-4 10.190.82.4
        2015/09/01 17:49:20 [INFO] serf: EventMemberJoin: ip-10-190-82-4 10.190.82.4
        2015/09/01 17:49:32 [INFO] memberlist: Marking ip-10-190-82-4 as failed, suspect timeout reached
        2015/09/01 17:49:32 [INFO] serf: EventMemberFailed: ip-10-190-82-4 10.190.82.4
        2015/09/01 17:49:38 [INFO] serf: EventMemberFailed: ip-10-170-155-168 10.170.155.168
        2015/09/01 17:49:38 [INFO] serf: EventMemberJoin: ip-10-170-155-168 10.170.155.168
        2015/09/01 17:49:40 [INFO] serf: EventMemberJoin: ip-10-190-82-4 10.190.82.4
        2015/09/01 17:49:51 [INFO] memberlist: Marking ip-10-190-82-4 as failed, suspect timeout reached
        2015/09/01 17:49:51 [INFO] serf: EventMemberFailed: ip-10-190-82-4 10.190.82.4
        2015/09/01 17:49:52 [INFO] serf: EventMemberJoin: ip-10-190-82-4 10.190.82.4
        2015/09/01 17:49:56 [INFO] serf: EventMemberFailed: ip-10-185-15-217 10.185.15.217
        2015/09/01 17:49:56 [INFO] serf: EventMemberJoin: ip-10-185-15-217 10.185.15.217
        2015/09/01 17:50:04 [INFO] memberlist: Marking ip-10-190-13-188 as failed, suspect timeout reached
        2015/09/01 17:50:04 [INFO] serf: EventMemberFailed: ip-10-190-13-188 10.190.13.188
        2015/09/01 17:50:05 [INFO] serf: EventMemberJoin: ip-10-190-13-188 10.190.13.188
        2015/09/01 17:50:20 [INFO] serf: EventMemberFailed: ip-10-185-77-94 10.185.77.94
        2015/09/01 17:50:24 [INFO] memberlist: Marking ip-10-185-65-7 as failed, suspect timeout reached
        2015/09/01 17:50:24 [INFO] serf: EventMemberFailed: ip-10-185-65-7 10.185.65.7
        2015/09/01 17:50:31 [INFO] serf: EventMemberJoin: ip-10-185-77-94 10.185.77.94
        2015/09/01 17:50:47 [INFO] serf: EventMemberJoin: ip-10-185-65-7 10.185.65.7
        2015/09/01 17:51:01 [INFO] serf: EventMemberFailed: ip-10-170-15-71 10.170.15.71
        2015/09/01 17:51:02 [INFO] serf: EventMemberJoin: ip-10-170-15-71 10.170.15.71
        2015/09/01 17:51:09 [INFO] memberlist: Marking ip-10-190-82-4 as failed, suspect timeout reached
        2015/09/01 17:51:09 [INFO] serf: EventMemberFailed: ip-10-190-82-4 10.190.82.4
        2015/09/01 17:51:43 [INFO] serf: EventMemberJoin: ip-10-190-82-4 10.190.82.4
        2015/09/01 17:51:45 [INFO] memberlist: Marking ip-10-170-15-71 as failed, suspect timeout reached
        2015/09/01 17:51:45 [INFO] serf: EventMemberFailed: ip-10-170-15-71 10.170.15.71
        2015/09/01 17:51:45 [INFO] serf: EventMemberJoin: ip-10-170-15-71 10.170.15.71
        2015/09/01 17:52:22 [INFO] memberlist: Marking ip-10-190-82-4 as failed, suspect timeout reached
        2015/09/01 17:52:22 [INFO] serf: EventMemberFailed: ip-10-190-82-4 10.190.82.4
        2015/09/01 17:52:30 [INFO] serf: EventMemberJoin: ip-10-190-82-4 10.190.82.4
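
    A note on the tuning question above: the serf/memberlist timings were not user-configurable in agents of this era, but newer Consul releases expose them in the agent configuration. As a rough sketch (assuming a version that supports the gossip_lan block; the key names follow the agent configuration docs, and the values here are illustrative), failure detection can be made more forgiving of network and CPU latency spikes by raising the probe timeout and suspicion multiplier:

        {
          "gossip_lan": {
            "probe_interval": "2s",
            "probe_timeout": "1s",
            "suspicion_mult": 6,
            "retransmit_mult": 4
          }
        }

    Higher values trade slower failure detection for fewer false positives; flapping that correlates with saturated or CPU-starved instances is usually better fixed at the host level.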
    
  • Consul servers won't elect a leader

    Consul servers won't elect a leader

    I have 3 consul servers running (+ a handful of other nodes), and they can all speak to each other - or so I think; at least they're sending UDP messages between themselves. The logs still show [ERR] agent: failed to sync remote state: No cluster leader, so even if the servers know about each other, it looks like they fail to perform an actual leader election... Is there a way to trigger a leader election manually?

    I'm running consul 0.5.2 on all nodes.
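
    A hedged sketch of how this is usually diagnosed on 0.5.x (paths and addresses below are placeholders): check that every server actually sees the expected raft peer set, and if the peer set is wrong after an outage, the documented recovery for pre-0.7 servers is to stop all of them and write the full peer list into raft/peers.json before starting them again.

        # On each server: confirm the raft peer count and state.
        consul info | grep -E 'num_peers|state ='

        # Outage recovery (all servers stopped; data_dir and IPs are placeholders):
        echo '["10.0.0.1:8300","10.0.0.2:8300","10.0.0.3:8300"]' > /var/consul/raft/peers.json

    There is no API to trigger an election directly; elections happen automatically once a quorum of servers agree on the peer set.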

  • Frequent membership loss for 50-100 node cluster in AWS

    Frequent membership loss for 50-100 node cluster in AWS

    The main problem we're seeing is that we have KV data associated with a lock, and hence a session, and the agents associated with those sessions "frequently" lose membership briefly, invalidating the associated KV data. By "frequently", I mean that in a cluster consisting of 3 consul servers and 50-100 other nodes running consul agents, we see isolated incidents of an agent losing membership roughly every 3-4 hours (sometimes ~30 mins apart, sometimes 7-8 hours apart, but it generally seems to be randomly distributed with an "average" of 3-4 hours).

    We have a heterogeneous cluster configured as a LAN, deployed across 3 AZs in EC2. There appears to be no correlation between the type of node and the loss of membership (e.g. if we have 10 instances of service A and 50 of service B, we will see roughly 2 membership losses among A instances and 10 among B instances over a period of a few days). It also appears to be independent of stress on the system, i.e. we see roughly the same distribution of membership loss when the system is idle as we do when it's under heavy load.

    We use consul for service discovery/DNS as well as a KV store. We use the KV store specifically for the purpose of:

    • maintaining presence -- certain nodes maintain the presence of a key with a TTL on it to indicate they are up and running, and when those disappear another node can react and re-schedule the workload that was being performed by that node, and
    • locks -- for multiple instances of the same service to determine which one actually provides the service, in the case where the service needs to be a singleton

    In addition to the occasional member loss, we also have the issue that when we roll the consul server cluster, which triggers a leader election, the KV store is temporarily unavailable. This potentially prevents a presence maintainer from bumping its lock before the TTL expires.

    Questions:

    1. Are there ways to configure things so that membership is more robust/more forgiving? We've seen that WAN has more conservative parameters, e.g. a larger SuspicionMult(iplicationFactor), but switching to WAN feels like it would just be papering over the problem and would only reduce its frequency.
    2. Is there any advice for a better way to implement "presence maintenance"? Because we use locks, and hence sessions, we're sensitive to the lossiness of UDP and the probabilistic nature of gossip, even though the application maintaining its presence in the KV store is running fine. For example, is there a way to establish a lock but side-step the coupling to its agent's membership? (See the sketch at the end of this comment.)
    3. What information would help diagnose this problem? We have plenty of server and agent configuration, consul server logs, consul agent logs, and application logs from services running alongside consul agents, all spanning several days.

    Thanks, Amit + @matt-royal
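
    On question 2, one pattern worth sketching against the documented session API (the name, TTL, and key path below are illustrative): sessions are only coupled to gossip health through the default serfHealth check, so a session created with an empty Checks list and a TTL is invalidated when the TTL lapses or the session is destroyed, not when a brief membership flap marks the agent critical.

        # Create a TTL-only session (no serfHealth check attached).
        curl -s -X PUT -d '{"Name": "presence-example", "TTL": "30s", "Checks": []}' \
            http://localhost:8500/v1/session/create

        # Acquire the lock key with that session, then renew well inside the TTL.
        curl -s -X PUT -d 'payload' "http://localhost:8500/v1/kv/service/example/leader?acquire=<session-id>"
        curl -s -X PUT "http://localhost:8500/v1/session/renew/<session-id>"

    The trade-off is that a crashed holder now keeps the lock until the TTL expires, so the TTL has to be chosen against how quickly failover needs to happen.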

  • Unable to deregister a service

    Unable to deregister a service

    I brought this to the attention of the mailing list here. @slackpad asked me to go ahead and file a bug. Below is a summary of the issue from the discussion thread.

    We have services that are being orphaned and we cannot deregister them. The orphans show up under one or more of the master nodes. In our configuration the master nodes are dev-consul, dev-consul-s1, and dev-broker.

    The health check of the orphaned node looks something like the following:

    {
        "Node": "dev-consul",
        "CheckID": "service:discussion_8080",
        "Name": "Service 'discussion' check",
        "ServiceName": "discussion",
        "Notes": "",
        "Status": "critical",
        "ServiceID": "discussion_8080",
        "Output": ""
    }
    

    I attempted to deregister via:

    user@dev-consul $ curl -X PUT -d '{"CheckID": "service:discussion_8080", "ServiceID": "discussion_8080", "Node": "dev-consul", "Datacenter": "dev"}' http://localhost:8500/v1/catalog/deregister
    

    The node was removed but then reappeared within 30-60s. As @slackpad recommended, I tried deregistering with:

    user@dev-consul $ curl -v http://localhost:8500/v1/agent/service/deregister/discussion_8080
    user@dev-consul $ curl -v -X PUT -d'{"CheckID": "service:discussion_8080", "ServiceID": "discussion_8080", "Node": "dev-consul", "Datacenter": "dev"}' http://localhost:8500/v1/catalog/deregister
    

    Both commands returned status 200 OK, but that also failed. You can see the output in this gist, as well as the debug logs from consul.

    From the debug logs in consul we see:

    Aug 20 16:57:45 dev-broker consul[2221]: agent: Deregistered service 'discussion_8080'
    Aug 20 16:57:45 dev-broker consul[2221]: agent: Check 'service:discussion_8080' in sync
    Aug 20 16:57:45 dev-broker consul[2221]: agent: Deregistered check 'service:discussion_8080'
    Aug 20 16:57:45 dev-broker consul[2221]: http: Request /v1/agent/service/deregister/discussion_8080 (19.73968ms)
    Aug 20 16:57:46 dev-broker consul[2221]: http: Request /v1/agent/check/pass/service:discussion_8080, error: CheckID does not have associated TTL
    Aug 20 16:57:46 dev-broker consul[2221]: http: Request /v1/agent/check/pass/service:discussion_8080 (246.298µs)
    Aug 20 16:57:47 dev-broker consul[2221]: agent: Synced service 'discussion_8080' <--- SHADY!
    

    The annotation is from @slackpad.

    It's also noteworthy that the orphans are always associated with one of the master nodes (e.g. dev-consul) and not the node (dev-mesos) that's running the service that was registered. I should also mention (it could be a coincidence) that the service (discussion) is also flapping, though from what I can tell from the consul debug logs on dev-mesos, everything is fine.

    Our consul version:

    $ consul version
    Consul v0.5.2
    Consul Protocol: 2 (Understands back to: 1)
    

    Thanks!

  • Consul/raft consumes a lot of disk space

    Consul/raft consumes a lot of disk space

    Hello.

    We are using consul 0.5.0, and on one server (we have 3 servers) the consul data folder is really heavy:

    # du -sh *
    4.0K    checkpoint-signature
    3.2G    raft
    24K serf
    1.8G    tmp
    0   ui
    # du -sh raft/*
    3.2G    raft/mdb
    4.0K    raft/peers.json
    552K    raft/snapshots
    

    If I run strings on the db file, I always see this pattern:

    Token
    Index
    Term
    Type
    Data
    Datacenter
    DirEnt
    CreateIndex
    Flags
    5consul-alerts/checks/wax-prod.worker-1/_/memory_usage
    LockIndex
    ModifyIndex
    Session
    Value
    o{"Current":"passing","CurrentTimestamp":"2015-03-19T07:16:19.750627946Z","Pending":"","PendingTimestamp":"0001-01-01T00:00:00Z","HealthCheck":{"Node":"wax-prod.worker-1","CheckID":"memory_usage","Name":"memory_usage","Status":"passing","Notes":"","Output":"MEM OK - usage system memory: 17% (free: 1405 MB)\n","ServiceID":"","ServiceName":""},"ForNotification":false}
    

    It may be useful for you to know that we are using consul-alerts too.

  • Consul should handle nodes changing IP addresses

    Consul should handle nodes changing IP addresses

    Thought this was captured but couldn't find an existing issue for this. Here's a discussion - https://groups.google.com/d/msgid/consul-tool/623398ba-1dee-4851-85a2-221ff539c355%40googlegroups.com?utm_medium=email&utm_source=footer. For servers we'd also need to address https://github.com/hashicorp/consul/issues/457.

    We are going to close other IP-related issues against this one to keep everything together. The Raft side should support this once you get to Raft protocol version 3, but we need to do testing and will likely have to burn down some small issues to complete this.

  • Agent will not start on machines without a private IP since it won't bind to any available IP.

    Agent will not start on machines without a private IP since it won't bind to any available IP.

    The Consul agent won't start on our machines (which by default only have a public IP assigned; they are firewalled), since it won't bind to non-private IPs by default.

    A command-line option to override the "only bind to private IPs by default" behaviour would help a lot. This option should change Consul's current IP-related filters to allow any assigned IP to be used automatically.
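
    For reference, and hedged in that it reflects current agent documentation rather than the release this report was filed against: the agent accepts an explicit bind address, and later releases also accept go-sockaddr templates, so a machine with only a public address can be pointed at it directly instead of relying on private-IP auto-detection (the IP below is a documentation placeholder):

        # Explicit bind address:
        consul agent -config-dir /etc/consul -bind 203.0.113.10

        # Or, on releases with sockaddr template support, select the public address automatically:
        consul agent -config-dir /etc/consul -bind '{{ GetPublicIP }}'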

  • Cluster becomes unresponsive and does not elect new leader after disk latency spike on leader

    Cluster becomes unresponsive and does not elect new leader after disk latency spike on leader

    Description of the Issue (and unexpected/desired result)

    Twice in one week we've had a situation where the leader node's VM experienced high CPU iowait levels for a few (~3) minutes, and disk latencies of 800+ milliseconds. This seems to lead to writes to the log failing and getting retried indefinitely, even after disk access times are back to normal. During this time, it spews lines to the log like consul.kvs: Apply failed: timed out enqueuing operation (and also for consul.session). For some reason, this does not trigger a leader election.

    It appears incoming connections are enqueued until all file descriptors are consumed. Restarting the Consul service seems to be the only way to recover. The non-leader servers also run out of file descriptors at about the same time.

    So, some questions here: Why does this not trigger a leader election already when the timeouts start happening? And why can Consul not recover after a few minutes of high disk latency?

    (Regarding the reason for the iowait: we're hosted in a public cloud, and according to the provider there is a possibility for other tenants to consume high amounts of IO when booting new VMs. They are working on throttling this in a good way, but regardless, Consul should handle this better.)
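
    One tuning surface that exists in this version range, sketched with the caveat that it changes election sensitivity rather than fixing stalled disk writes: performance.raft_multiplier scales the raft election and heartbeat timeouts, and lowering it from the default of 5 makes followers call an election sooner when the leader stops responding.

        {
          "performance": {
            "raft_multiplier": 1
          }
        }

    That said, it would not help if the leader keeps answering heartbeats while its apply path is blocked on disk, which may be what is happening here.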

    consul version

    Server: v0.8.2

    consul info

    Server:

    agent:
            check_monitors = 2
            check_ttls = 1
            checks = 13
            services = 14
    build:
            prerelease =
            revision = 6017484
            version = 0.8.2
    consul:
            bootstrap = false
            known_datacenters = 2
            leader = false
            leader_addr = 192.168.123.116:8300
            server = true
    raft:
            applied_index = 249572007
            commit_index = 249572007
            fsm_pending = 0
            last_contact = 39.594µs
            last_log_index = 249572008
            last_log_term = 6062
            last_snapshot_index = 249567365
            last_snapshot_term = 6062
            latest_configuration = [{Suffrage:Voter ID:192.168.123.118:8300 Address:192.168.123.118:8300} {Suffrage:Voter ID:192.168.123.116:8300 Address:192.168.123.116:8300} {Suffrage:Voter ID:192.168.123.154:8300 Address:192.168.123.154:8300}]
            latest_configuration_index = 45140126
            num_peers = 2
            protocol_version = 2
            protocol_version_max = 3
            protocol_version_min = 0
            snapshot_version_max = 1
            snapshot_version_min = 0
            state = Follower
            term = 6062
    runtime:
            arch = amd64
            cpu_count = 4
            goroutines = 446
            max_procs = 4
            os = linux
            version = go1.8.1
    serf_lan:
            encrypted = true
            event_queue = 0
            event_time = 3818
            failed = 0
            health_score = 0
            intent_queue = 0
            left = 0
            member_time = 2318
            members = 54
            query_queue = 0
            query_time = 69
    serf_wan:
            encrypted = true
            event_queue = 0
            event_time = 1
            failed = 0
            health_score = 0
            intent_queue = 0
            left = 0
            member_time = 232
            members = 6
            query_queue = 0
            query_time = 28
    

    (This is from after the server restart, so I'm not sure it is of much use.)

    Operating system and Environment details

    Centos 7.2, Openstack VM

  • Duplicate Node IDs after upgrading to 1.2.3

    Duplicate Node IDs after upgrading to 1.2.3

    After upgrading to 1.2.3, hosts begin reporting the following when registering with the catalog. The node name of the host making the new registration and the node name in the conflict always match.

    [ERR] consul: "Catalog.Register" RPC failed to server xxx.xxx.xxx.xxx:8300: rpc error making call: failed inserting node: Error while renaming Node ID: "4833aa15-8428-1d1a-46d8-9dba157dbc60": Node name xxx.xxx.xxx is reserved by node c5dc5b48-f105-79f0-7910-3de6629fddd0 with name xxx.xxx.xxx

    Restarting consul on the host seems to resolve the issue, at least temporarily. I was able to reproduce this in a mixed environment consisting of CentOS hosts ranging from major release 5 through 7, with most hosts on CentOS 7.

  • Servers can't agree on cluster leader after restart when gossiping on WAN

    Servers can't agree on cluster leader after restart when gossiping on WAN

    Hi,

    I'm running consul in an all-"WAN" environment, one DC. All my boxes are in the same rack, but they do not have a private LAN to gossip over.

    The first time they join each other, with an empty /opt/consul directory, they manage to join and agree on a leader.

    If I restart the cluster, they still connect and find each other, but they never seem to agree on a leader:

    Node            Address              Status  Type    Build  Protocol
    consul01        195.1xx.35.xx1:8301  alive   server  0.4.1  2
    consul02        195.1xx.35.xx2:8301  alive   server  0.4.1  2
    consul03        195.1xx.35.xx3:8301  alive   server  0.4.1  2
    

    They just keep repeating 2014/11/05 13:09:41 [ERR] agent: failed to sync remote state: No cluster leader in the consul monitor output.

    All nodes are started with /usr/local/bin/consul agent -config-dir /etc/consul

    server 1

    {
      "advertise_addr": "195.1xx.35.xx1",
      "bind_addr": "195.1xx.35.xx1",
      "bootstrap_expect": 3,
      "client_addr": "0.0.0.0",
      "data_dir": "/opt/consul",
      "datacenter": "online",
      "domain": "consul",
      "log_level": "INFO",
      "ports": {
        "dns": 53
      },
      "recursor": "8.8.8.8",
      "rejoin_after_leave": true,
      "retry_join": [
        "195.1xx.35.xx1",
        "195.1xx.35.xx2",
        "195.1xx.35.xx3"
      ],
      "server": true,
      "start_join": [
        "195.1xx.35.xx1",
        "195.1xx.35.xx2",
        "195.1xx.35.xx3"
      ],
      "ui_dir": "/opt/consul/ui"
    }
    

    server 2

    {
      "advertise_addr": "195.1xx.35.xx2",
      "bind_addr": "195.1xx.35.xx2",
      "bootstrap_expect": 3,
      "client_addr": "0.0.0.0",
      "data_dir": "/opt/consul",
      "datacenter": "online",
      "domain": "consul",
      "log_level": "INFO",
      "ports": {
        "dns": 53
      },
      "recursor": "8.8.8.8",
      "rejoin_after_leave": true,
      "retry_join": [
        "195.1xx.35.xx1",
        "195.1xx.35.xx2",
        "195.1xx.35.xx3"
      ],
      "server": true,
      "start_join": [
        "195.1xx.35.xx1",
        "195.1xx.35.xx2",
        "195.1xx.35.xx3"
      ],
      "ui_dir": "/opt/consul/ui"
    }
    

    server 3

    {
      "advertise_addr": "195.1xx.35.xx3",
      "bind_addr": "195.1xx.35.xx3",
      "bootstrap_expect": 3,
      "client_addr": "0.0.0.0",
      "data_dir": "/opt/consul",
      "datacenter": "online",
      "domain": "consul",
      "log_level": "INFO",
      "ports": {
        "dns": 53
      },
      "recursor": "8.8.8.8",
      "rejoin_after_leave": true,
      "retry_join": [
        "195.1xx.35.xx1",
        "195.1xx.35.xx2",
        "195.1xx.35.xx3"
      ],
      "server": true,
      "start_join": [
        "195.1xx.35.xx1",
        "195.1xx.35.xx2",
        "195.1xx.35.xx3"
      ],
      "ui_dir": "/opt/consul/ui"
    }
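
    A hedged observation on the restart behaviour described above, assuming the servers were stopped in a way that triggered a graceful leave: a server that leaves gracefully removes itself from the raft peer set, and once all three have done so there may be no quorum left to elect from on the next start. Configuration keys exist (in releases that support them) to keep servers from leaving the peer set on shutdown:

        {
          "skip_leave_on_interrupt": true,
          "leave_on_terminate": false
        }

    If the peer set has already been emptied, the recovery route for this era is the outage-recovery procedure: stop all servers, write the full list of server addresses to raft/peers.json, and start them again.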
    
  • [Performance on large clusters] Performance degrades on health blocking queries to more than 682 instances

    [Performance on large clusters] Performance degrades on health blocking queries to more than 682 instances

    Overview of the Issue

    We had two incidents of high load on Consul, causing high write and read latency as well as a full Consul outage. After investigation we noticed that blocking queries against our two biggest services were hitting the hardcoded watchLimit value in https://github.com/hashicorp/consul/blob/master/agent/consul/state/state_store.go#L62. watchLimit is currently set to 2048, and since 3 channels are added per instance in this loop: https://github.com/hashicorp/consul/blob/master/agent/consul/state/catalog.go#L1722, a watched service can only have roughly 2048 / 3 ≈ 682 instances before hitting the limit.

    /v1/health/service/:service_name adds 3 watches per instance so the limit is ~682. /v1/catalog/service/:service_name only adds 1 watch per instance so its limit is 2048.

    We confirmed this theory by cutting down one of these big services to 681 instances. This greatly reduced the load on the Consul servers and restored the DC back to normal:

    image

    We later rolled out a version of Consul with this limit raised to 8192. This showed a big improvement in both latency and load, while roughly halving memory usage. We then scaled the service we had cut down back to its original 711 instances with minimal impact on the cluster:

    image

    Here is a profile taken during this incident:

    image

    The hot path is in blocking query resolution. With the less fine-grained watch, Store.ServiceNodes() is called on virtually every change happening in the cluster, generating the load seen above.

    Based on this information, crossing this limit even slightly greatly degrades Consul server performance (our biggest service has 1780 instances).

    We will soon provide a PR that allows configuring this limit.

    Reproduction Steps

    Register a service with more than 682 instances. Generate some load and run /health blocking queries against this service with many clients, in our case around 2K.
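
    For anyone reproducing this, a minimal sketch of the blocking-query pattern involved (the service name, index, and wait values are illustrative): each watcher long-polls the health endpoint with the index returned by its previous response, so every instance-level change fans out to all watchers at once.

        # The first call returns the current X-Consul-Index in the response headers.
        curl -s -o /dev/null -D - "http://localhost:8500/v1/health/service/big-service?passing" | grep -i x-consul-index

        # Subsequent calls block until that index changes (or the wait time elapses).
        curl -s "http://localhost:8500/v1/health/service/big-service?passing&index=12345&wait=5m" > /dev/null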

    Consul info for both Client and Server

    Server and client v1.2.3 with patches.

    Operating system and Environment details

    Windows and Linux.

  • grpc: switch servers and retry on error

    grpc: switch servers and retry on error

    Description

    This is the OSS portion of enterprise PR 3822; it has been reviewed thoroughly there.

    It adds a custom gRPC balancer that replicates the router's server cycling behavior. It also enables automatic retries for RESOURCE_EXHAUSTED errors, which we now get for free.

    Testing & Reproduction steps

    The balancer package has unit tests that spin up real gRPC servers and clients.

    Also manually tested in enterprise by:

    • Hacking the partition read endpoint to return RESOURCE_EXHAUSTED if req.Name != NodeName
    • Running two servers and a client agent
    • Running consul partition read server1 and consul partition read server2
    • Watching the agent logs and observing:
    2022-12-09T12:15:45.850Z [TRACE] agent.grpc.balancer: witnessed RPC error: target=consul://dc1.7105466e-b7d7-8240-a500-88a14461aa49/server.dc1 server=dc1-127.0.0.1:9102 error="rpc error: code = ResourceExhausted desc = you got rate limited, my dude"
    2022-12-09T12:15:45.850Z [DEBUG] agent.grpc.balancer: switching server: target=consul://dc1.7105466e-b7d7-8240-a500-88a14461aa49/server.dc1 from=dc1-127.0.0.1:9102 to=dc1-127.0.0.1:9101
    2022-12-09T12:15:45.850Z [TRACE] agent.grpc.balancer: sub-connection state changed: target=consul://dc1.7105466e-b7d7-8240-a500-88a14461aa49/server.dc1 server=dc1-127.0.0.1:9101 state=CONNECTING
    2022-12-09T12:15:45.851Z [TRACE] agent.grpc.balancer: sub-connection state changed: target=consul://dc1.7105466e-b7d7-8240-a500-88a14461aa49/server.dc1 server=dc1-127.0.0.1:9101 state=READY
    2022-12-09T12:16:12.049Z [ERROR] agent.http: Request error: method=GET url=/v1/partition/foo from=127.0.0.1:62096 error="Partition not found for \"server2\""
    

    Note that the balancer automatically switched connections and retried against the other server 🙌🏻

  • emit metrics for global rate limiting

    emit metrics for global rate limiting

    Description

    Emits metrics for the global rate limiting implementation.

    Testing & Reproduction steps

    • TODO: unit tests
    • Manually exceeded the rate limit for /v1/catalog/nodes via curl and then queried the /v1/agent/metrics endpoint (see the sketch after the sample output below):
            {
                "Name": "consul.consul.rate_limit",
                "Count": 27,
                "Rate": 2.7,
                "Sum": 27,
                "Min": 1,
                "Max": 1,
                "Mean": 1,
                "Stddev": 0,
                "Labels": {
                    "limit_type": "global/read",
                    "mode": "enforcing",
                    "op": "Catalog.ListNodes"
                }
            },
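
    A small sketch of that manual check (the endpoints are from the HTTP API; the loop count and jq filter are illustrative, and jq is assumed to be available):

        # Hammer a read endpoint until the global read limit kicks in...
        for i in $(seq 1 500); do curl -s -o /dev/null http://localhost:8500/v1/catalog/nodes; done

        # ...then look for the rate_limit counter in the agent's telemetry output.
        curl -s http://localhost:8500/v1/agent/metrics | jq '.Counters[] | select(.Name | test("rate_limit"))'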
    

    PR Checklist

    • [ ] updated test coverage
    • [ ] external facing docs updated
    • [ ] not a security concern
  • docs: Consul at scale guide

    docs: Consul at scale guide

    Description

    A guide to deploying Consul at scale was drafted for inclusion in our docs. This PR stages the guide in the "Architecture" section, as the guide is primarily concerned with how deployments at scale have performance impacts as a result of the architectural design, and provides architectural recommendations.

    Please note two unresolved questions from the drafting process. @jkirschner-hashicorp: could you weigh in on my two comments with unresolved questions before we merge?

    Links

    PR Checklist

    • [ ] updated test coverage
    • [X] external facing docs updated
    • [X] not a security concern
  • chore(deps-dev): bump husky from 4.3.8 to 8.0.3 in /website

    chore(deps-dev): bump husky from 4.3.8 to 8.0.3 in /website

    Bumps husky from 4.3.8 to 8.0.3.

    Release notes

    Sourced from husky's releases.

    v8.0.3

    • fix: add git not installed message #1208

    v8.0.2

    • docs: remove deprecated npm set-script

    v8.0.1

    • fix: use POSIX equality operator

    v8.0.0

    What's Changed

    Feats

    • feat: add husky - prefix to logged global error messages by @joshbalfour in typicode/husky#1092
    • feat: show PATH when command not found to improve debuggability
    • feat: drop Node 12 support
    • feat: skip install if $HUSKY=0

    Fixes

    Docs

    Chore

    v7.0.4

    No changes. Husky v7.0.3 was reverted, this version is the same as v7.0.2.

    v7.0.2

    Fix pre-commit hook in WebStorm (#1023)

    v7.0.1

    • Fix gracefully fail if Git command is not found #1003 (same as in v6)

    v7.0.0

    • Improve .husky/ directory structure. .husky/.gitignore is now unnecessary and can be removed.
    • Improve error output (shorter)
    • Update husky-init CLI
    • Update husky-4-to-7 CLI
    • Drop Node 10 support

    ... (truncated)

    Commits


    Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.


    Dependabot commands and options

    You can trigger Dependabot actions by commenting on this PR:

    • @dependabot rebase will rebase this PR
    • @dependabot recreate will recreate this PR, overwriting any edits that have been made to it
    • @dependabot merge will merge this PR after your CI passes on it
    • @dependabot squash and merge will squash and merge this PR after your CI passes on it
    • @dependabot cancel merge will cancel a previously requested merge and block automerging
    • @dependabot reopen will reopen this PR if it is closed
    • @dependabot close will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
    • @dependabot ignore this major version will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this minor version will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this dependency will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)
  • docs: fix broken links

    docs: fix broken links

    Change broken Consul tutorial links

    Description

    Fix broken Consul tutorial links.

    PR Checklist

    • [ ] updated test coverage
    • [x] external facing docs updated
    • [x] not a security concern
  • Refactoring the peering integ test to accommodate coming changes of o…

    Refactoring the peering integ test to accommodate coming changes of o…

    Description

    Refactoring the peering test in the container tests to accommodate coming changes for other upgrade scenarios.

    • Add a utils package under consul-containers/test that contains methods to set up various test scenarios. For example, BasicPeeringTwoClustersSetup is used by the peering test.
    • Deduplication: have a single CreatingPeeringClusterAndSetup replace CreatingAcceptingClusterAndSetup and CreateDialingClusterAndSetup.
    • Separate peering cluster creation from service registration. Previously, CreatingAcceptingClusterAndSetup and CreateDialingClusterAndSetup each created a cluster and registered a service. The updated code runs these steps sequentially in BasicPeeringTwoClustersSetup.

    Testing & Reproduction steps

    Links

    PR Checklist

    • [ ] updated test coverage
    • [ ] external facing docs updated
    • [x] not a security concern
Distributed reliable key-value store for the most critical data of a distributed system

etcd Note: The master branch may be in an unstable or even broken state during development. Please use releases instead of the master branch in order

Dec 28, 2022
HA LDAP based key/value solution for projects configuration storing with multi master replication support

Recon is the simple solution for storing configs of your application. There are no specified instruments, no specified data protocols. For the full power of Recon you only need curl.

Jun 15, 2022
Distributed cache and in-memory key/value data store. It can be used both as an embedded Go library and as a language-independent service.

Olric Distributed cache and in-memory key/value data store. It can be used both as an embedded Go library and as a language-independent service. With

Jan 4, 2023
GhostDB is a distributed, in-memory, general purpose key-value data store that delivers microsecond performance at any scale.
GhostDB is a distributed, in-memory, general purpose key-value data store that delivers microsecond performance at any scale.

GhostDB is designed to speed up dynamic database or API driven websites by storing data in RAM in order to reduce the number of times an external data source such as a database or API must be read. GhostDB provides a very large hash table that is distributed across multiple machines and stores large numbers of key-value pairs within the hash table.

Jan 6, 2023
An in-memory key:value store/cache (similar to Memcached) library for Go, suitable for single-machine applications.

go-cache go-cache is an in-memory key:value store/cache similar to memcached that is suitable for applications running on a single machine. Its major

Jan 3, 2023
Distributed disk storage database based on Raft and Redis protocol.
Distributed disk storage database based on Raft and Redis protocol.

IceFireDB Distributed disk storage system based on Raft and RESP protocol. High performance Distributed consistency Reliable LSM disk storage Cold and

Dec 27, 2022
Distributed, fault-tolerant key-value storage written in go.
Distributed, fault-tolerant key-value storage written in go.

A simple, distributed, fault-tolerant key-value storage inspired by Redis. It uses the Raft protocol as its consensus algorithm. It supports the following data structures: String, Bitmap, Map, List.

Jan 3, 2023
Distributed key-value store
Distributed key-value store

Keva Distributed key-value store General Demo Start the server docker-compose up --build Insert data curl -XPOST http://localhost:5555/storage/test1

Nov 15, 2021
rgkv is a distributed kv storage service using raft consensus algorithm.
rgkv is a distributed kv storage service using raft consensus algorithm.

rgkv rgkv is a distributed kv storage service using raft consensus algorithm. Get/put/append operation High Availability Sharding Linearizability Tabl

Jan 15, 2022
A simple distributed kv system from scratch

SimpleKV A simple distributed key-value storage system based on bitcask from scratch. Target Here are some basic requirements: LRU Cache. An index sys

Apr 21, 2022
Simple Distributed key-value database (in-memory/disk) written with Golang.

Kallbaz DB Simple Distributed key-value store (in-memory/disk) written with Golang. Installation go get github.com/msam1r/kallbaz-db Usage API // Get

Jan 18, 2022
Implementation of distributed key-value system based on TiKV

Distributed_key-value_system A naive implementation of distributed key-value system based on TiKV Features Features of this work are listed below: Dis

Mar 7, 2022
Kdmq - Tool to query KDM data for a given Rancher version

kdmq (kdm query) Tool to query KDM data for a given Rancher version, think of: W

Feb 1, 2022
A key-value db api with multiple storage engines and key generation
A key-value db api with multiple storage engines and key generation

Jet is a deadly-simple key-value api. The main goals of this project are : Making a simple KV tool for our other projects. Learn tests writing and git

Apr 5, 2022
CrankDB is an ultra fast and very lightweight Key Value based Document Store.

CrankDB is an ultra fast, extremely lightweight Key Value based Document Store.

Apr 12, 2022
NutsDB a simple, fast, embeddable and persistent key/value store written in pure Go.
NutsDB a simple, fast, embeddable and persistent key/value store written in pure Go.

A simple, fast, embeddable, persistent key/value store written in pure Go. It supports fully serializable transactions and many data structures such as list, set, sorted set.

Jan 9, 2023
rosedb is a fast, stable and embedded key-value (k-v) storage engine based on bitcask.
rosedb is a fast, stable and embedded key-value (k-v) storage engine based on bitcask.

rosedb is a fast, stable and embedded key-value (k-v) storage engine based on bitcask. Its on-disk files are organized as WAL(Write Ahead Log) in LSM trees, optimizing for write throughput.

Dec 28, 2022
KV - a toy in-memory key value store built primarily in an effort to write more go and check out grpc

KV KV is a toy in-memory key value store built primarily in an effort to write more go and check out grpc. This is still a work in progress. // downlo

Dec 30, 2021
The Consul API Gateway is a dedicated ingress solution for intelligently routing traffic to applications running on a Consul Service Mesh.

The Consul API Gateway is a dedicated ingress solution for intelligently routing traffic to applications running on a Consul Service Mesh.

Dec 14, 2022
BlobStore is a highly reliable, highly available and ultra-large scale distributed storage system

BlobStore Overview Documents Build BlobStore Deploy BlobStore Manage BlobStore License Overview BlobStore is a highly reliable,highly available and ul

Oct 10, 2022