We are currently running consul 1.9.3 but we have been using consul since v1.5.x. It could be because of our misconfiguration, but I noticed that our consul servers do not garbage collect the dead consul nodes (clients) from consul catalog. Recently I was doing destructive testing on zookeeper where I was killing servers constantly for days, and all those dead servers (which had consul client on them when they were alive) are still visible in consul server catalog after days. I can force clean them by doing a rolling restart of consul servers.
I tried to find a configuration option in the docs that might control this, with no luck. So my question is: is there such a config that I can set? How can I ensure that consul servers remove dead clients in a reasonable time frame?
(BTW, I think this might be a consul bug. For our zookeeper, we have 2 custom health checks registered in consul. I’ve noticed that after I kill a healthy node, the Serf Health Status shows "Agent not alive or reachable". However, the 2 custom health checks still show as passing in consul for this node. Could that be the reason that consul is not removing the node from the catalog?)
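To make the observation above easier to check across many nodes, here is a minimal Python sketch. It assumes the input is a list of check entries shaped like the response of Consul's /v1/health/state/any endpoint (fields Node, CheckID, Name, Status) and that the serf check uses the CheckID serfHealth. The node and check names in the sample are hypothetical, not taken from the cluster in this thread.

```python
from collections import defaultdict


def nodes_with_stale_passing_checks(checks):
    """Return {node: [check names]} for nodes whose serfHealth check is
    critical while other checks on that node still report passing."""
    by_node = defaultdict(list)
    for check in checks:
        by_node[check["Node"]].append(check)

    stale = {}
    for node, node_checks in by_node.items():
        serf_down = any(
            c["CheckID"] == "serfHealth" and c["Status"] == "critical"
            for c in node_checks
        )
        passing = sorted(
            c["Name"]
            for c in node_checks
            if c["CheckID"] != "serfHealth" and c["Status"] == "passing"
        )
        if serf_down and passing:
            stale[node] = passing
    return stale


# Hypothetical sample data; real input would be the parsed JSON response.
sample = [
    {"Node": "zk-1", "CheckID": "serfHealth", "Name": "Serf Health Status", "Status": "critical"},
    {"Node": "zk-1", "CheckID": "zk-ruok", "Name": "zk ruok", "Status": "passing"},
    {"Node": "zk-1", "CheckID": "zk-mode", "Name": "zk mode", "Status": "passing"},
    {"Node": "zk-2", "CheckID": "serfHealth", "Name": "Serf Health Status", "Status": "passing"},
    {"Node": "zk-2", "CheckID": "zk-ruok", "Name": "zk ruok", "Status": "passing"},
]
print(nodes_with_stale_passing_checks(sample))
```

A non-empty result only confirms the symptom described above. It does not by itself show that the passing checks are what keeps the node in the catalog.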
The first thing that comes to mind is autopilot. Check its configuration with:
consul operator autopilot get-config
This is what I got:
# consul operator autopilot get-config
CleanupDeadServers = true
LastContactThreshold = 200ms
MaxTrailingLogs = 250
MinQuorum = 0
ServerStabilizationTime = 10s
RedundancyZoneTag = ""
DisableUpgradeMigration = false
UpgradeVersionTag = ""
Any ideas how I can fix this without having to rotate the consul servers? I’ve got 300+ dead nodes in my zookeeper cluster catalog at the moment.
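One possible way to clear a backlog like this without restarting servers is consul force-leave with the -prune flag. The sketch below is a hedged illustration, not something tested in this thread: it assumes the -prune flag is available in the running Consul version (check consul force-leave -h) and that consul members prints Node, Address, Status and Type as its first four columns. The member names and addresses are hypothetical. The function only builds the commands and does not run them.

```python
def prune_commands(members_output):
    """Build one `consul force-leave -prune` command per failed client,
    given the text printed by `consul members`."""
    commands = []
    lines = members_output.strip().splitlines()
    for line in lines[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 4:
            continue  # ignore blank or malformed rows
        node, _address, status, agent_type = fields[:4]
        if status == "failed" and agent_type == "client":
            commands.append(["consul", "force-leave", "-prune", node])
    return commands


# Hypothetical `consul members` output.
members = """
Node      Address         Status  Type    Build  Protocol  DC   Segment
consul-1  10.0.0.1:8301   alive   server  1.9.3  2         dc1  <all>
zk-1      10.0.0.11:8301  failed  client  1.9.3  2         dc1  <default>
zk-2      10.0.0.12:8301  alive   client  1.9.3  2         dc1  <default>
"""
for cmd in prune_commands(members):
    print(" ".join(cmd))
```

Printing the commands first makes it possible to review the list before anything is removed.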
Just trying to understand this better. Are you seeing that the servers in failed state are still showing up in the catalog even after 3 days (72 hours)? The default reconnect_timeout is 72 hours.
Considering that your failed nodes are client agents, you could try setting the advertise_reconnect_timeout. This is configurable per agent and for client agents only.
I tried this config and I could see that the nodes are getting reaped close to the value configured for advertise_reconnect_timeout:
2021-03-04T11:59:54.783+1100 [DEBUG] agent.server.memberlist.lan: memberlist: Failed ping: dc1-cli-6 (timeout reached)
2021-03-04T11:59:55.295+1100 [INFO] agent.server.memberlist.lan: memberlist: Suspect dc1-cli-6 has failed, no acks received
2021-03-04T11:59:58.446+1100 [INFO] agent.server.memberlist.lan: memberlist: Marking dc1-cli-6 as failed, suspect timeout reached (2 peer confirmations)
2021-03-04T11:59:58.446+1100 [INFO] agent.server.serf.lan: serf: EventMemberFailed: dc1-cli-6 192.168.42.16
2021-03-04T11:59:58.447+1100 [INFO] agent.server: member failed, marking health critical: member=dc1-cli-6
2021-03-04T11:59:58.736+1100 [DEBUG] agent.server.memberlist.wan: memberlist: Stream connection from=192.168.42.2:51918
2021-03-04T12:00:07.382+1100 [DEBUG] agent.server.serf.lan: serf: forgoing reconnect for random throttling
2021-03-04T12:00:07.390+1100 [INFO] agent.server.serf.lan: serf: EventMemberReap: dc1-cli-6
2021-03-04T12:00:07.393+1100 [INFO] agent.server: deregistering member: member=dc1-cli-6 reason=reaped
In the above example, I had configured advertise_reconnect_timeout = "1s".
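To measure how long a node actually stays in the failed state before it is reaped, the serf events in the server log can be compared. This is a small Python sketch that assumes the log line layout shown in this thread (timestamp, level, logger name, then "serf: EventMemberFailed" or "serf: EventMemberReap"). The function name and regular expression are illustrative.

```python
import re
from datetime import datetime

# Matches the serf event lines as they appear in the log excerpt.
LOG_LINE = re.compile(
    r"^(\S+) \[\w+\]\s+\S+ serf: (EventMemberFailed|EventMemberReap): (\S+)"
)


def seconds_until_reaped(log_lines, member):
    """Seconds between a member's EventMemberFailed and the next
    EventMemberReap, or None if either event is missing."""
    failed_at = None
    for line in log_lines:
        match = LOG_LINE.match(line)
        if not match or match.group(3) != member:
            continue
        timestamp = datetime.strptime(match.group(1), "%Y-%m-%dT%H:%M:%S.%f%z")
        if match.group(2) == "EventMemberFailed":
            failed_at = timestamp
        elif failed_at is not None:
            return (timestamp - failed_at).total_seconds()
    return None


log = [
    "2021-03-04T11:59:58.446+1100 [INFO] agent.server.serf.lan: serf: EventMemberFailed: dc1-cli-6 192.168.42.16",
    "2021-03-04T12:00:07.382+1100 [DEBUG] agent.server.serf.lan: serf: forgoing reconnect for random throttling",
    "2021-03-04T12:00:07.390+1100 [INFO] agent.server.serf.lan: serf: EventMemberReap: dc1-cli-6",
]
print(seconds_until_reaped(log, "dc1-cli-6"))
```

For the excerpt above this gives a little under nine seconds between the failure and the reap.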
I am keen to know whether this fixes your problem.
Also have a look at this thread if you haven’t already: "Configure time to cleanup failed consul clients when working with AWS Spot instances" (Issue #2982, hashicorp/consul on GitHub).
@Ranjandas Thanks a lot for that tip. It looks to be the right way to go. However, when I tried it by putting it in my config file:
datacenter = "my-data-center"
primary_datacenter = "my-primary-data-center"
server = false
advertise_reconnect_timeout = "60s"
retry_join = ["<my-consul-dns>"]
grpc = 8502
enabled = true
I got the following error:
$ /usr/local/bin/consul agent -ui -config-dir local/config
==> Error parsing local/config/config.hcl: 1 error occurred:
* invalid config key advertise_reconnect_timeout
I’ve read the doc over again and can’t figure out what I am doing wrong here.
What version of Consul are you running? I tried this on 1.9.3 and it seems to be working without any issues.
Sorry, my bad. I thought I was running 1.9.3, but was actually running 1.8.3. Switched to 1.9.3 and everything works.
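Since the "invalid config key" error came down to the agent version, a quick pre-flight check can catch the mismatch before the config is rolled out. The Python sketch below parses the first line of consul version output. Note the assumption: this thread only establishes that the key is rejected on 1.8.3 and accepted on 1.9.3, so the 1.9.0 cutoff used here is an assumption to verify against the changelog.

```python
import re

# Assumed minimum version; the thread only confirms 1.8.3 fails and 1.9.3 works.
MIN_VERSION = (1, 9, 0)


def supports_advertise_reconnect_timeout(version_output):
    """Return True if the version in `consul version` output is at
    least MIN_VERSION."""
    match = re.search(r"v?(\d+)\.(\d+)\.(\d+)", version_output)
    if not match:
        raise ValueError("no version number found in: %r" % version_output)
    version = tuple(int(part) for part in match.groups())
    return version >= MIN_VERSION


print(supports_advertise_reconnect_timeout("Consul v1.8.3"))
print(supports_advertise_reconnect_timeout("Consul v1.9.3"))
```

Running the check against every client before deploying the new key avoids agents failing to start on a config parse error.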