How to configure consul server to garbage collect dead nodes from consul catalog?

wfeng-fsde · February 28, 2021, 12:26am

We are currently running Consul 1.9.3, but we have been using Consul since v1.5.x. It could be a misconfiguration on our side, but I noticed that our Consul servers do not garbage collect dead Consul nodes (clients) from the Consul catalog. Recently I was doing destructive testing on ZooKeeper, killing servers constantly for days, and all those dead servers (which ran a Consul client while they were alive) are still visible in the Consul server catalog days later. I can force a cleanup by doing a rolling restart of the Consul servers.

I tried to find a configuration option in the docs that controls this, with no luck. So my question is: is there such a config that I can set? How can I ensure that Consul servers remove dead clients in a reasonable time frame?

(BTW, I think this might be a Consul bug. For our ZooKeeper nodes, we have 2 custom health checks registered in Consul. I’ve noticed that after I kill a healthy node, the Serf Health Status shows "Agent not alive or reachable". However, the 2 custom health checks still show as passing in Consul for this node. Could that be the reason Consul is not removing the node from the catalog?)

Wolfsrudel · February 28, 2021, 9:40am

The first thing that comes to mind is Autopilot.

What does

consul operator autopilot get-config

say?

wfeng-fsde · February 28, 2021, 5:22pm

This is what I got:

# consul operator autopilot get-config
CleanupDeadServers = true
LastContactThreshold = 200ms
MaxTrailingLogs = 250
MinQuorum = 0
ServerStabilizationTime = 10s
RedundancyZoneTag = ""
DisableUpgradeMigration = false
UpgradeVersionTag = ""

wfeng-fsde · March 3, 2021, 11:55pm

Any ideas how I can fix this without having to rotate the Consul servers? I’ve got 300+ dead nodes in my ZooKeeper cluster catalog at the moment.

Ranjandas · March 4, 2021, 1:10am

Hi @wfeng-fsde,

Just trying to understand this better. Are you seeing that the nodes in failed state still show up in the catalog even after 3 days (72 hours)? The default reconnect_timeout is 72 hours.

Considering that your failed nodes are client agents, you could try setting advertise_reconnect_timeout. This is configurable per agent and for client agents only.

I tried this config and I could see that the nodes are getting reaped close to the value configured for advertise_reconnect_timeout.

2021-03-04T11:59:54.783+1100 [DEBUG] agent.server.memberlist.lan: memberlist: Failed ping: dc1-cli-6 (timeout reached)
2021-03-04T11:59:55.295+1100 [INFO]  agent.server.memberlist.lan: memberlist: Suspect dc1-cli-6 has failed, no acks received
2021-03-04T11:59:58.446+1100 [INFO]  agent.server.memberlist.lan: memberlist: Marking dc1-cli-6 as failed, suspect timeout reached (2 peer confirmations)
2021-03-04T11:59:58.446+1100 [INFO]  agent.server.serf.lan: serf: EventMemberFailed: dc1-cli-6 192.168.42.16
2021-03-04T11:59:58.447+1100 [INFO]  agent.server: member failed, marking health critical: member=dc1-cli-6
2021-03-04T11:59:58.736+1100 [DEBUG] agent.server.memberlist.wan: memberlist: Stream connection from=192.168.42.2:51918
2021-03-04T12:00:07.382+1100 [DEBUG] agent.server.serf.lan: serf: forgoing reconnect for random throttling
2021-03-04T12:00:07.390+1100 [INFO]  agent.server.serf.lan: serf: EventMemberReap: dc1-cli-6
2021-03-04T12:00:07.393+1100 [INFO]  agent.server: deregistering member: member=dc1-cli-6 reason=reaped

In the above example, I had configured advertise_reconnect_timeout = "1s".
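For nodes that are already dead, the timeout above cannot be applied after the fact, since the agent itself advertises it. A minimal sketch of a one-off cleanup, assuming the dead nodes still show up as failed in the output of consul members: filter that output for failed members and run consul force-leave -prune on each one. The function names failed_nodes and prune_failed and the DRY_RUN variable are made up for this example; only consul members and consul force-leave -prune are real CLI commands.

```shell
#!/usr/bin/env bash
# Print the node names of members whose Status column is "failed".
# Expects `consul members` output on stdin (header: Node Address Status ...).
failed_nodes() {
  awk 'NR > 1 && $3 == "failed" { print $1 }'
}

# Read node names on stdin and prune each one.
# DRY_RUN (hypothetical knob, default 1) only prints the commands.
prune_failed() {
  local node
  while read -r node; do
    if [ "${DRY_RUN:-1}" = "1" ]; then
      echo "consul force-leave -prune ${node}"
    else
      # </dev/null keeps the command from consuming the loop's stdin.
      consul force-leave -prune "${node}" </dev/null
    fi
  done
}

# Usage (needs a reachable agent):
#   consul members | failed_nodes | prune_failed             # preview only
#   consul members | failed_nodes | DRY_RUN=0 prune_failed   # actually prune
```

Running the preview first is worthwhile, because force-leave on a node that is only briefly unreachable will make it rejoin as a new member later.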

I am keen to know whether this fixes your problem.

Also have a look at this thread if you haven’t already: Configure time to cleanup failed consul clients when working with AWS Spot instances · Issue #2982 · hashicorp/consul · GitHub

wfeng-fsde · March 4, 2021, 4:59am

@Ranjandas Thanks a lot for that tip. It looks to be the right way to go. However, when I tried it by putting it in my config file:

datacenter = "my-data-center"
primary_datacenter = "my-primary-data-center"
server = false
...
advertise_reconnect_timeout = "60s"
retry_join = ["<my-consul-dns>"]
ports {
  grpc = 8502
}
connect {
  enabled = true
}
...

I got the following error:

$ /usr/local/bin/consul agent -ui -config-dir local/config
==> Error parsing local/config/config.hcl: 1 error occurred:
* invalid config key advertise_reconnect_timeout

I’ve read the doc over again and can’t figure out what I am doing wrong here.

Ranjandas · March 4, 2021, 5:39am

What version of Consul are you running? I tried this on 1.9.3 and it works without any issues.

wfeng-fsde · March 14, 2021, 10:36pm

Sorry, my bad. I thought I was running 1.9.3, but I was actually running 1.8.3. I switched to 1.9.3 and everything works.
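To catch this kind of mismatch before starting the agent, here is a rough sketch of a version check on the installed binary. The 1.9.0 threshold is an assumption consistent with this thread (the key was rejected on 1.8.3 and accepted on 1.9.3), so check the release notes for the exact version. The helper names version_ge and consul_version are made up for the example, and sort -V assumes GNU coreutils.

```shell
#!/usr/bin/env bash
# Succeed if version $1 is greater than or equal to version $2.
# Relies on `sort -V` (GNU coreutils version sort).
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Extract "1.9.3" from the first line of `consul version` ("Consul v1.9.3").
consul_version() {
  sed -n '1s/^Consul v\([0-9][0-9.]*\).*/\1/p'
}

# Usage:
#   v=$(consul version | consul_version)
#   version_ge "$v" "1.9.0" || echo "Consul ${v} may not accept advertise_reconnect_timeout"
#   consul validate local/config   # reports invalid config keys without starting the agent
```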
