Hi there,
I’m running into an issue where my 5-node Consul server setup is not fault tolerant.
I’m running Consul on several Kubernetes clusters, using a configuration
inspired by the Helm chart but adjusted for our needs.
One cluster performs as expected: it correctly elects a new leader on node
failure (or rolling pod restarts) and gives me little trouble. I have a similar
setup on a larger cluster with a much higher workload, and that’s where I run
into problems: the cluster fails to keep its servers healthy. In particular,
no node is able to gain Voter status after joining the cluster:
# output of `consul operator raft list-peers`:
State     Voter  RaftProtocol
leader    true   3
follower  false  3
follower  false  3
follower  false  3
follower  false  3
When the servers are restarted, the leader drops out and the cluster does not
elect a new leader; instead it waits until the restarted leader rejoins.
Doing a curl health check against autopilot, I see that all servers except the
leader are marked as unhealthy, even though they otherwise report the same
status and indices. An unhealthy but otherwise fully responsive node looks
roughly like this in the autopilot health check
(curl localhost:8500/v1/operator/autopilot/health | jq .):
{
  "ID": "...",
  "Name": "...",
  "Address": "...",
  "SerfStatus": "alive",
  "Version": "1.13.3",
  "Leader": false,
  "LastContact": "4.947865ms",
  "LastTerm": 4254,
  "LastIndex": 616049522,
  "Healthy": false,
  "Voter": false,
  "StableSince": "2022-11-11T04:51:26Z"
},
and the single healthy node (the leader) reports this instead:
{
  "ID": "...",
  "Name": "...",
  "Address": "...",
  "SerfStatus": "alive",
  "Version": "1.13.3",
  "Leader": true,
  "LastContact": "0s",
  "LastTerm": 4254,
  "LastIndex": 616049522,
  "Healthy": true,
  "Voter": true,
  "StableSince": "2022-11-11T04:51:26Z"
},
One way to get everything into a temporary state where each node is able to
vote for a leader is to perform a peers.json recovery.
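For context, the peers.json I drop into each server’s raft/ directory (with all
servers stopped) looks roughly like the sketch below; the IDs and addresses are
placeholders for my real node IDs and pod IPs, since Raft protocol 3 expects
the node-id rather than an IP in the id field:
[
  {
    "id": "<node-id of server 0>",
    "address": "<pod IP of server 0>:8300",
    "non_voter": false
  },
  {
    "id": "<node-id of server 1>",
    "address": "<pod IP of server 1>:8300",
    "non_voter": false
  }
]
(…and so on for the remaining three servers.)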
When doing this, however, even though the nodes become Voters, they still fail
to become Healthy. If I then perform any maintenance on the Consul nodes (say,
a rolling restart), the restarted nodes fail to regain Voter status and I’m
back to a single node that is both Healthy and a Voter: the leader.
I’m wondering why my nodes fail to become healthy. I’ve tried adjusting the
autopilot config, leave_on_terminate, raft performance settings, and various
timeouts back and forth, but none of the knobs I turned helped. The behavior is
reproducible even with the most barebones configuration:
exec /bin/consul agent \
  -server \
  -advertise="${POD_IP}" \
  -bind=0.0.0.0 \
  -bootstrap-expect=5 \
  -client=0.0.0.0 \
  -ui \
  -retry-join="${DNS_OF_STATEFULSET_INSTANCE_0}" \
  -retry-join="${DNS_OF_STATEFULSET_INSTANCE_1}" \
  -retry-join="${DNS_OF_STATEFULSET_INSTANCE_2}" \
  -retry-join="${DNS_OF_STATEFULSET_INSTANCE_3}" \
  -retry-join="${DNS_OF_STATEFULSET_INSTANCE_4}"
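To give an idea of the tuning I’ve attempted, the autopilot knobs I’ve been
adjusting look roughly like this; the values are illustrative, not my exact
settings:
# example values only – adjusting autopilot thresholds at runtime
consul operator autopilot set-config \
  -last-contact-threshold=500ms \
  -max-trailing-logs=500000 \
  -server-stabilization-time=30s
along with agent-level settings such as performance.raft_multiplier and
leave_on_terminate in the server configuration, again with various values.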
Logs occasionally (very rarely) mention some connection failures, but these
don’t persist. In general the logs seem to agree that everything is fine after
a restart:
2022-11-11T04:29:37.148Z [INFO] agent: Synced node info
2022-11-11T04:29:37.183Z [INFO] agent: (LAN) joined: number_of_nodes=5
2022-11-11T04:29:37.183Z [INFO] agent: Join cluster completed. Synced with initial agents: cluster=LAN num_agents=5
What also fully perplexes me is that, as mentioned earlier, the exact same
setup runs on a lower-load cluster and does not exhibit this behavior.
Is there a way to get additional meta-information to understand why my servers
will not go into a healthy state? Are there any other performance or timeout
knobs I can tune? Any help is very much appreciated.