Unable to make a fault-tolerant 5-node Consul server setup

Hi there,

I’m running into an issue with Consul where a 5-node server setup fails to be
fault tolerant.

I’m running Consul on several Kubernetes clusters, using a configuration
inspired by the Helm chart but adjusted for our needs. One of these clusters
performs as expected: it correctly elects a leader on node failure (or rolling
pod restarts) and does not give me much trouble. I have a similar setup on a
larger cluster with a much higher workload, and that’s where I’m running into
issues: that cluster fails to keep its nodes healthy. In particular, no node is
able to gain Voter status after joining the cluster:

# output of `consul operator raft list-peers`:
State     Voter  RaftProtocol
leader    true   3
follower  false  3
follower  false  3
follower  false  3
follower  false  3

When restarting the servers, the leader drops out and the cluster does not
elect a new leader, but instead waits until the restarted leader rejoins.

Doing a curl health check against the autopilot endpoint, I see that all
servers except the leader are marked as unhealthy, even though they otherwise
report the same status and indices.

An unhealthy, but otherwise fully responsive, node looks roughly like this in the
autopilot health check (curl localhost:8500/v1/operator/autopilot/health | jq .):

{
      "ID": "...",
      "Name": "...",
      "Address": "...",
      "SerfStatus": "alive",
      "Version": "1.13.3",
      "Leader": false,
      "LastContact": "4.947865ms",
      "LastTerm": 4254,
      "LastIndex": 616049522,
      "Healthy": false,
      "Voter": false,
      "StableSince": "2022-11-11T04:51:26Z"
    },

and the single healthy node comes up with this info:

{
      "ID": "...",
      "Name": "...",
      "Address": "...",
      "SerfStatus": "alive",
      "Version": "1.13.3",
      "Leader": true,
      "LastContact": "0s",
      "LastTerm": 4254,
      "LastIndex": 616049522,
      "Healthy": true,
      "Voter": true,
      "StableSince": "2022-11-11T04:51:26Z"
    },

One way to get everything into a temporary state where each node is able to
vote for a leader is to do a peers.json recovery (a rough sketch of the file is
included below). Even then, however, while the nodes do become Voters, they
still fail to become Healthy. And if I perform any maintenance on the Consul
nodes afterwards (say, a rolling restart), the restarted nodes fail to regain
Voter status and I’m back to a single node that is both the Healthy leader and
the only Voter.
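
For reference, the peers.json file I used follows the documented layout for
Raft protocol version 3: a JSON array with one entry per server, each carrying
the server’s node ID, its Raft address, and a non_voter flag. The IDs and
addresses below are placeholders, not my real values:

[
  {
    "id": "00000000-0000-0000-0000-000000000001",
    "address": "10.0.0.1:8300",
    "non_voter": false
  },
  {
    "id": "00000000-0000-0000-0000-000000000002",
    "address": "10.0.0.2:8300",
    "non_voter": false
  },
  ...
]

The file goes into the raft/ subdirectory of each server’s data dir before the
servers are started back up.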

I’m wondering why my nodes fail to become healthy. I’ve tried adjusting the
autopilot config, leave_on_terminate, Raft performance settings, and various
timeouts back and forth, but no knob I could twist has helped. In general, this
is reproducible even with the most barebones of configurations:

exec /bin/consul agent -server \
  -advertise="${POD_IP}" \
  -bind=0.0.0.0 \
  -bootstrap-expect=5 \
  -client=0.0.0.0 \
  -ui \
  -retry-join=${DNS_OF_STATEFULSET_INSTANCE_0} \
  -retry-join=${DNS_OF_STATEFULSET_INSTANCE_1} \
  -retry-join=${DNS_OF_STATEFULSET_INSTANCE_2} \
  -retry-join=${DNS_OF_STATEFULSET_INSTANCE_3} \
  -retry-join=${DNS_OF_STATEFULSET_INSTANCE_4}

The logs occasionally (very rarely) mention some connection failures, but these
don’t persist. In general, the logs seem to agree that everything is fine after
a restart:

2022-11-11T04:29:37.148Z [INFO]  agent: Synced node info
2022-11-11T04:29:37.183Z [INFO]  agent: (LAN) joined: number_of_nodes=5
2022-11-11T04:29:37.183Z [INFO]  agent: Join cluster completed. Synced with initial agents: cluster=LAN num_agents=5

Additionally, what fully perplexes me is that, as mentioned earlier, the exact
same setup runs on a lower-load cluster and does not exhibit this behavior.

Is there a way to get additional meta-information to understand why my servers
will not go into a healthy state? Any other performance or timeout knobs I can
tune? Any help is very much appreciated.

Please use the consul operator autopilot get-config command - this will show the thresholds that are being used to determine healthiness. Also, consul operator autopilot state will show you nicely formatted information about your nodes, rather than you needing to pipe through jq and read the raw JSON.

Those autopilot configs are currently at the defaults for 1.13. Unfortunately, neither command gives me any introspection into why the nodes are unhealthy. Example output:

node-uuid-...:
      Name:            ...
      Address:         ...:8300
      Version:         1.13.3
      Status:          non-voter
      Node Type:       voter
      Node Status:     alive
      Healthy:         false
      Last Contact:    19.768936ms
      Last Term:       4255
      Last Index:      617416869
      Meta
         "consul-network-segment": ""

Note that it shares the last term and index with the leader, which is alive, healthy, and a voter.

Please show them anyway. I’m trying to prove they haven’t been accidentally reset to zeros, as I have seen happen.

The reasons why nodes can be considered unhealthy are documented at Autopilot | Consul | HashiCorp Developer, so the important thing is to determine which condition is failing.


Oh, they are indeed zeros. I thought that just meant the defaults would be used for those values? I will try to override them with the defaults from the docs and see what happens. For reference, this is what get-config currently returns:

CleanupDeadServers = true
LastContactThreshold = 0s
MaxTrailingLogs = 0
MinQuorum = 0
ServerStabilizationTime = 0s
RedundancyZoneTag = ""
DisableUpgradeMigration = false
UpgradeVersionTag = ""

Appreciate your help!

Yes! That was indeed the issue!

I first set the operator config via the command line, and then also put the same values in the server config file (roughly as sketched below), and everything basically fixed itself immediately.
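
For completeness, the equivalent settings in the server config file live under
the autopilot stanza; a minimal sketch in JSON config form, again just using
the documented defaults:

{
  "autopilot": {
    "cleanup_dead_servers": true,
    "last_contact_threshold": "200ms",
    "max_trailing_logs": 250,
    "server_stabilization_time": "10s"
  }
}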

Double-checking, I now see that our lower-load cluster indeed did not have its values reset like this.

Very much appreciate your help!