Failed leadership election with three node cluster in GKE (Consul v1.5.2)

Hi there.

Yesterday we experienced a failed cluster leadership election during a routine node pool upgrade in GKE. This left the cluster without a leader for 30+ minutes.

The timeline of events is as follows:

2023-02-14 07:40:12.638 consul-consul-server-2 2023/02/14 07:40:12 [INFO] agent: Caught signal: terminated — consul-server-2 (the leader) goes down
2023-02-14 07:40:15.218 consul-consul-server-1 2023/02/14 07:40:15 [INFO] consul: New leader elected: consul-consul-server-1 — consul-server-1 becomes leader
2023-02-14 07:46:59.824 consul-consul-server-2 2023/02/14 07:46:59 [INFO] agent: (LAN) joined: 3 — consul-server-2 comes back online
2023-02-14 07:47:01.233 consul-consul-server-1 2023/02/14 07:47:01 [INFO] agent: Caught signal: terminated — consul-server-1 (the leader) goes down
2023-02-14 07:47:09.709 consul-consul-server-0 2023/02/14 07:47:09 [WARN] raft: Election timeout reached, restarting election — consul-server-0 is unable to elect a new leader as it is not talking to consul-server-2

From the logs, it looks like consul-server-0 and consul-server-2 were then unable to elect a new leader once consul-server-1 went down.

What we don’t understand at this point is how the cluster got itself into this state.

Autopilot config:

CleanupDeadServers = true
LastContactThreshold = 200ms
MaxTrailingLogs = 250
ServerStabilizationTime = 10s
RedundancyZoneTag = ""
DisableUpgradeMigration = false
UpgradeVersionTag = ""
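
For anyone who wants to compare against their own cluster, the same settings can be dumped with the operator CLI:

# Print the cluster's current autopilot configuration
consul operator autopilot get-config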

Full logs (newest first):

"consul-consul-server-0","    2023/02/14 07:47:08 [WARN]  raft: Election timeout reached, restarting election"
"consul-consul-server-1","    2023/02/14 07:47:07 [ERR] agent: Coordinate update error: No cluster leader"
"consul-consul-server-1","    2023/02/14 07:47:07 [WARN] consul: error getting server health from ""consul-consul-server-2"": rpc error getting client: failed to get conn: dial tcp <nil>->10.239.114.198:8300: i/o timeout"
"consul-consul-server-0","    2023/02/14 07:47:06 [INFO]  raft: Node at 10.239.113.198:8300 [Candidate] entering Candidate state in term 1621"
"consul-consul-server-0","    2023/02/14 07:47:06 [WARN]  raft: Election timeout reached, restarting election"
"consul-consul-server-2","    2023/02/14 07:47:05 [INFO] memberlist: Suspect consul-consul-server-1 has failed, no acks received"
"consul-consul-server-0","    2023/02/14 07:47:05 [INFO]  raft: Node at 10.239.113.198:8300 [Candidate] entering Candidate state in term 1620"
"consul-consul-server-0","    2023/02/14 07:47:05 [WARN]  raft: Election timeout reached, restarting election"
"consul-consul-server-0","    2023/02/14 07:47:04 [INFO]  raft: Node at 10.239.113.198:8300 [Candidate] entering Candidate state in term 1619"
"consul-consul-server-0","    2023/02/14 07:47:04 [WARN]  raft: Election timeout reached, restarting election"
"consul-consul-server-0","    2023/02/14 07:47:02 [ERROR] raft: Failed to make RequestVote RPC to {Voter e6758c6f-e4ca-cb09-62a6-ff6a209747f8 10.239.112.134:8300}: EOF"
"consul-consul-server-1","    2023/02/14 07:47:02 [ERR] raft-net: Failed to decode incoming command: transport shutdown"
"consul-consul-server-0","    2023/02/14 07:47:02 [INFO]  raft: Node at 10.239.113.198:8300 [Candidate] entering Candidate state in term 1618"
"consul-consul-server-0","    2023/02/14 07:47:02 [WARN]  raft: Heartbeat timeout from ""10.239.112.134:8300"" reached, starting election"
"consul-consul-server-1","    2023/02/14 07:47:01 [INFO]  raft: aborting pipeline replication to peer {Nonvoter 795d24ab-3d36-05dd-046c-393be5c94ed3 10.239.114.135:8300}"
"consul-consul-server-1","    2023/02/14 07:47:01 [INFO]  raft: aborting pipeline replication to peer {Voter 4c238024-cf5f-cff9-0e6f-028eba511645 10.239.113.198:8300}"
"consul-consul-server-1","    2023/02/14 07:47:01 [INFO] manager: shutting down"
"consul-consul-server-1","    2023/02/14 07:47:01 [WARN] serf: Shutdown without a Leave"
"consul-consul-server-1","    2023/02/14 07:47:01 [WARN] serf: Shutdown without a Leave"
"consul-consul-server-1","    2023/02/14 07:47:01 [INFO] consul: shutting down server"
"consul-consul-server-1","    2023/02/14 07:47:01 [INFO] agent: Requesting shutdown"
"consul-consul-server-1","    2023/02/14 07:47:01 [INFO] agent: Graceful shutdown disabled. Exiting"
"consul-consul-server-1","    2023/02/14 07:47:01 [INFO] agent: Caught signal:  terminated"
"consul-consul-server-1","    2023/02/14 07:47:00 [INFO]  raft: pipelining replication to peer {Nonvoter 795d24ab-3d36-05dd-046c-393be5c94ed3 10.239.114.135:8300}"
"consul-consul-server-2","    2023/02/14 07:46:59 [INFO] agent: Synced node info"
"consul-consul-server-1","    2023/02/14 07:46:59 [WARN]  raft: AppendEntries to {Nonvoter 795d24ab-3d36-05dd-046c-393be5c94ed3 10.239.114.135:8300} rejected, sending older logs (next: 25953103)"
"consul-consul-server-2","    2023/02/14 07:46:59 [WARN]  raft: Failed to get previous log: 25953359 log not found (last: 25953102)"
"consul-consul-server-1","    2023/02/14 07:46:59 [INFO] consul: member 'consul-consul-server-2' joined, marking health alive"
"consul-consul-server-1","    2023/02/14 07:46:59 [INFO]  raft: Added peer 795d24ab-3d36-05dd-046c-393be5c94ed3, starting replication"
"consul-consul-server-1","    2023/02/14 07:46:59 [INFO]  raft: Updating configuration with AddNonvoter (795d24ab-3d36-05dd-046c-393be5c94ed3, 10.239.114.135:8300) to [{Suffrage:Voter ID:e6758c6f-e4ca-cb09-62a6-ff6a209747f8 Address:10.239.112.134:8300} {Suffrage:Voter ID:4c238024-cf5f-cff9-0e6f-028eba511645 Address:10.239.113.198:8300} {Suffrage:Nonvoter ID:795d24ab-3d36-05dd-046c-393be5c94ed3 Address:10.239.114.135:8300}]"
"consul-consul-server-1","    2023/02/14 07:46:59 [INFO] consul: removed server with duplicate ID: 795d24ab-3d36-05dd-046c-393be5c94ed3"
"consul-consul-server-2","    2023/02/14 07:46:59 [INFO] agent: Join LAN completed. Synced with 3 initial agents"
"consul-consul-server-2","    2023/02/14 07:46:59 [INFO] agent: (LAN) joined: 3"
"consul-consul-server-1","    2023/02/14 07:46:59 [INFO]  raft: Removed peer 795d24ab-3d36-05dd-046c-393be5c94ed3, stopping replication after 25953358"
"consul-consul-server-1","    2023/02/14 07:46:59 [INFO]  raft: Updating configuration with RemoveServer (795d24ab-3d36-05dd-046c-393be5c94ed3, ) to [{Suffrage:Voter ID:e6758c6f-e4ca-cb09-62a6-ff6a209747f8 Address:10.239.112.134:8300} {Suffrage:Voter ID:4c238024-cf5f-cff9-0e6f-028eba511645 Address:10.239.113.198:8300}]"
"consul-consul-server-1","    2023/02/14 07:46:59 [INFO] consul: Adding LAN server consul-consul-server-2 (Addr: tcp/10.239.114.135:8300) (DC: <redacted>)"
"consul-consul-server-1","    2023/02/14 07:46:59 [INFO] serf: EventMemberJoin: consul-consul-server-2 10.239.114.135"
"consul-consul-server-0","    2023/02/14 07:46:59 [INFO] consul: Adding LAN server consul-consul-server-2 (Addr: tcp/10.239.114.135:8300) (DC: <redacted>)"
"consul-consul-server-0","    2023/02/14 07:46:59 [INFO] serf: EventMemberJoin: consul-consul-server-2 10.239.114.135"
"consul-consul-server-2","    2023/02/14 07:46:59 [INFO] consul: Adding LAN server consul-consul-server-1 (Addr: tcp/10.239.112.134:8300) (DC: <redacted>)"
"consul-consul-server-2","    2023/02/14 07:46:59 [INFO] consul: Adding LAN server consul-consul-server-0 (Addr: tcp/10.239.113.198:8300) (DC: <redacted>)"
"consul-consul-server-2","    2023/02/14 07:46:59 [INFO] serf: EventMemberJoin: consul-consul-server-1 10.239.112.134"
"consul-consul-server-2","    2023/02/14 07:46:59 [INFO] agent: (LAN) joining: [consul-consul-server-0.consul-consul-server.consul.svc:8301 consul-consul-server-1.consul-consul-server.consul.svc:8301 consul-consul-server-2.consul-consul-server.consul.svc:8301]"
"consul-consul-server-2","    2023/02/14 07:46:59 [INFO] agent: Joining LAN cluster..."
"consul-consul-server-2","    2023/02/14 07:46:59 [INFO] agent: Retry join LAN is supported for: aliyun aws azure digitalocean gce k8s mdns os packet scaleway softlayer triton vsphere"
"consul-consul-server-2","==> Consul agent running!"

I do need to begin by pointing out that Consul 1.5 is an unreasonably ancient version to still be running these days. That aside…

The problem here is that consul-server-1 was taken down at precisely the point where it was critical to maintaining quorum.

Raft incorporates various features to try to keep the cluster healthy. Sometimes these features help… and at other times they backfire horribly, creating the outage they were intended to prevent.

In this case the feature at issue is “server stabilization time”, which says that after a server joins the cluster, it must spend some time healthy as a non-voter, i.e. not part of the quorum, before it can be promoted to a voting member. That window is the ServerStabilizationTime = 10s in the autopilot config you posted.
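
If you ever want to tune that window, the operator CLI exposes it directly; a sketch (flag name as used by the set-config subcommand):

# Adjust how long a newly-joined server must stay healthy as a non-voter
# before autopilot promotes it to a voter (default 10s). A shorter window
# promotes rejoining servers sooner, at the cost of letting a flapping
# server back into the quorum more quickly.
consul operator autopilot set-config -server-stabilization-time=5s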

Here is the timeline:

  • Initial state: Cluster size 3, quorum requires 2 of 3 servers
  • consul-server-2 down: Quorum is maintained by consul-server-1 and consul-server-0 still being up
  • consul-server-2 up, with a new IP address:
    • The current leader observes a new server with a known ID but a changed address, and removes the old-address instance of consul-server-2 from the cluster. The cluster size is now 2, and quorum requires both of those servers to stay up.
    • The current leader re-adds a new version of consul-server-2 to the cluster, but for now, as a non-voter. The voting cluster size is still only 2.
  • After 10 seconds, the non-voter would become eligible for promotion to a voter, but before that can happen…
  • consul-server-1 is taken down for maintenance! Quorum is now broken: only 1 of the 2 voting members is available (see the sketch below).
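
To make the quorum arithmetic concrete: quorum = floor(voters / 2) + 1, so 3 voters tolerate one failure but 2 voters tolerate none. While consul-server-2 was parked in its stabilization window, consul operator raft list-peers would have shown it as a non-voter; here is a sketch of that output, reconstructed from the IDs and addresses in your logs rather than captured live:

# List the raft peers and their voting status (output illustrative)
consul operator raft list-peers

Node                    ID                                    Address              State     Voter  RaftProtocol
consul-consul-server-1  e6758c6f-e4ca-cb09-62a6-ff6a209747f8  10.239.112.134:8300  leader    true   3
consul-consul-server-0  4c238024-cf5f-cff9-0e6f-028eba511645  10.239.113.198:8300  follower  true   3
consul-consul-server-2  795d24ab-3d36-05dd-046c-393be5c94ed3  10.239.114.135:8300  follower  false  3

Killing either voter in that state drops you to 1 of 2, which is below quorum, and a lone voter plus a non-voter cannot elect a leader.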

This is excellent, thank you!

We’ve already scaled our Consul cluster up to 5 servers to reduce the probability of this happening again: with 5 voters, quorum is 3, so the cluster can tolerate one server sitting in the stabilization window as a non-voter while another is down for maintenance.

We are using the stock Helm chart (hashicorp/consul-k8s) with only a couple of tweaks:

{
  "performance": {
    "raft_multiplier": 1
  },
  "disable_update_check": true,
  "telemetry": {
    "prometheus_retention_time": "20s"
  }
}
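
For completeness, here is roughly how we feed that JSON in, via the chart's server.extraConfig value; a minimal sketch of our values.yaml (field names per the consul-helm chart), with the replica count reflecting the bump to 5:

server:
  replicas: 5
  bootstrapExpect: 5
  # extraConfig is passed to the server agents as raw JSON configuration
  extraConfig: |
    {
      "performance": { "raft_multiplier": 1 },
      "disable_update_check": true,
      "telemetry": { "prometheus_retention_time": "20s" }
    }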

I find it quite surprising that Consul + autopilot in its default configuration (as deployed by the Helm chart) is at risk of this sort of issue.

Does the situation improve in future versions of Consul?

Cross-posted to GitHub as hashicorp/consul-k8s issue #1929.