3-server cluster becomes unstable when follower recovers from temporary network outage?

I’m running into a problem that seems to occur independently with both Consul (1.13.1 tested so far) and Nomad (1.3.5 tested so far).

I have a 3-server cluster where, if either of the non-leader nodes suffers a temporary network outage, the cluster remains fine during the outage and afterward, provided the outage is fairly short (e.g. a few seconds). However, if the outage lasts slightly longer (e.g. > 30 seconds), then as soon as connectivity resumes the cluster immediately goes into a cycle where the leader continuously loses its leadership, a new leader is elected (usually the one that was the leader before), then loses its leadership again, and so on. This can continue for an arbitrary amount of time: I’ve not measured it in depth with Consul, but it lasted at least several minutes, and with Nomad on one occasion the cycle continued for over an hour before it stabilized.

This is a bare metal cluster with 3 server nodes (each in a different DC in the same city), each acting as both a Consul and a Nomad server. Consul is used only for KV (referenced in Nomad task templates), for service discovery within Nomad templates (i.e. consul-template), and for health-checking of Nomad tasks. No DNS or Connect. Consul runs via systemd, and Nomad is configured with knowledge of its own server IPs so that it’s not reliant on Consul to discover itself. (Both Nomad and Consul are provisioned via the community Ansible roles.)

All nodes communicate via a Nebula overlay network (also separately managed via systemd), so the config is fairly simple, with no TLS either.

Basically, it looks like I can induce this behaviour simply by stopping the nebula service on either one of the non-leader servers, waiting 30 seconds or so, and then starting it again.
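For reference, the repro is just a sketch like this (assuming the overlay is a systemd unit named nebula, as it is here; the exact unit name may differ on other setups):

# on one of the non-leader servers
sudo systemctl stop nebula      # sever the overlay network
sleep 45                        # stay isolated for longer than ~30 seconds
sudo systemctl start nebula     # restore connectivity; the leadership flapping starts here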

The log below is an extract from the leader (landra, 192.168.100.50), starting from the point where connectivity to the non-leader (solaria, 192.168.100.30) is restored.

2022-10-12T20:03:57.068-0400 [WARN]  agent.server.raft: rejecting vote request since node is not a voter: from=192.168.100.30:8300
2022-10-12T20:03:57.582-0400 [INFO]  agent.server.serf.lan: serf: EventMemberJoin: solaria 192.168.100.30
2022-10-12T20:03:57.583-0400 [INFO]  agent.server: Adding LAN server: server="solaria (Addr: tcp/192.168.100.30:8300) (DC: mtl)"
2022-10-12T20:03:57.583-0400 [INFO]  agent.server.raft: updating configuration: command=AddNonvoter server-id=c822f3cc-d9fa-6f5d-3053-b9f592c31c81 server-addr=192.168.100.30:8300 servers="[{Suffrage:Voter ID:b77a3f6a-0636-7d24-9e4b-c80860619b20 Address:192.168.100.70:8300} {Suffrage:Voter ID:9270add6-1a06-578b-27b5-69c05bf1b2c9 Address:192.168.100.50:8300} {Suffrage:Nonvoter ID:c822f3cc-d9fa-6f5d-3053-b9f592c31c81 Address:192.168.100.30:8300}]"
2022-10-12T20:03:57.584-0400 [INFO]  agent.server.raft: added peer, starting replication: peer=c822f3cc-d9fa-6f5d-3053-b9f592c31c81
2022-10-12T20:03:57.587-0400 [ERROR] agent.server.raft: peer has newer term, stopping replication: peer="{Nonvoter c822f3cc-d9fa-6f5d-3053-b9f592c31c81 192.168.100.30:8300}"
2022-10-12T20:03:57.587-0400 [INFO]  agent.server.raft: entering follower state: follower="Node at 192.168.100.50:8300 [Follower]" leader-address= leader-id=
2022-10-12T20:03:57.587-0400 [ERROR] agent.server.autopilot: failed to add raft non-voting peer: id=c822f3cc-d9fa-6f5d-3053-b9f592c31c81 address=192.168.100.30:8300 error="leadership lost while committing log"
2022-10-12T20:03:57.587-0400 [ERROR] agent.server: failed to reconcile member: member="{solaria 192.168.100.30 8301 map[acls:0 build:1.13.1:c6d0f9ec dc:mtl expect:3 ft_fs:1 ft_si:1 id:c822f3cc-d9fa-6f5d-3053-b9f592c31c81 port:8300 raft_vsn:3 role:consul segment: vsn:2 vsn_max:3 vsn_min:2 wan_join_port:8302] alive 1 5 2 2 5 4}" partition=default error="leadership lost while committing log"
2022-10-12T20:03:57.587-0400 [INFO]  agent.leader: stopped routine: routine="federation state pruning"
2022-10-12T20:03:57.587-0400 [INFO]  agent.server.raft: aborting pipeline replication: peer="{Voter b77a3f6a-0636-7d24-9e4b-c80860619b20 192.168.100.70:8300}"
2022-10-12T20:03:57.587-0400 [INFO]  agent.server.autopilot: reconciliation now disabled
2022-10-12T20:03:57.587-0400 [INFO]  agent.server: cluster leadership lost
2022-10-12T20:03:57.782-0400 [WARN]  agent.server.raft: rejecting vote request since node is not a voter: from=192.168.100.30:8300
2022-10-12T20:03:58.844-0400 [WARN]  agent.server.raft: heartbeat timeout reached, starting election: last-leader-addr= last-leader-id=
2022-10-12T20:03:58.844-0400 [INFO]  agent.server.raft: entering candidate state: node="Node at 192.168.100.50:8300 [Candidate]" term=4305
2022-10-12T20:03:58.852-0400 [INFO]  agent.server.raft: election won: tally=2
2022-10-12T20:03:58.852-0400 [INFO]  agent.server.raft: entering leader state: leader="Node at 192.168.100.50:8300 [Leader]"
2022-10-12T20:03:58.852-0400 [INFO]  agent.server.raft: added peer, starting replication: peer=b77a3f6a-0636-7d24-9e4b-c80860619b20
2022-10-12T20:03:58.852-0400 [INFO]  agent.server.raft: added peer, starting replication: peer=c822f3cc-d9fa-6f5d-3053-b9f592c31c81
2022-10-12T20:03:58.853-0400 [INFO]  agent.server: cluster leadership acquired
2022-10-12T20:03:58.853-0400 [INFO]  agent.server: New leader elected: payload=landra
2022-10-12T20:03:58.854-0400 [ERROR] agent.server.raft: peer has newer term, stopping replication: peer="{Nonvoter c822f3cc-d9fa-6f5d-3053-b9f592c31c81 192.168.100.30:8300}"
2022-10-12T20:03:58.854-0400 [INFO]  agent.server.raft: entering follower state: follower="Node at 192.168.100.50:8300 [Follower]" leader-address= leader-id=
2022-10-12T20:03:58.854-0400 [ERROR] agent.server: failed to wait for barrier: error="leadership lost while committing log"
2022-10-12T20:03:58.854-0400 [INFO]  agent.server: cluster leadership lost
2022-10-12T20:03:58.914-0400 [WARN]  agent.server.raft: rejecting vote request since node is not a voter: from=192.168.100.30:8300
2022-10-12T20:03:59.079-0400 [WARN]  agent.server.coordinate: Batch update failed: error="node is not the leader"
2022-10-12T20:04:00.269-0400 [WARN]  agent.server.raft: rejecting vote request since our last term is greater: candidate=192.168.100.70:8300 last-term=4305 last-candidate-term=4304
2022-10-12T20:04:00.484-0400 [WARN]  agent.server.raft: heartbeat timeout reached, starting election: last-leader-addr= last-leader-id=
2022-10-12T20:04:00.484-0400 [INFO]  agent.server.raft: entering candidate state: node="Node at 192.168.100.50:8300 [Candidate]" term=4307
2022-10-12T20:04:00.491-0400 [INFO]  agent.server.raft: election won: tally=2
2022-10-12T20:04:00.491-0400 [INFO]  agent.server.raft: entering leader state: leader="Node at 192.168.100.50:8300 [Leader]"
2022-10-12T20:04:00.491-0400 [INFO]  agent.server.raft: added peer, starting replication: peer=b77a3f6a-0636-7d24-9e4b-c80860619b20
2022-10-12T20:04:00.491-0400 [INFO]  agent.server.raft: added peer, starting replication: peer=c822f3cc-d9fa-6f5d-3053-b9f592c31c81
2022-10-12T20:04:00.491-0400 [INFO]  agent.server: New leader elected: payload=landra
2022-10-12T20:04:00.491-0400 [INFO]  agent.server: cluster leadership acquired
2022-10-12T20:04:00.493-0400 [ERROR] agent.server.raft: peer has newer term, stopping replication: peer="{Nonvoter c822f3cc-d9fa-6f5d-3053-b9f592c31c81 192.168.100.30:8300}"
2022-10-12T20:04:00.493-0400 [INFO]  agent.server.raft: entering follower state: follower="Node at 192.168.100.50:8300 [Follower]" leader-address= leader-id=
2022-10-12T20:04:00.493-0400 [ERROR] agent.server: failed to wait for barrier: error="leadership lost while committing log"
2022-10-12T20:04:00.493-0400 [INFO]  agent.server: cluster leadership lost
2022-10-12T20:04:00.493-0400 [WARN]  agent.server.raft: appendEntries rejected, sending older logs: peer="{Voter b77a3f6a-0636-7d24-9e4b-c80860619b20 192.168.100.70:8300}" next=7499660
2022-10-12T20:04:00.661-0400 [WARN]  agent.server.raft: rejecting vote request since node is not a voter: from=192.168.100.30:8300
2022-10-12T20:04:01.588-0400 [WARN]  agent.server.raft: rejecting vote request since our last term is greater: candidate=192.168.100.70:8300 last-term=4307 last-candidate-term=4304
2022-10-12T20:04:01.726-0400 [WARN]  agent.server.raft: rejecting vote request since node is not a voter: from=192.168.100.30:8300
2022-10-12T20:04:02.190-0400 [WARN]  agent.server.raft: heartbeat timeout reached, starting election: last-leader-addr= last-leader-id=
2022-10-12T20:04:02.190-0400 [INFO]  agent.server.raft: entering candidate state: node="Node at 192.168.100.50:8300 [Candidate]" term=4309
2022-10-12T20:04:02.199-0400 [INFO]  agent.server.raft: election won: tally=2
2022-10-12T20:04:02.199-0400 [INFO]  agent.server.raft: entering leader state: leader="Node at 192.168.100.50:8300 [Leader]"
2022-10-12T20:04:02.199-0400 [INFO]  agent.server.raft: added peer, starting replication: peer=b77a3f6a-0636-7d24-9e4b-c80860619b20
2022-10-12T20:04:02.199-0400 [INFO]  agent.server.raft: added peer, starting replication: peer=c822f3cc-d9fa-6f5d-3053-b9f592c31c81
2022-10-12T20:04:02.199-0400 [INFO]  agent.server: cluster leadership acquired
2022-10-12T20:04:02.199-0400 [INFO]  agent.server: New leader elected: payload=landra
2022-10-12T20:04:02.201-0400 [ERROR] agent.server.raft: peer has newer term, stopping replication: peer="{Nonvoter c822f3cc-d9fa-6f5d-3053-b9f592c31c81 192.168.100.30:8300}"
2022-10-12T20:04:02.201-0400 [INFO]  agent.server.raft: entering follower state: follower="Node at 192.168.100.50:8300 [Follower]" leader-address= leader-id=
2022-10-12T20:04:02.201-0400 [ERROR] agent.server: failed to wait for barrier: error="leadership lost while committing log"
2022-10-12T20:04:02.201-0400 [INFO]  agent.server: cluster leadership lost

Example config:

{
    "addresses": {
        "dns": "192.168.100.50",
        "grpc": "192.168.100.50",
        "http": "192.168.100.50",
        "https": "192.168.100.50"
    },
    "advertise_addr": "192.168.100.50",
    "advertise_addr_wan": "192.168.100.50",
    "bind_addr": "192.168.100.50",
    "bootstrap": false,
    "bootstrap_expect": 3,
    "client_addr": "192.168.100.50",
    "data_dir": "/opt/consul",
    "datacenter": "mtl",
    "disable_update_check": false,
    "domain": "consul",
    "enable_local_script_checks": false,
    "enable_script_checks": false,
    "encrypt": "...",
    "encrypt_verify_incoming": true,
    "encrypt_verify_outgoing": true,
    "log_file": "/var/log/consul/consul.log",
    "log_level": "INFO",
    "log_rotate_bytes": 0,
    "log_rotate_duration": "24h",
    "log_rotate_max_files": 0,
    "performance": {
        "leave_drain_time": "5s",
        "raft_multiplier": 1,
        "rpc_hold_timeout": "7s"
    },
    "ports": {
        "dns": 8600,
        "grpc": -1,
        "http": 8500,
        "https": -1,
        "serf_lan": 8301,
        "serf_wan": 8302,
        "server": 8300
    },
    "raft_protocol": 3,
    "retry_interval": "30s",
    "retry_interval_wan": "30s",
    "retry_join": [
        "192.168.100.30",
        "192.168.100.50",
        "192.168.100.70"
    ],
    "retry_max": 0,
    "retry_max_wan": 0,
    "server": true,
    "translate_wan_addrs": false,
    "ui_config": {
        "enabled": true
    }
}

I’ve not tested a bigger raft_multiplier with Consul, but I have with Nomad, and the same leadership instability cycle still results, just in slower motion.

What am I missing? Any insight gratefully appreciated. Logs etc. from the Nomad side can be found in my post in the Nomad category from about 2 weeks ago; I’ve made no headway in figuring this out since then.

It may be useful to also show the logs from the perspective of solaria (the non-leader in this example), from just prior to connectivity being restored:

2022-10-12T17:03:50.078-0700 [WARN]  agent: error getting server health from server: server=landra error="context deadline exceeded"
2022-10-12T17:03:50.078-0700 [WARN]  agent: error getting server health from server: server=solaria error="context deadline exceeded"
2022-10-12T17:03:50.078-0700 [WARN]  agent: error getting server health from server: server=vitalia error="context deadline exceeded"
2022-10-12T17:03:50.161-0700 [WARN]  agent.server.raft: Election timeout reached, restarting election
2022-10-12T17:03:50.161-0700 [INFO]  agent.server.raft: entering candidate state: node="Node at 192.168.100.30:8300 [Candidate]" term=4337
2022-10-12T17:03:50.170-0700 [ERROR] agent.server.raft: failed to make requestVote RPC: target="{Voter b77a3f6a-0636-7d24-9e4b-c80860619b20 192.168.100.70:8300}" error="dial tcp 192.168.100.30:0->192.168.100.70:8300: bind: cannot assign requested address"
2022-10-12T17:03:50.170-0700 [ERROR] agent.server.raft: failed to make requestVote RPC: target="{Voter 9270add6-1a06-578b-27b5-69c05bf1b2c9 192.168.100.50:8300}" error="dial tcp 192.168.100.30:0->192.168.100.50:8300: bind: cannot assign requested address"
2022-10-12T17:03:51.078-0700 [WARN]  agent: error getting server health from server: server=vitalia error="rpc error getting client: failed to get conn: dial tcp 192.168.100.30:0->192.168.100.70:8300: bind: cannot assign requested address"
2022-10-12T17:03:51.078-0700 [WARN]  agent: error getting server health from server: server=landra error="rpc error getting client: failed to get conn: dial tcp 192.168.100.30:0->192.168.100.50:8300: bind: cannot assign requested address"
2022-10-12T17:03:51.078-0700 [WARN]  agent: error getting server health from server: server=solaria error="rpc error getting client: failed to get conn: dial tcp 192.168.100.30:0->192.168.100.30:8300: bind: cannot assign requested address"
2022-10-12T17:03:51.633-0700 [WARN]  agent.server.raft: Election timeout reached, restarting election
2022-10-12T17:03:51.633-0700 [INFO]  agent.server.raft: entering candidate state: node="Node at 192.168.100.30:8300 [Candidate]" term=4338
2022-10-12T17:03:51.642-0700 [ERROR] agent.server.raft: failed to make requestVote RPC: target="{Voter 9270add6-1a06-578b-27b5-69c05bf1b2c9 192.168.100.50:8300}" error="dial tcp 192.168.100.30:0->192.168.100.50:8300: bind: cannot assign requested address"
2022-10-12T17:03:51.642-0700 [ERROR] agent.server.raft: failed to make requestVote RPC: target="{Voter b77a3f6a-0636-7d24-9e4b-c80860619b20 192.168.100.70:8300}" error="dial tcp 192.168.100.30:0->192.168.100.70:8300: bind: cannot assign requested address"
2022-10-12T17:03:52.078-0700 [WARN]  agent: error getting server health from server: server=solaria error="context deadline exceeded"
2022-10-12T17:03:52.078-0700 [WARN]  agent: error getting server health from server: server=landra error="context deadline exceeded"
2022-10-12T17:03:52.078-0700 [WARN]  agent: error getting server health from server: server=vitalia error="context deadline exceeded"
2022-10-12T17:03:52.079-0700 [INFO]  agent.server.memberlist.wan: memberlist: Marking landra.mtl as failed, suspect timeout reached (0 peer confirmations)
2022-10-12T17:03:52.079-0700 [INFO]  agent.server.serf.wan: serf: EventMemberFailed: landra.mtl 192.168.100.50
2022-10-12T17:03:52.079-0700 [INFO]  agent.server: Handled event for server in area: event=member-failed server=landra.mtl area=wan
2022-10-12T17:03:52.203-0700 [WARN]  agent: [core]grpc: addrConn.createTransport failed to connect to {mtl-192.168.100.50:8300 landra.mtl <nil> 0 <nil>}. Err: connection error: desc = "transport: Error while dialing dial tcp 192.168.100.30:0->192.168.100.50:8300: bind: cannot assign requested address". Reconnecting...
2022-10-12T17:03:52.203-0700 [WARN]  agent: [core]grpc: addrConn.createTransport failed to connect to {mtl-192.168.100.30:8300 solaria <nil> 0 <nil>}. Err: connection error: desc = "transport: Error while dialing dial tcp 192.168.100.30:0->192.168.100.30:8300: bind: cannot assign requested address". Reconnecting...
2022-10-12T17:03:52.203-0700 [WARN]  agent: [core]grpc: addrConn.createTransport failed to connect to {mtl-192.168.100.70:8300 vitalia.mtl <nil> 0 <nil>}. Err: connection error: desc = "transport: Error while dialing dial tcp 192.168.100.30:0->192.168.100.70:8300: bind: cannot assign requested address". Reconnecting...
2022-10-12T17:03:52.769-0700 [WARN]  agent.server.memberlist.wan: memberlist: Refuting a dead message (from: landra.mtl)
2022-10-12T17:03:53.080-0700 [INFO]  agent.server.serf.wan: serf: EventMemberJoin: landra.mtl 192.168.100.50
2022-10-12T17:03:53.080-0700 [INFO]  agent.server: Handled event for server in area: event=member-join server=landra.mtl area=wan
2022-10-12T17:03:53.534-0700 [ERROR] agent: Coordinate update error: error="No cluster leader"
2022-10-12T17:03:53.644-0700 [WARN]  agent.server.raft: Election timeout reached, restarting election
2022-10-12T17:03:53.644-0700 [INFO]  agent.server.raft: entering candidate state: node="Node at 192.168.100.30:8300 [Candidate]" term=4339
2022-10-12T17:03:55.084-0700 [INFO]  agent.server.memberlist.lan: memberlist: Suspect landra has failed, no acks received
2022-10-12T17:03:55.210-0700 [WARN]  agent.server.raft: Election timeout reached, restarting election
2022-10-12T17:03:55.210-0700 [INFO]  agent.server.raft: entering candidate state: node="Node at 192.168.100.30:8300 [Candidate]" term=4340
2022-10-12T17:03:55.854-0700 [ERROR] agent.server.raft: failed to decode incoming command: error="read tcp 192.168.100.30:8300->192.168.100.50:60051: read: connection reset by peer"
2022-10-12T17:03:56.068-0700 [INFO]  agent.server.memberlist.lan: memberlist: Marking foghorn as failed, suspect timeout reached (0 peer confirmations)
2022-10-12T17:03:56.068-0700 [INFO]  agent.server.serf.lan: serf: EventMemberFailed: foghorn 192.168.100.120
2022-10-12T17:03:57.060-0700 [WARN]  agent.server.raft: Election timeout reached, restarting election
2022-10-12T17:03:57.060-0700 [INFO]  agent.server.raft: entering candidate state: node="Node at 192.168.100.30:8300 [Candidate]" term=4341
2022-10-12T17:03:57.081-0700 [INFO]  agent.server.memberlist.wan: memberlist: Suspect landra.mtl has failed, no acks received
2022-10-12T17:03:57.370-0700 [INFO]  agent.server.serf.lan: serf: EventMemberJoin: foghorn 192.168.100.120
2022-10-12T17:03:57.583-0700 [INFO]  agent.server.serf.lan: serf: EventMemberJoin: flynn 192.168.100.110
2022-10-12T17:03:57.762-0700 [INFO]  agent.server.serf.lan: serf: EventMemberJoin: hyperion 192.168.100.20
2022-10-12T17:03:57.763-0700 [INFO]  agent.server.serf.lan: serf: EventMemberJoin: linkstorm 192.168.100.140
2022-10-12T17:03:58.883-0700 [INFO]  agent.server: New leader elected: payload=landra
2022-10-12T17:03:58.907-0700 [WARN]  agent.server.raft: Election timeout reached, restarting election
2022-10-12T17:03:58.907-0700 [INFO]  agent.server.raft: entering candidate state: node="Node at 192.168.100.30:8300 [Candidate]" term=4342
2022-10-12T17:04:00.650-0700 [WARN]  agent.server.raft: Election timeout reached, restarting election
2022-10-12T17:04:00.650-0700 [INFO]  agent.server.raft: entering candidate state: node="Node at 192.168.100.30:8300 [Candidate]" term=4343
2022-10-12T17:04:00.982-0700 [INFO]  agent.server: New leader elected: payload=landra
2022-10-12T17:04:01.715-0700 [WARN]  agent.server.raft: Election timeout reached, restarting election
2022-10-12T17:04:01.716-0700 [INFO]  agent.server.raft: entering candidate state: node="Node at 192.168.100.30:8300 [Candidate]" term=4344
2022-10-12T17:04:02.282-0700 [INFO]  agent.server: New leader elected: payload=landra
2022-10-12T17:04:02.689-0700 [WARN]  agent: Check is now critical: check=_nomad-check-08628afa1d9e7179806361b9ff303ae1cb51550c
2022-10-12T17:04:02.823-0700 [WARN]  agent.server.raft: Election timeout reached, restarting election
2022-10-12T17:04:02.823-0700 [INFO]  agent.server.raft: entering candidate state: node="Node at 192.168.100.30:8300 [Candidate]" term=4345
2022-10-12T17:04:03.857-0700 [INFO]  agent.server: New leader elected: payload=landra
2022-10-12T17:04:04.294-0700 [WARN]  agent.server.raft: Election timeout reached, restarting election
2022-10-12T17:04:04.294-0700 [INFO]  agent.server.raft: entering candidate state: node="Node at 192.168.100.30:8300 [Candidate]" term=4346
2022-10-12T17:04:05.083-0700 [INFO]  agent.server: New leader elected: payload=landra
2022-10-12T17:04:06.292-0700 [WARN]  agent.server.raft: Election timeout reached, restarting election
2022-10-12T17:04:06.292-0700 [INFO]  agent.server.raft: entering candidate state: node="Node at 192.168.100.30:8300 [Candidate]" term=4347
2022-10-12T17:04:07.056-0700 [INFO]  agent.server: New leader elected: payload=landra
2022-10-12T17:04:07.426-0700 [WARN]  agent: Syncing check failed.: check=_nomad-check-08628afa1d9e7179806361b9ff303ae1cb51550c error="No cluster leader"
2022-10-12T17:04:07.426-0700 [ERROR] agent.anti_entropy: failed to sync remote state: error="No cluster leader"
2022-10-12T17:04:08.245-0700 [WARN]  agent.server.raft: Election timeout reached, restarting election
2022-10-12T17:04:08.245-0700 [INFO]  agent.server.raft: entering candidate state: node="Node at 192.168.100.30:8300 [Candidate]" term=4348
2022-10-12T17:04:08.282-0700 [INFO]  agent.server: New leader elected: payload=landra
2022-10-12T17:04:09.327-0700 [WARN]  agent.server.raft: Election timeout reached, restarting election
2022-10-12T17:04:09.327-0700 [INFO]  agent.server.raft: entering candidate state: node="Node at 192.168.100.30:8300 [Candidate]" term=4349
2022-10-12T17:04:09.962-0700 [INFO]  agent.server: New leader elected: payload=landra
2022-10-12T17:04:11.312-0700 [WARN]  agent.server.raft: Election timeout reached, restarting election
2022-10-12T17:04:11.312-0700 [INFO]  agent.server.raft: entering candidate state: node="Node at 192.168.100.30:8300 [Candidate]" term=4350
2022-10-12T17:04:11.338-0700 [INFO]  agent.server: New leader elected: payload=landra
2022-10-12T17:04:12.909-0700 [WARN]  agent.server.raft: Election timeout reached, restarting election
2022-10-12T17:04:12.909-0700 [INFO]  agent.server.raft: entering candidate state: node="Node at 192.168.100.30:8300 [Candidate]" term=4351
2022-10-12T17:04:13.282-0700 [INFO]  agent.server: New leader elected: payload=landra
2022-10-12T17:04:14.306-0700 [WARN]  agent.server.raft: Election timeout reached, restarting election
2022-10-12T17:04:14.306-0700 [INFO]  agent.server.raft: entering candidate state: node="Node at 192.168.100.30:8300 [Candidate]" term=4352
2022-10-12T17:04:14.329-0700 [INFO]  agent.server: New leader elected: payload=landra
2022-10-12T17:04:16.007-0700 [WARN]  agent: Syncing check failed.: check=_nomad-check-08628afa1d9e7179806361b9ff303ae1cb51550c error="No cluster leader"
2022-10-12T17:04:16.007-0700 [ERROR] agent.anti_entropy: failed to sync remote state: error="No cluster leader"
2022-10-12T17:04:16.193-0700 [WARN]  agent.server.raft: Election timeout reached, restarting election
2022-10-12T17:04:16.193-0700 [INFO]  agent.server.raft: entering candidate state: node="Node at 192.168.100.30:8300 [Candidate]" term=4353
2022-10-12T17:04:16.283-0700 [INFO]  agent.server: New leader elected: payload=landra
2022-10-12T17:04:17.101-0700 [INFO]  agent.server.serf.wan: serf: attempting reconnect to vitalia.mtl 192.168.100.70:8302
2022-10-12T17:04:17.437-0700 [INFO]  agent.server.serf.wan: serf: EventMemberJoin: vitalia.mtl 192.168.100.70
2022-10-12T17:04:17.437-0700 [INFO]  agent.server: Handled event for server in area: event=member-join server=vitalia.mtl area=wan
2022-10-12T17:04:17.448-0700 [INFO]  agent.server: New leader elected: payload=landra
2022-10-12T17:04:17.997-0700 [WARN]  agent.server.raft: Election timeout reached, restarting election
2022-10-12T17:04:17.997-0700 [INFO]  agent.server.raft: entering candidate state: node="Node at 192.168.100.30:8300 [Candidate]" term=4354
2022-10-12T17:04:19.282-0700 [INFO]  agent.server: New leader elected: payload=landra
2022-10-12T17:04:19.394-0700 [WARN]  agent.server.raft: Election timeout reached, restarting election
2022-10-12T17:04:19.394-0700 [INFO]  agent.server.raft: entering candidate state: node="Node at 192.168.100.30:8300 [Candidate]" term=4355
2022-10-12T17:04:20.926-0700 [WARN]  agent.server.raft: Election timeout reached, restarting election
2022-10-12T17:04:20.926-0700 [INFO]  agent.server.raft: entering candidate state: node="Node at 192.168.100.30:8300 [Candidate]" term=4356
2022-10-12T17:04:21.056-0700 [INFO]  agent.server: New leader elected: payload=landra
2022-10-12T17:04:22.081-0700 [INFO]  agent.server: New leader elected: payload=landra
2022-10-12T17:04:22.389-0700 [WARN]  agent.server.raft: Election timeout reached, restarting election
2022-10-12T17:04:22.389-0700 [INFO]  agent.server.raft: entering candidate state: node="Node at 192.168.100.30:8300 [Candidate]" term=4357
2022-10-12T17:04:22.427-0700 [WARN]  agent: Check is now critical: check=_nomad-check-08628afa1d9e7179806361b9ff303ae1cb51550c
2022-10-12T17:04:23.361-0700 [INFO]  agent.server: New leader elected: payload=landra
2022-10-12T17:04:23.765-0700 [ERROR] agent: Coordinate update error: error="No cluster leader"

I have an idea what’s going on here: on the agent.server.raft: updating configuration: log line you can see that solaria is being added as a new (non-voting) member of the cluster, at least from the perspective of landra, whilst solaria presumably thinks it was a member of the cluster throughout, since it is still sending out vote requests.

So, the question is how solaria got removed from the Raft cluster membership on landra. My guess is that the “autopilot” feature to “clean up dead servers” is responsible.

This feature is supposed to make life easy for cluster admins by automatically dropping dead servers from the Raft peer set so they don’t have to maintain it manually… but if a “dead” server is removed and then comes back, this is what happens. The feature meant to make your life easier instead shoots you in the foot.
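If you want to confirm that’s what happened, you can check the current autopilot settings and the Raft peer set from any of the servers:

consul operator autopilot get-config   # look for CleanupDeadServers = true
consul operator raft list-peers        # solaria should show up as a non-voter or be missing entirely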

Check Commands: Operator Autopilot | Consul | HashiCorp Developer, and see if changing the cluster config using

consul operator autopilot set-config -cleanup-dead-servers=false

makes this issue no longer occur.

Alternatively you could leave the feature enabled but look to the -min-quorum setting to control when it operates.
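For a 3-server cluster that would look something like this (I haven’t verified this specific value against your flapping scenario):

consul operator autopilot set-config -min-quorum=3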

The last part of the explanation you need to understand the observed behaviour is that, whilst solaria was isolated, it was repeatedly starting elections and having them time out. This has inflated its term number. Because it then communicates with the other nodes using a larger term number, it forces elections to take place even though the cluster already had a quorum.

This is an area where the Raft protocol has prioritised simplicity over stability: any one node (solaria) gets to insist on a new election whenever it likes, immediately deposing the current leader (landra). There’s no way for the existing quorum (landra and the third node) to get together and tell solaria “No, we don’t need an election right now”.
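A quick way to see the term inflation for yourself is the raft section of consul info on each server; the isolated node’s term will have raced ahead of the others:

consul info | grep -A 20 '^raft:'    # compare the term field across the three servers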

Thank you! Disabling cleanup-dead-servers does indeed seem to stop this behaviour from occurring, in both Consul and Nomad, tested separately.

I’ll update the related Nomad thread with this finding too, because this appears to be the default autopilot config, so this type of scenario could happen to almost anyone out of the box if server connectivity is out for just a few seconds too long. Maybe the longer-term approach for autopilot here would be to add a configurable reaping time, so it doesn’t reap quite so quickly?
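For anyone landing here from the Nomad side, it is the same knob via Nomad’s own CLI (this is what I ran):

nomad operator autopilot set-config -cleanup-dead-servers=false
nomad operator autopilot get-config    # confirm the change took effect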

As a sidebar, I would feel a bit suspicious of using the -min-quorum setting. It was 0 in my case according to operator autopilot get-config, which seems to be the default. Without looking at the code this is just speculation, but I guess what is happening is something like: landra’s partition reaps because it still has a leader, whereas solaria’s partition doesn’t (when perhaps it should also be ‘reaping’ the other 2 as ‘dead’ at the same time). I wonder if anybody knows what the difference between -min-quorum=0 and -min-quorum=1 behaviour would be here? Presumably in my case -min-quorum=3 would be the right value, but if the cluster were ever increased to 5 servers, it seems this type of issue could recur unless that setting were revisited at the same time.

I happened to notice that today’s releases on the Nomad release track mention a fix for the following issue: Unstable leadership when running server is demoted/removed without its participation · Issue #524 · hashicorp/raft · GitHub

I haven’t confirmed it yet, but it looks like this was the underlying cause when autopilot dead-server removal is active, and that it should thus now be remedied in the new releases of Consul, Nomad and Vault.

Hi @pcc,

The raft fix you found will indeed fix the issue you were facing. I strongly recommend leaving the cleanup-dead-servers config at its default (true) once you are on the next Consul release containing the fix.

The documentation also suggests leaving this feature turned on (it is on by default), unless you have an external mechanism to clean up dead servers.

ref: Autopilot | Consul | HashiCorp Developer
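Once you are on a release containing that raft fix, turning it back on is the same one-liner you used to disable it:

consul operator autopilot set-config -cleanup-dead-servers=true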