Consul failing to commit leader election results

After a brief network interruption, my 3-node Consul cluster is down. The logs show a leader election happening and a node being elected, followed by an error about a peer having a newer term or no leader being present.

Here are the logs from one of the servers; they are similar on each one.

Nov 18 16:15:12 n1 consul[2869524]: 2022-11-18T16:15:12.124Z [WARN]  agent.server.raft: heartbeat timeout reached, starting election: last-leader-addr=192.168.2.102:8300 last-leader-id=fc373d0b-3aed-44d4-ca0a-fe7f12d0118b
Nov 18 16:15:12 n1 consul[2869524]: 2022-11-18T16:15:12.124Z [INFO]  agent.server.raft: entering candidate state: node="Node at 192.168.2.101:8300 [Candidate]" term=11024
Nov 18 16:15:12 n1 consul[2869524]: 2022-11-18T16:15:12.599Z [INFO]  agent.server.raft: duplicate requestVote for same term: term=11024
Nov 18 16:15:12 n1 consul[2869524]: 2022-11-18T16:15:12.749Z [WARN]  agent.server.raft: rejecting vote request since node is not a voter: from=192.168.2.30:8300
Nov 18 16:15:13 n1 consul[2869524]: 2022-11-18T16:15:13.692Z [WARN]  agent.server.raft: Election timeout reached, restarting election
Nov 18 16:15:13 n1 consul[2869524]: 2022-11-18T16:15:13.692Z [INFO]  agent.server.raft: entering candidate state: node="Node at 192.168.2.101:8300 [Candidate]" term=11025
Nov 18 16:15:13 n1 consul[2869524]: 2022-11-18T16:15:13.816Z [WARN]  agent.server.raft: rejecting vote request since node is not a voter: from=192.168.2.30:8300
Nov 18 16:15:13 n1 consul[2869524]: 2022-11-18T16:15:13.916Z [INFO]  agent.server.raft: election won: tally=2
Nov 18 16:15:13 n1 consul[2869524]: 2022-11-18T16:15:13.916Z [INFO]  agent.server.raft: entering leader state: leader="Node at 192.168.2.101:8300 [Leader]"
Nov 18 16:15:13 n1 consul[2869524]: 2022-11-18T16:15:13.916Z [INFO]  agent.server.raft: added peer, starting replication: peer=fc373d0b-3aed-44d4-ca0a-fe7f12d0118b
Nov 18 16:15:13 n1 consul[2869524]: 2022-11-18T16:15:13.916Z [INFO]  agent.server.raft: added peer, starting replication: peer=fa4f7537-f206-1d53-a204-54fe44621258
Nov 18 16:15:13 n1 consul[2869524]: 2022-11-18T16:15:13.917Z [INFO]  agent.server: cluster leadership acquired
Nov 18 16:15:13 n1 consul[2869524]: 2022-11-18T16:15:13.917Z [INFO]  agent.server: New leader elected: payload=n1
Nov 18 16:15:13 n1 consul[2869524]: 2022-11-18T16:15:13.917Z [INFO]  agent.server.raft: pipelining replication: peer="{Voter fc373d0b-3aed-44d4-ca0a-fe7f12d0118b 192.168.2.102:8300}"
Nov 18 16:15:13 n1 consul[2869524]: 2022-11-18T16:15:13.918Z [ERROR] agent.server.raft: peer has newer term, stopping replication: peer="{Nonvoter fa4f7537-f206-1d53-a204-54fe44621258 192.168.2.30:8300}"
Nov 18 16:15:13 n1 consul[2869524]: 2022-11-18T16:15:13.949Z [INFO]  agent.server.raft: aborting pipeline replication: peer="{Voter fc373d0b-3aed-44d4-ca0a-fe7f12d0118b 192.168.2.102:8300}"
Nov 18 16:15:13 n1 consul[2869524]: 2022-11-18T16:15:13.950Z [INFO]  agent.server.raft: entering follower state: follower="Node at 192.168.2.101:8300 [Follower]" leader-address= leader-id=
Nov 18 16:15:13 n1 consul[2869524]: 2022-11-18T16:15:13.950Z [ERROR] agent.server: failed to wait for barrier: error="node is not the leader"
Nov 18 16:15:13 n1 consul[2869524]: 2022-11-18T16:15:13.950Z [INFO]  agent.server: cluster leadership lost
Nov 18 16:15:15 n1 consul[2869524]: 2022-11-18T16:15:15.218Z [WARN]  agent.server.raft: heartbeat timeout reached, starting election: last-leader-addr= last-leader-id=
Nov 18 16:15:15 n1 consul[2869524]: 2022-11-18T16:15:15.218Z [INFO]  agent.server.raft: entering candidate state: node="Node at 192.168.2.101:8300 [Candidate]" term=11026
Nov 18 16:15:15 n1 consul[2869524]: 2022-11-18T16:15:15.399Z [INFO]  agent.server.raft: election won: tally=2
Nov 18 16:15:15 n1 consul[2869524]: 2022-11-18T16:15:15.399Z [INFO]  agent.server.raft: entering leader state: leader="Node at 192.168.2.101:8300 [Leader]"
Nov 18 16:15:15 n1 consul[2869524]: 2022-11-18T16:15:15.399Z [INFO]  agent.server.raft: added peer, starting replication: peer=fc373d0b-3aed-44d4-ca0a-fe7f12d0118b
Nov 18 16:15:15 n1 consul[2869524]: 2022-11-18T16:15:15.399Z [INFO]  agent.server.raft: added peer, starting replication: peer=fa4f7537-f206-1d53-a204-54fe44621258
Nov 18 16:15:15 n1 consul[2869524]: 2022-11-18T16:15:15.400Z [INFO]  agent.server: cluster leadership acquired
Nov 18 16:15:15 n1 consul[2869524]: 2022-11-18T16:15:15.400Z [INFO]  agent.server: New leader elected: payload=n1
Nov 18 16:15:15 n1 consul[2869524]: 2022-11-18T16:15:15.401Z [INFO]  agent.server.raft: pipelining replication: peer="{Voter fc373d0b-3aed-44d4-ca0a-fe7f12d0118b 192.168.2.102:8300}"
Nov 18 16:15:15 n1 consul[2869524]: 2022-11-18T16:15:15.401Z [ERROR] agent.server.raft: peer has newer term, stopping replication: peer="{Nonvoter fa4f7537-f206-1d53-a204-54fe44621258 192.168.2.30:8300}"
Nov 18 16:15:15 n1 consul[2869524]: 2022-11-18T16:15:15.444Z [INFO]  agent.server.raft: entering follower state: follower="Node at 192.168.2.101:8300 [Follower]" leader-address= leader-id=
Nov 18 16:15:15 n1 consul[2869524]: 2022-11-18T16:15:15.444Z [INFO]  agent.server.raft: aborting pipeline replication: peer="{Voter fc373d0b-3aed-44d4-ca0a-fe7f12d0118b 192.168.2.102:8300}"
Nov 18 16:15:15 n1 consul[2869524]: 2022-11-18T16:15:15.444Z [ERROR] agent.server: failed to wait for barrier: error="node is not the leader"
Nov 18 16:15:15 n1 consul[2869524]: 2022-11-18T16:15:15.444Z [INFO]  agent.server: cluster leadership lost
Nov 18 16:15:15 n1 consul[2869524]: 2022-11-18T16:15:15.447Z [WARN]  agent.server.raft: rejecting vote request since node is not a voter: from=192.168.2.30:8300
Nov 18 16:15:16 n1 consul[2869524]: 2022-11-18T16:15:16.561Z [WARN]  agent.server.raft: heartbeat timeout reached, starting election: last-leader-addr= last-leader-id=
Nov 18 16:15:16 n1 consul[2869524]: 2022-11-18T16:15:16.561Z [INFO]  agent.server.raft: entering candidate state: node="Node at 192.168.2.101:8300 [Candidate]" term=11027
Nov 18 16:15:16 n1 consul[2869524]: 2022-11-18T16:15:16.750Z [INFO]  agent.server.raft: election won: tally=2
Nov 18 16:15:16 n1 consul[2869524]: 2022-11-18T16:15:16.750Z [INFO]  agent.server.raft: entering leader state: leader="Node at 192.168.2.101:8300 [Leader]"
Nov 18 16:15:16 n1 consul[2869524]: 2022-11-18T16:15:16.750Z [INFO]  agent.server.raft: added peer, starting replication: peer=fc373d0b-3aed-44d4-ca0a-fe7f12d0118b
Nov 18 16:15:16 n1 consul[2869524]: 2022-11-18T16:15:16.750Z [INFO]  agent.server.raft: added peer, starting replication: peer=fa4f7537-f206-1d53-a204-54fe44621258
Nov 18 16:15:16 n1 consul[2869524]: 2022-11-18T16:15:16.750Z [INFO]  agent.server: cluster leadership acquired
Nov 18 16:15:16 n1 consul[2869524]: 2022-11-18T16:15:16.750Z [INFO]  agent.server: New leader elected: payload=n1
Nov 18 16:15:16 n1 consul[2869524]: 2022-11-18T16:15:16.751Z [INFO]  agent.server.raft: pipelining replication: peer="{Voter fc373d0b-3aed-44d4-ca0a-fe7f12d0118b 192.168.2.102:8300}"
Nov 18 16:15:16 n1 consul[2869524]: 2022-11-18T16:15:16.751Z [ERROR] agent.server.raft: peer has newer term, stopping replication: peer="{Nonvoter fa4f7537-f206-1d53-a204-54fe44621258 192.168.2.30:8300}"
Nov 18 16:15:16 n1 consul[2869524]: 2022-11-18T16:15:16.839Z [INFO]  agent.server.raft: entering follower state: follower="Node at 192.168.2.101:8300 [Follower]" leader-address= leader-id=
Nov 18 16:15:16 n1 consul[2869524]: 2022-11-18T16:15:16.839Z [INFO]  agent.server.raft: aborting pipeline replication: peer="{Voter fc373d0b-3aed-44d4-ca0a-fe7f12d0118b 192.168.2.102:8300}"
Nov 18 16:15:16 n1 consul[2869524]: 2022-11-18T16:15:16.839Z [ERROR] agent.server: failed to wait for barrier: error="node is not the leader"
Nov 18 16:15:16 n1 consul[2869524]: 2022-11-18T16:15:16.839Z [INFO]  agent.server: cluster leadership lost
Nov 18 16:15:16 n1 consul[2869524]: 2022-11-18T16:15:16.842Z [WARN]  agent: Deregistering service failed.: service=_nomad-task-0821da93-abd5-54a1-76cd-5789a445b09c-group-lldap-ldap-admin-web error="rpc error making call: rpc error making call: node is not the leader"
Nov 18 16:15:16 n1 consul[2869524]: 2022-11-18T16:15:16.842Z [ERROR] agent.anti_entropy: failed to sync remote state: error="rpc error making call: rpc error making call: node is not the leader"
Nov 18 16:15:17 n1 consul[2869524]: 2022-11-18T16:15:17.002Z [WARN]  agent.server.raft: rejecting vote request since node is not a voter: from=192.168.2.30:8300
Nov 18 16:15:18 n1 consul[2869524]: 2022-11-18T16:15:18.447Z [WARN]  agent.server.raft: rejecting vote request since node is not a voter: from=192.168.2.30:8300
Nov 18 16:15:18 n1 consul[2869524]: 2022-11-18T16:15:18.514Z [INFO]  agent.server: New leader elected: payload=n2
Nov 18 16:15:20 n1 consul[2869524]: 2022-11-18T16:15:20.127Z [WARN]  agent.server.raft: heartbeat timeout reached, starting election: last-leader-addr=192.168.2.102:8300 last-leader-id=fc373d0b-3aed-44d4-ca0a-fe7f12d0118b
Nov 18 16:15:20 n1 consul[2869524]: 2022-11-18T16:15:20.127Z [INFO]  agent.server.raft: entering candidate state: node="Node at 192.168.2.101:8300 [Candidate]" term=11029
Nov 18 16:15:20 n1 consul[2869524]: 2022-11-18T16:15:20.270Z [WARN]  agent.server.raft: rejecting vote request since node is not a voter: from=192.168.2.30:8300
Nov 18 16:15:20 n1 consul[2869524]: 2022-11-18T16:15:20.444Z [INFO]  agent.server.raft: election won: tally=2
Nov 18 16:15:20 n1 consul[2869524]: 2022-11-18T16:15:20.444Z [INFO]  agent.server.raft: entering leader state: leader="Node at 192.168.2.101:8300 [Leader]"
Nov 18 16:15:20 n1 consul[2869524]: 2022-11-18T16:15:20.444Z [INFO]  agent.server.raft: added peer, starting replication: peer=fc373d0b-3aed-44d4-ca0a-fe7f12d0118b
Nov 18 16:15:20 n1 consul[2869524]: 2022-11-18T16:15:20.444Z [INFO]  agent.server.raft: added peer, starting replication: peer=fa4f7537-f206-1d53-a204-54fe44621258
Nov 18 16:15:20 n1 consul[2869524]: 2022-11-18T16:15:20.446Z [INFO]  agent.server: cluster leadership acquired
Nov 18 16:15:20 n1 consul[2869524]: 2022-11-18T16:15:20.446Z [INFO]  agent.server: New leader elected: payload=n1
Nov 18 16:15:20 n1 consul[2869524]: 2022-11-18T16:15:20.446Z [INFO]  agent.server.raft: pipelining replication: peer="{Voter fc373d0b-3aed-44d4-ca0a-fe7f12d0118b 192.168.2.102:8300}"
Nov 18 16:15:20 n1 consul[2869524]: 2022-11-18T16:15:20.447Z [ERROR] agent.server.raft: peer has newer term, stopping replication: peer="{Nonvoter fa4f7537-f206-1d53-a204-54fe44621258 192.168.2.30:8300}"
Nov 18 16:15:20 n1 consul[2869524]: 2022-11-18T16:15:20.602Z [INFO]  agent.server.raft: entering follower state: follower="Node at 192.168.2.101:8300 [Follower]" leader-address= leader-id=
Nov 18 16:15:20 n1 consul[2869524]: 2022-11-18T16:15:20.603Z [INFO]  agent.server.raft: aborting pipeline replication: peer="{Voter fc373d0b-3aed-44d4-ca0a-fe7f12d0118b 192.168.2.102:8300}"
Nov 18 16:15:20 n1 consul[2869524]: 2022-11-18T16:15:20.603Z [ERROR] agent.server: failed to wait for barrier: error="leadership lost while committing log"

It’s also interesting that one of the nodes was made a backup voter. Perhaps that was autopilot’s doing.

I tried to recover with a peers.json file, and now it’s even more broken.

Nov 18 16:35:51 n2 systemd[1]: Started "HashiCorp Consul - A service mesh solution".
Nov 18 16:35:52 n2 consul[2613065]: ==> Starting Consul agent...
Nov 18 16:35:52 n2 consul[2613065]:               Version: '1.13.3'
Nov 18 16:35:52 n2 consul[2613065]:            Build Date: '2022-10-19 19:49:59 +0000 UTC'
Nov 18 16:35:52 n2 consul[2613065]:               Node ID: 'fc373d0b-3aed-44d4-ca0a-fe7f12d0118b'
Nov 18 16:35:52 n2 consul[2613065]:             Node name: 'n2'
Nov 18 16:35:52 n2 consul[2613065]:            Datacenter: 'dc1' (Segment: '<all>')
Nov 18 16:35:52 n2 consul[2613065]:                Server: true (Bootstrap: false)
Nov 18 16:35:52 n2 consul[2613065]:           Client Addr: [0.0.0.0] (HTTP: 8500, HTTPS: -1, gRPC: 8502, DNS: 8600)
Nov 18 16:35:52 n2 consul[2613065]:          Cluster Addr: 192.168.2.102 (LAN: 8301, WAN: 8302)
Nov 18 16:35:52 n2 consul[2613065]:     Gossip Encryption: true
Nov 18 16:35:52 n2 consul[2613065]:      Auto-Encrypt-TLS: false
Nov 18 16:35:52 n2 consul[2613065]:             HTTPS TLS: Verify Incoming: false, Verify Outgoing: false, Min Version: TLSv1_2
Nov 18 16:35:52 n2 consul[2613065]:              gRPC TLS: Verify Incoming: false, Min Version: TLSv1_2
Nov 18 16:35:52 n2 consul[2613065]:      Internal RPC TLS: Verify Incoming: false, Verify Outgoing: false (Verify Hostname: false), Min Version: TLSv1_2
Nov 18 16:35:52 n2 consul[2613065]: ==> Log data will now stream in as it occurs:
Nov 18 16:35:52 n2 consul[2613065]: 2022-11-18T16:35:52.008Z [WARN]  agent: skipping file /etc/consul.d/consul.env, extension must be .hcl or .json, or config format must be set
Nov 18 16:35:52 n2 consul[2613065]: 2022-11-18T16:35:52.009Z [WARN]  agent: bootstrap_expect > 0: expecting 3 servers
Nov 18 16:35:52 n2 consul[2613065]: 2022-11-18T16:35:52.025Z [WARN]  agent.auto_config: skipping file /etc/consul.d/consul.env, extension must be .hcl or .json, or config format must be set
Nov 18 16:35:52 n2 consul[2613065]: 2022-11-18T16:35:52.025Z [WARN]  agent.auto_config: bootstrap_expect > 0: expecting 3 servers
Nov 18 16:35:52 n2 consul[2613065]: 2022-11-18T16:35:52.160Z [INFO]  agent.server: found peers.json file, recovering Raft configuration...
Nov 18 16:35:52 n2 consul[2613065]: 2022-11-18T16:35:52.209Z [INFO]  agent.server.raft: snapshot restore progress: id=11327-139852-1668788892898 last-index=139852 last-term=11327 size-in-bytes=447167 read-bytes=447167 percent-complete=100.00%
Nov 18 16:35:52 n2 consul[2613065]: 2022-11-18T16:35:52.209Z [INFO]  agent.server.fsm: snapshot created: duration=41.694µs
Nov 18 16:35:52 n2 consul[2613065]: 2022-11-18T16:35:52.209Z [INFO]  agent.server.snapshot: creating new snapshot: path=/opt/consul/raft/snapshots/11327-139852-1668789352209.tmp
Nov 18 16:35:52 n2 consul[2613065]: 2022-11-18T16:35:52.210Z [WARN]  agent.server.raft: unable to get address for server, using fallback address: id=NmVmYTlkYTItYzI0OS1iNjY4LTY2OGMtOTlhY2E2OTI2NDEz fallback=192.168.2.101:8300 error="Could not find address for server id NmVmYTlkYTItYzI0OS1iNjY4LTY2OGMtOTlhY2E2OTI2NDEz"
Nov 18 16:35:52 n2 consul[2613065]: 2022-11-18T16:35:52.210Z [WARN]  agent.server.raft: unable to get address for server, using fallback address: id=ZmMzNzNkMGItM2FlZC00NGQ0LWNhMGEtZmU3ZjEyZDAxMThi fallback=192.168.2.102:8300 error="Could not find address for server id ZmMzNzNkMGItM2FlZC00NGQ0LWNhMGEtZmU3ZjEyZDAxMThi"
Nov 18 16:35:52 n2 consul[2613065]: 2022-11-18T16:35:52.210Z [WARN]  agent.server.raft: unable to get address for server, using fallback address: id=ZmE0Zjc1MzctZjIwNi0xZDUzLWEyMDQtNTRmZTQ0NjIxMjU4 fallback=192.168.2.30:8300 error="Could not find address for server id ZmE0Zjc1MzctZjIwNi0xZDUzLWEyMDQtNTRmZTQ0NjIxMjU4"
Nov 18 16:35:52 n2 consul[2613065]: 2022-11-18T16:35:52.529Z [INFO]  agent.server.snapshot: reaping snapshot: path=/opt/consul/raft/snapshots/11327-139852-1668788618151
Nov 18 16:35:52 n2 consul[2613065]: 2022-11-18T16:35:52.592Z [INFO]  agent.server: deleted peers.json file after successful recovery
Nov 18 16:35:52 n2 consul[2613065]: 2022-11-18T16:35:52.692Z [INFO]  agent.server.raft: starting restore from snapshot: id=11327-139852-1668789352209 last-index=139852 last-term=11327 size-in-bytes=447167
Nov 18 16:35:52 n2 consul[2613065]: 2022-11-18T16:35:52.728Z [INFO]  agent.server.raft: snapshot restore progress: id=11327-139852-1668789352209 last-index=139852 last-term=11327 size-in-bytes=447167 read-bytes=447167 percent-complete=100.00%
Nov 18 16:35:52 n2 consul[2613065]: 2022-11-18T16:35:52.728Z [INFO]  agent.server.raft: restored from snapshot: id=11327-139852-1668789352209 last-index=139852 last-term=11327 size-in-bytes=447167
Nov 18 16:35:52 n2 consul[2613065]: 2022-11-18T16:35:52.728Z [INFO]  agent.server.raft: initial configuration: index=1 servers="[{Suffrage:Voter ID:NmVmYTlkYTItYzI0OS1iNjY4LTY2OGMtOTlhY2E2OTI2NDEz Address:192.168.2.101:8300} {Suffrage:Voter ID:ZmMzNzNkMGItM2FlZC00NGQ0LWNhMGEtZmU3ZjEyZDAxMThi Address:192.168.2.102:8300} {Suffrage:Voter ID:ZmE0Zjc1MzctZjIwNi0xZDUzLWEyMDQtNTRmZTQ0NjIxMjU4 Address:192.168.2.30:8300}]"
Nov 18 16:35:52 n2 consul[2613065]: 2022-11-18T16:35:52.728Z [INFO]  agent.server.raft: entering follower state: follower="Node at 192.168.2.102:8300 [Follower]" leader-address= leader-id=
Nov 18 16:35:52 n2 consul[2613065]: 2022-11-18T16:35:52.729Z [INFO]  agent.server.serf.wan: serf: EventMemberJoin: n2.dc1 192.168.2.102
Nov 18 16:35:52 n2 consul[2613065]: 2022-11-18T16:35:52.730Z [INFO]  agent.server.serf.wan: serf: Attempting re-join to previously known node: n1.dc1: 192.168.2.101:8302
Nov 18 16:35:52 n2 consul[2613065]: 2022-11-18T16:35:52.730Z [INFO]  agent.server.serf.lan: serf: EventMemberJoin: n2 192.168.2.102
Nov 18 16:35:52 n2 consul[2613065]: 2022-11-18T16:35:52.731Z [INFO]  agent.router: Initializing LAN area manager
Nov 18 16:35:52 n2 consul[2613065]: 2022-11-18T16:35:52.731Z [INFO]  agent.server.autopilot: reconciliation now disabled
Nov 18 16:35:52 n2 consul[2613065]: 2022-11-18T16:35:52.732Z [INFO]  agent.server.serf.wan: serf: EventMemberJoin: n1.dc1 192.168.2.101
Nov 18 16:35:52 n2 consul[2613065]: 2022-11-18T16:35:52.733Z [INFO]  agent.server.serf.wan: serf: Re-joined to previously known node: n1.dc1: 192.168.2.101:8302
Nov 18 16:35:52 n2 consul[2613065]: 2022-11-18T16:35:52.733Z [INFO]  agent.server.serf.lan: serf: Attempting re-join to previously known node: ubuntu: 192.168.2.30:8301
Nov 18 16:35:52 n2 consul[2613065]: 2022-11-18T16:35:52.734Z [INFO]  agent.server: Adding LAN server: server="n2 (Addr: tcp/192.168.2.102:8300) (DC: dc1)"
Nov 18 16:35:52 n2 consul[2613065]: 2022-11-18T16:35:52.734Z [INFO]  agent.server.serf.lan: serf: Attempting re-join to previously known node: n1: 192.168.2.101:8301
Nov 18 16:35:52 n2 consul[2613065]: 2022-11-18T16:35:52.735Z [INFO]  agent.server: Handled event for server in area: event=member-join server=n2.dc1 area=wan
Nov 18 16:35:52 n2 consul[2613065]: 2022-11-18T16:35:52.735Z [INFO]  agent.server: Handled event for server in area: event=member-join server=n1.dc1 area=wan
Nov 18 16:35:52 n2 consul[2613065]: 2022-11-18T16:35:52.737Z [WARN]  agent: [core]grpc: addrConn.createTransport failed to connect to {dc1-192.168.2.102:8300 n2 <nil> 0 <nil>}. Err: connection error: desc = "transport: Error while dialing dial tcp 192.168.2.102:0->192.168.2.102:8300: operation was canceled". Reconnecting...
Nov 18 16:35:52 n2 consul[2613065]: 2022-11-18T16:35:52.737Z [INFO]  agent.server.serf.lan: serf: EventMemberJoin: n1 192.168.2.101
Nov 18 16:35:52 n2 consul[2613065]: 2022-11-18T16:35:52.737Z [INFO]  agent.server.serf.lan: serf: Re-joined to previously known node: n1: 192.168.2.101:8301
Nov 18 16:35:52 n2 consul[2613065]: 2022-11-18T16:35:52.738Z [INFO]  agent.server: Adding LAN server: server="n1 (Addr: tcp/192.168.2.101:8300) (DC: dc1)"
Nov 18 16:35:52 n2 consul[2613065]: 2022-11-18T16:35:52.746Z [INFO]  agent: Started DNS server: address=0.0.0.0:8600 network=udp
Nov 18 16:35:52 n2 consul[2613065]: 2022-11-18T16:35:52.746Z [INFO]  agent: Started DNS server: address=0.0.0.0:8600 network=tcp
Nov 18 16:35:52 n2 consul[2613065]: 2022-11-18T16:35:52.747Z [INFO]  agent: Starting server: address=[::]:8500 network=tcp protocol=http
Nov 18 16:35:52 n2 consul[2613065]: 2022-11-18T16:35:52.749Z [INFO]  agent: Started gRPC server: address=[::]:8502 network=tcp
Nov 18 16:35:52 n2 consul[2613065]: 2022-11-18T16:35:52.750Z [INFO]  agent: Retry join is supported for the following discovery methods: cluster=LAN discovery_methods="aliyun aws azure digitalocean gce k8s linode mdns os packet scaleway softlayer tencentcloud triton vsphere"
Nov 18 16:35:52 n2 consul[2613065]: 2022-11-18T16:35:52.750Z [INFO]  agent: Joining cluster...: cluster=LAN
Nov 18 16:35:52 n2 consul[2613065]: 2022-11-18T16:35:52.750Z [INFO]  agent: (LAN) joining: lan_addresses=[192.168.2.101, 192.168.2.102, 192.168.2.30]
Nov 18 16:35:52 n2 consul[2613065]: 2022-11-18T16:35:52.750Z [INFO]  agent: started state syncer
Nov 18 16:35:52 n2 consul[2613065]: 2022-11-18T16:35:52.750Z [INFO]  agent: Consul agent running!
Nov 18 16:35:52 n2 consul[2613065]: 2022-11-18T16:35:52.752Z [INFO]  agent: (LAN) joined: number_of_nodes=2
Nov 18 16:35:52 n2 consul[2613065]: 2022-11-18T16:35:52.753Z [INFO]  agent: Join cluster completed. Synced with initial agents: cluster=LAN num_agents=2
Nov 18 16:35:53 n2 consul[2613065]: 2022-11-18T16:35:53.398Z [INFO]  agent.server.serf.wan: serf: EventMemberJoin: ubuntu.dc1 192.168.2.30
Nov 18 16:35:53 n2 consul[2613065]: 2022-11-18T16:35:53.398Z [INFO]  agent.server: Handled event for server in area: event=member-join server=ubuntu.dc1 area=wan
Nov 18 16:35:53 n2 consul[2613065]: 2022-11-18T16:35:53.413Z [INFO]  agent.server.serf.lan: serf: EventMemberJoin: ubuntu 192.168.2.30
Nov 18 16:35:53 n2 consul[2613065]: 2022-11-18T16:35:53.413Z [INFO]  agent.server: Adding LAN server: server="ubuntu (Addr: tcp/192.168.2.30:8300) (DC: dc1)"
Nov 18 16:35:53 n2 consul[2613065]: 2022-11-18T16:35:53.415Z [INFO]  agent.server: Existing Raft peers reported by server, disabling bootstrap mode: server=ubuntu
Nov 18 16:35:54 n2 consul[2613065]: 2022-11-18T16:35:54.527Z [WARN]  agent.server.raft: not part of stable configuration, aborting election
Nov 18 16:35:59 n2 consul[2613065]: 2022-11-18T16:35:59.756Z [WARN]  agent.cache: handling error in Cache.Notify: cache-type=resolved-service-config error="No cluster leader" index=0
Nov 18 16:35:59 n2 consul[2613065]: 2022-11-18T16:35:59.756Z [ERROR] agent: error handling service update: error="error watching service config: No cluster leader"
Nov 18 16:35:59 n2 consul[2613065]: 2022-11-18T16:35:59.771Z [WARN]  agent.cache: handling error in Cache.Notify: cache-type=resolved-service-config error="No cluster leader" index=0
Nov 18 16:35:59 n2 consul[2613065]: 2022-11-18T16:35:59.771Z [ERROR] agent: error handling service update: error="error watching service config: No cluster leader"
Nov 18 16:35:59 n2 consul[2613065]: 2022-11-18T16:35:59.784Z [WARN]  agent.cache: handling error in Cache.Notify: cache-type=resolved-service-config error="No cluster leader" index=0
Nov 18 16:35:59 n2 consul[2613065]: 2022-11-18T16:35:59.784Z [ERROR] agent: error handling service update: error="error watching service config: No cluster leader"
Nov 18 16:35:59 n2 consul[2613065]: 2022-11-18T16:35:59.864Z [WARN]  agent.cache: handling error in Cache.Notify: cache-type=connect-ca-leaf error="No cluster leader" index=0
Nov 18 16:35:59 n2 consul[2613065]: 2022-11-18T16:35:59.865Z [WARN]  agent.cache: handling error in Cache.Notify: cache-type=connect-ca-root error="No cluster leader" index=0
Nov 18 16:35:59 n2 consul[2613065]: 2022-11-18T16:35:59.865Z [WARN]  agent.cache: handling error in Cache.Notify: cache-type=connect-ca-leaf error="No cluster leader" index=0
Nov 18 16:35:59 n2 consul[2613065]: 2022-11-18T16:35:59.865Z [WARN]  agent.cache: handling error in Cache.Notify: cache-type=connect-ca-leaf error="No cluster leader" index=0

The first problem:

During the network interruption, it appears the various nodes have got into a state where they disagree about the voting membership of the cluster.
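The disagreement shows up directly in the first log: n1 lists 192.168.2.30 as a Nonvoter and rejects its vote requests, while that same peer is reported as having a newer term. A minimal sketch for tallying those two symptoms from a saved log follows. The function name and the regular expressions are hypothetical, written against the lines quoted above, and the sketch assumes each log entry sits on a single line (re-join any wrapped lines first).

```python
import re
from collections import Counter

# Patterns written against the log lines quoted in this thread.
VOTE_REJECTED = re.compile(
    r"rejecting vote request since node is not a voter: from=(\S+)"
)
NEWER_TERM = re.compile(
    r'peer has newer term, stopping replication: peer="\{(\w+) (\S+) ([^\s}]+)\}'
)


def summarize_membership_conflicts(lines):
    """Count rejected vote requests and newer-term peers per address."""
    rejected = Counter()
    newer_term = Counter()
    for line in lines:
        match = VOTE_REJECTED.search(line)
        if match:
            rejected[match.group(1)] += 1
            continue
        match = NEWER_TERM.search(line)
        if match:
            suffrage, _node_id, address = match.groups()
            newer_term[(suffrage, address)] += 1
    return rejected, newer_term


# Two entries taken from the first log in this thread.
sample = [
    "2022-11-18T16:15:12.749Z [WARN]  agent.server.raft: rejecting vote "
    "request since node is not a voter: from=192.168.2.30:8300",
    "2022-11-18T16:15:13.918Z [ERROR] agent.server.raft: peer has newer term, "
    'stopping replication: peer="{Nonvoter fa4f7537-f206-1d53-a204-54fe44621258 '
    '192.168.2.30:8300}"',
]
rejected, newer_term = summarize_membership_conflicts(sample)
print(dict(rejected))
print(dict(newer_term))
```

If the same address appears in both tallies, as 192.168.2.30:8300 does here, that server still believes it is entitled to stand for election while the other servers consider it a non-voter.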

Consul has an autopilot feature for removing dead servers from the Raft quorum automatically. However, that same feature breaks your cluster if it triggers on a server that isn’t actually dead and that later tries to rejoin.

I suspect this may be what happened here.

You may want to consider changing the relevant autopilot config (see "Commands: Operator Autopilot" in the Consul documentation on HashiCorp Developer): set either -cleanup-dead-servers=false or -min-quorum=ACTUAL_NUMBER_OF_SERVERS.
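The same two settings can also be changed through the HTTP API instead of the CLI. The sketch below is a rough illustration under assumptions, not a tested procedure: the helper names are hypothetical, the agent address is assumed to be a local agent on port 8500 with no ACL token, and the field names CleanupDeadServers and MinQuorum are assumed to match the operator autopilot configuration endpoint. The update needs a cluster that currently has a leader, so this is a preventative measure rather than a recovery step.

```python
import json
import urllib.request

# Assumed address of a local agent; adjust for your environment.
AUTOPILOT_URL = "http://127.0.0.1:8500/v1/operator/autopilot/configuration"


def safer_autopilot_config(current, server_count=None):
    """Return a modified copy of the current autopilot configuration.

    server_count=None: turn dead-server cleanup off.
    server_count=N:    keep cleanup on, but set the minimum quorum to N.
    """
    updated = dict(current)  # do not mutate the caller's dict
    if server_count is None:
        updated["CleanupDeadServers"] = False
    else:
        updated["MinQuorum"] = server_count
    return updated


def apply_config(config, url=AUTOPILOT_URL):
    """PUT the configuration to the agent (not called here; needs a live cluster)."""
    request = urllib.request.Request(
        url, data=json.dumps(config).encode("utf-8"), method="PUT"
    )
    with urllib.request.urlopen(request) as response:
        return response.read()


# Illustrative starting point; a real one would come from a GET on the same URL.
example = {"CleanupDeadServers": True, "MinQuorum": 0}
print(safer_autopilot_config(example))
print(safer_autopilot_config(example, server_count=3))
```

Only one of the two changes is needed. For a 3-server cluster like this one, server_count=3 corresponds to -min-quorum=3.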

The second problem:

When you attempted peers.json recovery you specified incorrect server IDs.

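Some correct IDs from the first log (the peer= values):

fc373d0b-3aed-44d4-ca0a-fe7f12d0118b
fa4f7537-f206-1d53-a204-54fe44621258

Incorrect IDs from the second log (the id= values in the "unable to get address for server" warnings):

NmVmYTlkYTItYzI0OS1iNjY4LTY2OGMtOTlhY2E2OTI2NDEz
ZmMzNzNkMGItM2FlZC00NGQ0LWNhMGEtZmU3ZjEyZDAxMThi
ZmE0Zjc1MzctZjIwNi0xZDUzLWEyMDQtNTRmZTQ0NjIxMjU4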
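A check like the following, run before restarting Consul, would flag this kind of mistake. It is a minimal sketch that assumes peers.json entries carry the node ID in UUID form, as the IDs in the first log do. The function name is hypothetical, and the sample entries reuse one ID and address pair from each log.

```python
import json
import uuid


def invalid_peer_ids(peers_json_text):
    """Return the "id" values in a peers.json document that are not UUIDs."""
    bad = []
    for entry in json.loads(peers_json_text):
        value = entry["id"]
        try:
            # Require the canonical hyphenated form, as seen in the logs.
            is_uuid = str(uuid.UUID(value)) == value.lower()
        except ValueError:
            is_uuid = False
        if not is_uuid:
            bad.append(value)
    return bad


sample = """
[
  {"id": "fc373d0b-3aed-44d4-ca0a-fe7f12d0118b",
   "address": "192.168.2.102:8300", "non_voter": false},
  {"id": "ZmE0Zjc1MzctZjIwNi0xZDUzLWEyMDQtNTRmZTQ0NjIxMjU4",
   "address": "192.168.2.30:8300", "non_voter": false}
]
"""
print(invalid_peer_ids(sample))
```

An empty result means every ID has the expected shape. It does not prove that each ID belongs to the server at the listed address.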

Oh, interesting. I hadn’t noticed that about the node IDs before. It’s the value that I got from the node-id file, though.

I’ve been automating this with an Ansible playbook:

---
- name: Recover Consul
  hosts: consul_instances

  tasks:
    - name: Stop Consul
      systemd:
        name: consul
        state: stopped
      become: true

    - name: Get node-id
      slurp:
        src: /opt/consul/node-id
      register: consul_node_id
      become: true

    - name: Node Info
      debug:
        msg: |
          node_id: {{ consul_node_id.content }}
          address: {{ ansible_default_ipv4.address }}

    - name: Save
      copy:
        dest: "/opt/consul/raft/peers.json"
        # I used to have reject('equalto', inventory_hostname) in the loop, but I'm not sure if I should
        content: |
          [
          {% for host in ansible_play_hosts -%}
            {
            "id": "{{ hostvars[host].consul_node_id.content }}",
            "address": "{{ hostvars[host].ansible_default_ipv4.address }}:8300",
            "non_voter": false
            }{% if not loop.last %},{% endif %}
          {% endfor -%}
          ]
      become: true

    - name: Restart Consul
      systemd:
        name: consul
        state: restarted
      become: true

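I’ve just realised that the incorrect IDs are the correct IDs after having been run through base64 encoding. Ansible’s slurp module returns the file content base64-encoded, and the playbook writes that value into peers.json without decoding it.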
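This is easy to confirm by decoding one of the id= values from the second log. A short sketch follows; the function name is hypothetical.

```python
import base64


def decode_slurped_node_id(content):
    """Decode a base64 string (such as slurp's content field) to the node ID text."""
    return base64.b64decode(content).decode("utf-8").strip()


# id= value reported for 192.168.2.102:8300 in the second log.
logged_id = "ZmMzNzNkMGItM2FlZC00NGQ0LWNhMGEtZmU3ZjEyZDAxMThi"
print(decode_slurped_node_id(logged_id))
# prints fc373d0b-3aed-44d4-ca0a-fe7f12d0118b, the Node ID n2 shows at startup
```

In the playbook itself, the equivalent fix would be to pass consul_node_id.content through Ansible's b64decode filter (plus trim, in case the file ends with a newline) before templating it into peers.json.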

I ended up recovering by wiping all the data and starting a fresh cluster. I have now hit the same issue again today. This time, no network interruption happened, as far as I can tell.

Can you provide logs that start before the problem occurs and run through to afterwards?

Sure. Here are logs from two nodes from right around the time the issue arose. There were a lot of service health check entries from the client, so I filtered the logs to the server only.

Nov 22 02:18:22 n2 consul[2715752]: 2022-11-22T02:18:22.564Z [INFO]  agent.server.raft: starting snapshot up to: index=16387
Nov 22 02:18:22 n2 consul[2715752]: 2022-11-22T02:18:22.564Z [INFO]  agent.server.snapshot: creating new snapshot: path=/opt/consul/raft/snapshots/6-16387-1669083502564.tmp
Nov 22 02:18:22 n2 consul[2715752]: 2022-11-22T02:18:22.743Z [INFO]  agent.server.raft: compacting logs: from=1 to=6147
Nov 22 02:18:23 n2 consul[2715752]: 2022-11-22T02:18:23.469Z [INFO]  agent.server.raft: snapshot complete up to: index=16387
Nov 22 11:00:24 n2 consul[2715752]: 2022-11-22T11:00:24.102Z [INFO]  agent.server.memberlist.lan: memberlist: Suspect ubuntu has failed, no acks received
Nov 22 11:00:27 n2 consul[2715752]: 2022-11-22T11:00:27.103Z [INFO]  agent.server.memberlist.lan: memberlist: Suspect ubuntu has failed, no acks received
Nov 22 11:00:28 n2 consul[2715752]: 2022-11-22T11:00:28.073Z [INFO]  agent.server.memberlist.lan: memberlist: Marking ubuntu as failed, suspect timeout reached (0 peer confirmations)
Nov 22 11:00:28 n2 consul[2715752]: 2022-11-22T11:00:28.073Z [INFO]  agent.server.serf.lan: serf: EventMemberFailed: ubuntu 192.168.2.30
Nov 22 11:00:28 n2 consul[2715752]: 2022-11-22T11:00:28.073Z [INFO]  agent.server: Removing LAN server: server="ubuntu (Addr: tcp/192.168.2.30:8300) (DC: dc1)"
Nov 22 11:00:29 n2 consul[2715752]: 2022-11-22T11:00:29.104Z [INFO]  agent.server.memberlist.lan: memberlist: Suspect ubuntu has failed, no acks received
Nov 22 11:00:29 n2 consul[2715752]: 2022-11-22T11:00:29.105Z [INFO]  agent.server.serf.lan: serf: EventMemberLeave (forced): ubuntu 192.168.2.30
Nov 22 11:00:29 n2 consul[2715752]: 2022-11-22T11:00:29.105Z [INFO]  agent.server: Removing LAN server: server="ubuntu (Addr: tcp/192.168.2.30:8300) (DC: dc1)"
Nov 22 11:00:34 n2 consul[2715752]: 2022-11-22T11:00:34.969Z [WARN]  agent.server.raft: rejecting vote request since node is not a voter: from=192.168.2.30:8300
Nov 22 11:00:35 n2 consul[2715752]: 2022-11-22T11:00:35.043Z [WARN]  agent.server.raft: rejecting vote request since node is not a voter: from=192.168.2.30:8300
Nov 22 11:00:35 n2 consul[2715752]: 2022-11-22T11:00:35.102Z [INFO]  agent.server.memberlist.wan: memberlist: Suspect ubuntu.dc1 has failed, no acks received
Nov 22 11:00:35 n2 consul[2715752]: 2022-11-22T11:00:35.395Z [WARN]  agent.server.raft: rejecting vote request since node is not a voter: from=192.168.2.30:8300
Nov 22 11:00:36 n2 consul[2715752]: 2022-11-22T11:00:36.324Z [WARN]  agent.server.raft: rejecting vote request since node is not a voter: from=192.168.2.30:8300
Nov 22 11:00:36 n2 consul[2715752]: 2022-11-22T11:00:36.752Z [WARN]  agent.server.raft: rejecting vote request since node is not a voter: from=192.168.2.30:8300
Nov 22 11:00:36 n2 consul[2715752]: 2022-11-22T11:00:36.772Z [WARN]  agent.server.raft: rejecting vote request since node is not a voter: from=192.168.2.30:8300
Nov 22 11:00:36 n2 consul[2715752]: 2022-11-22T11:00:36.834Z [WARN]  agent.server.raft: rejecting vote request since node is not a voter: from=192.168.2.30:8300
Nov 22 11:00:37 n2 consul[2715752]: 2022-11-22T11:00:37.858Z [WARN]  agent.server.raft: rejecting vote request since node is not a voter: from=192.168.2.30:8300
Nov 22 11:00:38 n2 consul[2715752]: 2022-11-22T11:00:38.115Z [WARN]  agent.server.raft: rejecting vote request since node is not a voter: from=192.168.2.30:8300
Nov 22 11:00:38 n2 consul[2715752]: 2022-11-22T11:00:38.141Z [WARN]  agent.server.raft: rejecting vote request since node is not a voter: from=192.168.2.30:8300
Nov 22 11:00:38 n2 consul[2715752]: 2022-11-22T11:00:38.230Z [WARN]  agent.server.memberlist.lan: memberlist: Refuting a suspect message (from: n2)
Nov 22 11:00:38 n2 consul[2715752]: 2022-11-22T11:00:38.249Z [INFO]  agent.server.serf.lan: serf: EventMemberJoin: ubuntu 192.168.2.30
Nov 22 11:00:38 n2 consul[2715752]: 2022-11-22T11:00:38.250Z [INFO]  agent.server: Adding LAN server: server="ubuntu (Addr: tcp/192.168.2.30:8300) (DC: dc1)"
Nov 22 11:00:39 n2 consul[2715752]: 2022-11-22T11:00:39.573Z [WARN]  agent.server.raft: rejecting vote request since node is not a voter: from=192.168.2.30:8300
Nov 22 11:00:41 n2 consul[2715752]: 2022-11-22T11:00:41.930Z [WARN]  agent.server.raft: rejecting vote request since node is not a voter: from=192.168.2.30:8300
Nov 22 11:00:42 n2 consul[2715752]: 2022-11-22T11:00:42.053Z [INFO]  agent.server: New leader elected: payload=n1
Nov 22 11:00:42 n2 consul[2715752]: 2022-11-22T11:00:42.557Z [WARN]  agent.server.raft: rejecting vote request since node is not a voter: from=192.168.2.30:8300
Nov 22 11:00:42 n2 consul[2715752]: 2022-11-22T11:00:42.558Z [ERROR] agent.server.raft: failed to flush response: error="write tcp 192.168.2.102:8300->192.168.2.101:60581: write: broken pipe"

And at the same time on another node:

Nov 22 01:01:03 ubuntu consul[846683]: 2022-11-22T01:01:03.649Z [ERROR] agent.server.raft: failed to flush response: error="write tcp 192.168.2.30:8300->192.168.2.101:43637: write: broken pipe"
Nov 22 02:18:23 ubuntu consul[846683]: 2022-11-22T02:18:23.290Z [INFO]  agent.server.raft: starting snapshot up to: index=16387
Nov 22 02:18:23 ubuntu consul[846683]: 2022-11-22T02:18:23.290Z [INFO]  agent.server.snapshot: creating new snapshot: path=/opt/consul/raft/snapshots/6-16387-1669083503290.tmp
Nov 22 02:18:23 ubuntu consul[846683]: 2022-11-22T02:18:23.409Z [INFO]  agent.server.raft: compacting logs: from=1 to=6147
Nov 22 02:18:23 ubuntu consul[846683]: 2022-11-22T02:18:23.689Z [INFO]  agent.server.raft: snapshot complete up to: index=16387
Nov 22 11:00:22 ubuntu consul[846683]: 2022-11-22T11:00:22.648Z [INFO]  agent.server.memberlist.lan: memberlist: Suspect n1 has failed, no acks received
Nov 22 11:00:23 ubuntu consul[846683]: 2022-11-22T11:00:23.016Z [WARN]  agent.server.raft: heartbeat timeout reached, starting election: last-leader-addr=192.168.2.101:8300 last-leader-id=bdd9f217-4eba-d13d-208b-172624a72dd3
Nov 22 11:00:23 ubuntu consul[846683]: 2022-11-22T11:00:23.017Z [INFO]  agent.server.raft: entering candidate state: node="Node at 192.168.2.30:8300 [Candidate]" term=7
Nov 22 11:00:24 ubuntu consul[846683]: 2022-11-22T11:00:24.433Z [WARN]  agent.server.raft: Election timeout reached, restarting election
Nov 22 11:00:24 ubuntu consul[846683]: 2022-11-22T11:00:24.434Z [INFO]  agent.server.raft: entering candidate state: node="Node at 192.168.2.30:8300 [Candidate]" term=8
Nov 22 11:00:24 ubuntu consul[846683]: 2022-11-22T11:00:24.649Z [INFO]  agent.server.memberlist.lan: memberlist: Suspect n2 has failed, no acks received
Nov 22 11:00:26 ubuntu consul[846683]: 2022-11-22T11:00:26.261Z [WARN]  agent.server.raft: Election timeout reached, restarting election
Nov 22 11:00:26 ubuntu consul[846683]: 2022-11-22T11:00:26.481Z [INFO]  agent.server.raft: entering candidate state: node="Node at 192.168.2.30:8300 [Candidate]" term=9
Nov 22 11:00:26 ubuntu consul[846683]: 2022-11-22T11:00:26.650Z [INFO]  agent.server.memberlist.lan: memberlist: Marking n1 as failed, suspect timeout reached (0 peer confirmations)
Nov 22 11:00:26 ubuntu consul[846683]: 2022-11-22T11:00:26.651Z [INFO]  agent.server.serf.lan: serf: EventMemberFailed: n1 192.168.2.101
Nov 22 11:00:26 ubuntu consul[846683]: 2022-11-22T11:00:26.651Z [INFO]  agent.server: Removing LAN server: server="n1 (Addr: tcp/192.168.2.101:8300) (DC: dc1)"
Nov 22 11:00:27 ubuntu consul[846683]: 2022-11-22T11:00:27.650Z [INFO]  agent.server.memberlist.lan: memberlist: Suspect n2 has failed, no acks received
Nov 22 11:00:27 ubuntu consul[846683]: 2022-11-22T11:00:27.880Z [WARN]  agent.server.raft: Election timeout reached, restarting election
Nov 22 11:00:27 ubuntu consul[846683]: 2022-11-22T11:00:27.880Z [INFO]  agent.server.raft: entering candidate state: node="Node at 192.168.2.30:8300 [Candidate]" term=10
Nov 22 11:00:27 ubuntu consul[846683]: 2022-11-22T11:00:27.900Z [WARN]  agent.server.raft: unable to get address for server, using fallback address: id=bdd9f217-4eba-d13d-208b-172624a72dd3 fallback=192.168.2.101:8300 error="Could not find address for server id bdd9f217-4eba-d13d-208b-172624a72dd3"
Nov 22 11:00:28 ubuntu consul[846683]: 2022-11-22T11:00:28.651Z [INFO]  agent.server.memberlist.lan: memberlist: Marking n2 as failed, suspect timeout reached (0 peer confirmations)
Nov 22 11:00:28 ubuntu consul[846683]: 2022-11-22T11:00:28.651Z [INFO]  agent.server.serf.lan: serf: EventMemberFailed: n2 192.168.2.102
Nov 22 11:00:28 ubuntu consul[846683]: 2022-11-22T11:00:28.651Z [INFO]  agent.server: Removing LAN server: server="n2 (Addr: tcp/192.168.2.102:8300) (DC: dc1)"
Nov 22 11:00:29 ubuntu consul[846683]: 2022-11-22T11:00:29.120Z [WARN]  agent.server.raft: Election timeout reached, restarting election
Nov 22 11:00:29 ubuntu consul[846683]: 2022-11-22T11:00:29.120Z [INFO]  agent.server.raft: entering candidate state: node="Node at 192.168.2.30:8300 [Candidate]" term=11
Nov 22 11:00:29 ubuntu consul[846683]: 2022-11-22T11:00:29.130Z [WARN]  agent.server.raft: unable to get address for server, using fallback address: id=826026f7-6ed8-c6b2-e452-3bd40a62d53a fallback=192.168.2.102:8300 error="Could not find address for server id 826026f7-6ed8-c6b2-e452-3bd40a62d53a"
Nov 22 11:00:29 ubuntu consul[846683]: 2022-11-22T11:00:29.130Z [WARN]  agent.server.raft: unable to get address for server, using fallback address: id=bdd9f217-4eba-d13d-208b-172624a72dd3 fallback=192.168.2.101:8300 error="Could not find address for server id bdd9f217-4eba-d13d-208b-172624a72dd3"
Nov 22 11:00:29 ubuntu consul[846683]: 2022-11-22T11:00:29.644Z [INFO]  agent.server.memberlist.wan: memberlist: Suspect n1.dc1 has failed, no acks received
Nov 22 11:00:30 ubuntu consul[846683]: 2022-11-22T11:00:30.843Z [WARN]  agent.server.raft: Election timeout reached, restarting election
Nov 22 11:00:30 ubuntu consul[846683]: 2022-11-22T11:00:30.844Z [INFO]  agent.server.raft: entering candidate state: node="Node at 192.168.2.30:8300 [Candidate]" term=12
Nov 22 11:00:30 ubuntu consul[846683]: 2022-11-22T11:00:30.865Z [WARN]  agent.server.raft: unable to get address for server, using fallback address: id=826026f7-6ed8-c6b2-e452-3bd40a62d53a fallback=192.168.2.102:8300 error="Could not find address for server id 826026f7-6ed8-c6b2-e452-3bd40a62d53a"
Nov 22 11:00:30 ubuntu consul[846683]: 2022-11-22T11:00:30.865Z [WARN]  agent.server.raft: unable to get address for server, using fallback address: id=bdd9f217-4eba-d13d-208b-172624a72dd3 fallback=192.168.2.101:8300 error="Could not find address for server id bdd9f217-4eba-d13d-208b-172624a72dd3"
Nov 22 11:00:32 ubuntu consul[846683]: 2022-11-22T11:00:32.348Z [WARN]  agent.server.raft: Election timeout reached, restarting election
Nov 22 11:00:32 ubuntu consul[846683]: 2022-11-22T11:00:32.349Z [INFO]  agent.server.raft: entering candidate state: node="Node at 192.168.2.30:8300 [Candidate]" term=13
Nov 22 11:00:32 ubuntu consul[846683]: 2022-11-22T11:00:32.359Z [WARN]  agent.server.raft: unable to get address for server, using fallback address: id=826026f7-6ed8-c6b2-e452-3bd40a62d53a fallback=192.168.2.102:8300 error="Could not find address for server id 826026f7-6ed8-c6b2-e452-3bd40a62d53a"
Nov 22 11:00:32 ubuntu consul[846683]: 2022-11-22T11:00:32.359Z [WARN]  agent.server.raft: unable to get address for server, using fallback address: id=bdd9f217-4eba-d13d-208b-172624a72dd3 fallback=192.168.2.101:8300 error="Could not find address for server id bdd9f217-4eba-d13d-208b-172624a72dd3"
Nov 22 11:00:32 ubuntu consul[846683]: 2022-11-22T11:00:32.649Z [INFO]  agent.server.memberlist.lan: memberlist: Suspect n2 has failed, no acks received
Nov 22 11:00:33 ubuntu consul[846683]: 2022-11-22T11:00:33.039Z [ERROR] agent.server.raft: failed to make requestVote RPC: target="{Voter bdd9f217-4eba-d13d-208b-172624a72dd3 192.168.2.101:8300}" error="read tcp 192.168.2.30:40203->192.168.2.101:8300: i/o timeout"
Nov 22 11:00:33 ubuntu consul[846683]: 2022-11-22T11:00:33.039Z [ERROR] agent.server.raft: failed to make requestVote RPC: target="{Voter 826026f7-6ed8-c6b2-e452-3bd40a62d53a 192.168.2.102:8300}" error="read tcp 192.168.2.30:37411->192.168.2.102:8300: i/o timeout"
Nov 22 11:00:33 ubuntu consul[846683]: 2022-11-22T11:00:33.719Z [WARN]  agent.server.raft: Election timeout reached, restarting election
Nov 22 11:00:33 ubuntu consul[846683]: 2022-11-22T11:00:33.719Z [INFO]  agent.server.raft: entering candidate state: node="Node at 192.168.2.30:8300 [Candidate]" term=14
Nov 22 11:00:33 ubuntu consul[846683]: 2022-11-22T11:00:33.736Z [WARN]  agent.server.raft: unable to get address for server, using fallback address: id=826026f7-6ed8-c6b2-e452-3bd40a62d53a fallback=192.168.2.102:8300 error="Could not find address for server id 826026f7-6ed8-c6b2-e452-3bd40a62d53a"
Nov 22 11:00:33 ubuntu consul[846683]: 2022-11-22T11:00:33.736Z [WARN]  agent.server.raft: unable to get address for server, using fallback address: id=bdd9f217-4eba-d13d-208b-172624a72dd3 fallback=192.168.2.101:8300 error="Could not find address for server id bdd9f217-4eba-d13d-208b-172624a72dd3"
Nov 22 11:00:34 ubuntu consul[846683]: 2022-11-22T11:00:34.455Z [ERROR] agent.server.raft: failed to make requestVote RPC: target="{Voter 826026f7-6ed8-c6b2-e452-3bd40a62d53a 192.168.2.102:8300}" error="read tcp 192.168.2.30:39397->192.168.2.102:8300: i/o timeout"
Nov 22 11:00:34 ubuntu consul[846683]: 2022-11-22T11:00:34.456Z [ERROR] agent.server.raft: failed to make requestVote RPC: target="{Voter bdd9f217-4eba-d13d-208b-172624a72dd3 192.168.2.101:8300}" error="dial tcp 192.168.2.30:0->192.168.2.101:8300: i/o timeout"
Nov 22 11:00:34 ubuntu consul[846683]: 2022-11-22T11:00:34.954Z [WARN]  agent.server.raft: Election timeout reached, restarting election
Nov 22 11:00:34 ubuntu consul[846683]: 2022-11-22T11:00:34.954Z [INFO]  agent.server.raft: entering candidate state: node="Node at 192.168.2.30:8300 [Candidate]" term=15
Nov 22 11:00:34 ubuntu consul[846683]: 2022-11-22T11:00:34.967Z [WARN]  agent.server.raft: unable to get address for server, using fallback address: id=826026f7-6ed8-c6b2-e452-3bd40a62d53a fallback=192.168.2.102:8300 error="Could not find address for server id 826026f7-6ed8-c6b2-e452-3bd40a62d53a"
Nov 22 11:00:34 ubuntu consul[846683]: 2022-11-22T11:00:34.967Z [WARN]  agent.server.raft: unable to get address for server, using fallback address: id=bdd9f217-4eba-d13d-208b-172624a72dd3 fallback=192.168.2.101:8300 error="Could not find address for server id bdd9f217-4eba-d13d-208b-172624a72dd3"
Nov 22 11:00:35 ubuntu consul[846683]: 2022-11-22T11:00:35.061Z [ERROR] agent.server.raft: failed to flush response: error="write tcp 192.168.2.30:8300->192.168.2.101:40485: write: connection reset by peer"
Nov 22 11:00:35 ubuntu consul[846683]: 2022-11-22T11:00:35.589Z [WARN]  agent.server.memberlist.wan: memberlist: Refuting a suspect message (from: ubuntu.dc1)
Nov 22 11:00:36 ubuntu consul[846683]: 2022-11-22T11:00:36.638Z [ERROR] agent.server.raft: failed to make requestVote RPC: target="{Voter bdd9f217-4eba-d13d-208b-172624a72dd3 192.168.2.101:8300}" error="dial tcp 192.168.2.30:0->192.168.2.101:8300: i/o timeout"
Nov 22 11:00:36 ubuntu consul[846683]: 2022-11-22T11:00:36.638Z [ERROR] agent.server.raft: failed to make requestVote RPC: target="{Voter 826026f7-6ed8-c6b2-e452-3bd40a62d53a 192.168.2.102:8300}" error="dial tcp 192.168.2.30:0->192.168.2.102:8300: i/o timeout"
Nov 22 11:00:36 ubuntu consul[846683]: 2022-11-22T11:00:36.727Z [WARN]  agent.server.raft: Election timeout reached, restarting election
Nov 22 11:00:36 ubuntu consul[846683]: 2022-11-22T11:00:36.727Z [INFO]  agent.server.raft: entering candidate state: node="Node at 192.168.2.30:8300 [Candidate]" term=16
Nov 22 11:00:36 ubuntu consul[846683]: 2022-11-22T11:00:36.750Z [WARN]  agent.server.raft: unable to get address for server, using fallback address: id=826026f7-6ed8-c6b2-e452-3bd40a62d53a fallback=192.168.2.102:8300 error="Could not find address for server id 826026f7-6ed8-c6b2-e452-3bd40a62d53a"
Nov 22 11:00:36 ubuntu consul[846683]: 2022-11-22T11:00:36.750Z [WARN]  agent.server.raft: unable to get address for server, using fallback address: id=bdd9f217-4eba-d13d-208b-172624a72dd3 fallback=192.168.2.101:8300 error="Could not find address for server id bdd9f217-4eba-d13d-208b-172624a72dd3"
Nov 22 11:00:38 ubuntu consul[846683]: 2022-11-22T11:00:38.133Z [WARN]  agent.server.raft: Election timeout reached, restarting election
Nov 22 11:00:38 ubuntu consul[846683]: 2022-11-22T11:00:38.134Z [INFO]  agent.server.raft: entering candidate state: node="Node at 192.168.2.30:8300 [Candidate]" term=17
Nov 22 11:00:38 ubuntu consul[846683]: 2022-11-22T11:00:38.140Z [WARN]  agent.server.raft: unable to get address for server, using fallback address: id=826026f7-6ed8-c6b2-e452-3bd40a62d53a fallback=192.168.2.102:8300 error="Could not find address for server id 826026f7-6ed8-c6b2-e452-3bd40a62d53a"
Nov 22 11:00:38 ubuntu consul[846683]: 2022-11-22T11:00:38.140Z [WARN]  agent.server.raft: unable to get address for server, using fallback address: id=bdd9f217-4eba-d13d-208b-172624a72dd3 fallback=192.168.2.101:8300 error="Could not find address for server id bdd9f217-4eba-d13d-208b-172624a72dd3"
Nov 22 11:00:38 ubuntu consul[846683]: 2022-11-22T11:00:38.227Z [INFO]  agent.server.serf.lan: serf: attempting reconnect to n2 192.168.2.102:8301
Nov 22 11:00:38 ubuntu consul[846683]: 2022-11-22T11:00:38.231Z [WARN]  agent.server.memberlist.lan: memberlist: Refuting a suspect message (from: ubuntu)
Nov 22 11:00:38 ubuntu consul[846683]: 2022-11-22T11:00:38.303Z [INFO]  agent.server.serf.lan: serf: EventMemberJoin: n2 192.168.2.102
Nov 22 11:00:38 ubuntu consul[846683]: 2022-11-22T11:00:38.304Z [INFO]  agent.server: Adding LAN server: server="n2 (Addr: tcp/192.168.2.102:8300) (DC: dc1)"
Nov 22 11:00:38 ubuntu consul[846683]: 2022-11-22T11:00:38.455Z [INFO]  agent.server.serf.lan: serf: EventMemberJoin: n1 192.168.2.101
Nov 22 11:00:38 ubuntu consul[846683]: 2022-11-22T11:00:38.456Z [INFO]  agent.server: Adding LAN server: server="n1 (Addr: tcp/192.168.2.101:8300) (DC: dc1)"
Nov 22 11:00:39 ubuntu consul[846683]: 2022-11-22T11:00:39.562Z [WARN]  agent.server.raft: Election timeout reached, restarting election
Nov 22 11:00:39 ubuntu consul[846683]: 2022-11-22T11:00:39.562Z [INFO]  agent.server.raft: entering candidate state: node="Node at 192.168.2.30:8300 [Candidate]" term=18
Nov 22 11:00:39 ubuntu consul[846683]: 2022-11-22T11:00:39.646Z [WARN]  agent.server.memberlist.wan: memberlist: Was able to connect to n2.dc1 over TCP but UDP probes failed, network may be misconfigured
Nov 22 11:00:41 ubuntu consul[846683]: 2022-11-22T11:00:41.336Z [WARN]  agent.server.raft: Election timeout reached, restarting election
Nov 22 11:00:41 ubuntu consul[846683]: 2022-11-22T11:00:41.336Z [INFO]  agent.server.raft: entering candidate state: node="Node at 192.168.2.30:8300 [Candidate]" term=19
Nov 22 11:00:42 ubuntu consul[846683]: 2022-11-22T11:00:42.054Z [INFO]  agent.server: New leader elected: payload=n1
Nov 22 11:00:42 ubuntu consul[846683]: 2022-11-22T11:00:42.439Z [WARN]  agent.server.raft: Election timeout reached, restarting election

Oh, and I have autopilot turned off this time.

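These logs show a networking glitch: the node ubuntu (192.168.2.30) lost contact with both n1 and n2, its requestVote RPCs to them timed out, and it kept restarting elections with an ever-increasing term.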
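If you want to summarize that pattern rather than read it line by line, here is a rough Python sketch that tallies failed requestVote RPCs per target and lists the terms a node cycled through as a candidate. The function names are made up for illustration, and the regular expressions only assume the log line shapes seen in the output pasted above.

```python
import re
from collections import Counter

# Matches lines such as:
#   failed to make requestVote RPC: target="{Voter <id> <ip:port>}" error="..."
VOTE_FAILURE = re.compile(
    r'failed to make requestVote RPC: target="\{Voter \S+ (?P<addr>[^}]+)\}"'
)

# Matches lines such as:
#   entering candidate state: node="Node at ... [Candidate]" term=14
CANDIDATE_TERM = re.compile(r"entering candidate state: .*term=(?P<term>\d+)")


def vote_failures_by_target(log_text):
    """Count failed requestVote RPCs, keyed by the target's ip:port."""
    counts = Counter()
    for line in log_text.splitlines():
        match = VOTE_FAILURE.search(line)
        if match:
            counts[match.group("addr")] += 1
    return counts


def candidate_terms(log_text):
    """Return the terms, in log order, in which the node became a candidate."""
    terms = []
    for line in log_text.splitlines():
        match = CANDIDATE_TERM.search(line)
        if match:
            terms.append(int(match.group("term")))
    return terms
```

Fed something like the output of `journalctl -u consul` (assuming that is your unit name), a long run of consecutive terms together with timeouts against every other server points at the network rather than at Raft itself.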

I also don’t think you actually have the relevant autopilot functionality disabled, as I see the “failed” node ubuntu being forcibly removed from the cluster membership.

Please show the output of

consul operator autopilot get-config

so we can see the actual autopilot configuration in your cluster.
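The same configuration is also exposed over the HTTP API at `/v1/operator/autopilot/configuration`, which is handy if you want to check it from a script. Below is a minimal Python sketch. It assumes a local agent listening on the default HTTP port 8500 with no ACL token required; adjust the address and add a token header if your cluster needs one. The helper names are hypothetical.

```python
import json
import urllib.request

# Assumption: local agent, default HTTP port, no ACL token needed.
CONSUL_ADDR = "http://127.0.0.1:8500"


def fetch_autopilot_config(addr=CONSUL_ADDR):
    """Fetch the autopilot configuration from the agent's HTTP API."""
    url = addr + "/v1/operator/autopilot/configuration"
    with urllib.request.urlopen(url, timeout=5) as resp:
        return json.loads(resp.read().decode("utf-8"))


def dead_server_cleanup_enabled(config):
    """Return True if autopilot is set to remove dead servers automatically."""
    # CleanupDeadServers is the field behind the dead-server removal behaviour.
    return bool(config.get("CleanupDeadServers", False))
```

Calling `dead_server_cleanup_enabled(fetch_autopilot_config())` on one of the servers should tell you whether dead-server cleanup is really off.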

Oy. You’re right. It turns out the Ansible role I was using defaults those settings to on, so removing the variables still left autopilot enabled. I’m going to recreate the cluster again and see how things go.