3-server Nomad cluster seems to become unstable after brief network partition of non-leader server?

I’ve been exploring Nomad with a bare-metal cluster of 3 servers and 7 clients, running Nomad 1.3.5. Everything communicates over a Nebula overlay network on 192.168.100.0/24 to keep the config simple. The 3 servers are in different DCs in Montreal, with RTT between them under 2 ms over Nebula. The 7 clients are more dispersed, all under 80 ms.

A few days ago the cluster unexpectedly ended up in a state where all 7 clients showed as down and no services were running, yet all 3 servers appeared to be alive. The Nomad processes on the clients all seemed to be running without issues, yet restarting the nomad service on each client immediately caused that client to be marked as ready and to receive allocations again.

Going through the logs, this state appears to have been precipitated by a brief network outage at one of the Montreal DCs, i.e. affecting just one of the servers. The affected server was not the elected leader at the time of the outage. Although connectivity was only briefly out, the logs of all 3 servers then show a period of over an hour of leadership instability, with the cluster seeming to bounce between having no leader and having an elected leader.

I presume this somehow caused the servers to lose track of client status, but I haven’t gotten that far in my analysis yet, because I’ve since been trying to simulate this type of network partition with one of the servers, and the server behaviour has me puzzled.

My steps to simulate (rough shell commands are sketched just after this list):

  • Start with all 3 servers showing as alive. (In the attached examples, at this point vitalia is the leader and solaria and landra are followers.) Tail their logs.
  • Pick one of the servers that is not the leader (in the attached example I’ve picked landra) and stop its nebula service.
  • Wait for that server to show up as failed in the UI.
    • At this point the cluster seems totally fine: nothing unexpected in the logs, and work continues despite the 1 failed server.
  • Now, restart the nebula service on the ‘failed’ server (i.e. landra).
    • At this point things go haywire. The sample of logs below, taken from the point connectivity resumes, shows vitalia losing and regaining leadership in an apparently ongoing cycle.
    • This instability continues, at least over a period of minutes, until I intervene, in this example by stopping the nomad service on landra. This immediately stabilizes leadership between the 2 remaining servers. If I start the nomad service on landra again, the instability immediately returns.
    • To stabilize, I stop the nomad service on landra, remove /var/nomad, and restart the nomad service; at this point landra rejoins the cluster as a follower and everything seems good.
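
For reference, here is a rough sketch of the shell commands behind these steps, assuming Nebula and Nomad run as systemd units named nebula and nomad (adjust the unit names and data_dir path to your own setup):

# on the non-leader server (landra in this example)
systemctl stop nebula     # simulate the partition by cutting overlay connectivity
# ...wait for landra to show as failed in the UI...
systemctl start nebula    # restore connectivity; the leadership flapping starts here

# temporary mitigation: stopping nomad on landra stabilizes the remaining two servers
systemctl stop nomad

# full recovery: wipe landra's state dir and rejoin as a fresh follower
systemctl stop nomad
rm -rf /var/nomad
systemctl start nomad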

It doesn’t matter which way around I perform this experiment, e.g. with solaria as leader and then poking at vitalia; the result is similar.

Below is the salient server config, which is populated similarly on each server via the Ansible nomad role. I also tried this with rejoin_after_leave = true and with raft_multiplier = 5; the behaviour ends up the same, just occurring in slower motion in the latter case.

Why is this happening and what am I missing in the config?

The same 3 machines are also servers in a Consul cluster (Consul v1.13.1), and while I haven’t had the opportunity to go through the Consul logs in depth yet, on the surface the same thing appears to be happening there. The Consul cluster is primarily for health checking; the Nomad config is meant not to rely on Consul availability for discovery of Nomad cluster members.

Example server config (vitalia):

name = "vitalia"
region = "global"
datacenter = "mtl"

enable_debug = false
disable_update_check = false


bind_addr = "192.168.100.70"
advertise {
    http = "192.168.100.70:4646"
    rpc = "192.168.100.70:4647"
    serf = "192.168.100.70:4648"
}
ports {
    http = 4646
    rpc = 4647
    serf = 4648
}


data_dir = "/var/nomad"

log_level = "INFO"
enable_syslog = true

leave_on_terminate = true
leave_on_interrupt = false

consul {
  address = "192.168.100.70:8500"
}

server {
    enabled = true

    bootstrap_expect = 3
    
    start_join = ["192.168.100.30","192.168.100.50","192.168.100.70"]
    rejoin_after_leave = false

    enabled_schedulers = ["service","batch","system"]
    num_schedulers = 12

    node_gc_threshold = "24h"
    eval_gc_threshold = "1h"
    job_gc_threshold = "4h"
    deployment_gc_threshold = "1h"

    encrypt = ""

    raft_protocol = 3
}

Log extract (vitalia):

Sep 21 23:27:22 vitalia nomad[713179]:     2022-09-21T23:27:22.303Z [WARN]  nomad.raft: rejecting vote request since node is not a voter: from=192.168.100.50:4647
Sep 21 23:27:22 vitalia nomad[713179]:     2022-09-21T23:27:22.417Z [ERROR] nomad.raft: failed to appendEntries to: peer="{Voter 56d28440-9cbc-b967-dc8d-178c9db3154d 192.168.100.50:4647}" error="dial tcp 192.168.100.50:4647: i/o timeout"
Sep 21 23:27:23 vitalia nomad[713179]:     2022-09-21T23:27:23.223Z [ERROR] nomad.raft: failed to heartbeat to: peer=192.168.100.50:4647 error="dial tcp 192.168.100.50:4647: i/o timeout"
Sep 21 23:27:23 vitalia nomad[713179]:     2022-09-21T23:27:23.637Z [WARN]  nomad.raft: rejecting vote request since node is not a voter: from=192.168.100.50:4647
Sep 21 23:27:25 vitalia nomad[713179]:     2022-09-21T23:27:25.037Z [WARN]  nomad.raft: rejecting vote request since node is not a voter: from=192.168.100.50:4647
Sep 21 23:27:26 vitalia nomad[713179]:     2022-09-21T23:27:26.123Z [WARN]  nomad.raft: rejecting vote request since node is not a voter: from=192.168.100.50:4647
Sep 21 23:27:27 vitalia nomad[713179]:     2022-09-21T23:27:27.975Z [WARN]  nomad.raft: rejecting vote request since node is not a voter: from=192.168.100.50:4647
Sep 21 23:27:29 vitalia nomad[713179]:     2022-09-21T23:27:29.576Z [WARN]  nomad.raft: rejecting vote request since node is not a voter: from=192.168.100.50:4647
Sep 21 23:27:30 vitalia nomad[713179]:     2022-09-21T23:27:30.948Z [WARN]  nomad.raft: rejecting vote request since node is not a voter: from=192.168.100.50:4647
Sep 21 23:27:31 vitalia nomad[713179]:     2022-09-21T23:27:31.875Z [WARN]  nomad: memberlist: Refuting a suspect message (from: solaria.global)
Sep 21 23:27:32 vitalia nomad[713179]:     2022-09-21T23:27:32.246Z [INFO]  nomad: serf: EventMemberJoin: landra.global 192.168.100.50
Sep 21 23:27:32 vitalia nomad[713179]:     2022-09-21T23:27:32.246Z [INFO]  nomad: adding server: server="landra.global (Addr: 192.168.100.50:4647) (DC: mtl)"
Sep 21 23:27:32 vitalia nomad[713179]:     2022-09-21T23:27:32.246Z [INFO]  nomad.raft: updating configuration: command=AddNonvoter server-id=56d28440-9cbc-b967-dc8d-178c9db3154d server-addr=192.168.100.50:4647 servers="[{Suffrage:Voter ID:84fbb6ef-b020-7654-40cf-2842347c012e Address:192.168.100.30:4647} {Suffrage:Voter ID:3c8ac5b0-744f-7106-0312-56854f286c8d Address:192.168.100.70:4647} {Suffrage:Nonvoter ID:56d28440-9cbc-b967-dc8d-178c9db3154d Address:192.168.100.50:4647}]"
Sep 21 23:27:32 vitalia nomad[713179]:     2022-09-21T23:27:32.249Z [INFO]  nomad.raft: added peer, starting replication: peer=56d28440-9cbc-b967-dc8d-178c9db3154d
Sep 21 23:27:32 vitalia nomad[713179]:     2022-09-21T23:27:32.253Z [ERROR] nomad.raft: peer has newer term, stopping replication: peer="{Nonvoter 56d28440-9cbc-b967-dc8d-178c9db3154d 192.168.100.50:4647}"
Sep 21 23:27:32 vitalia nomad[713179]:     2022-09-21T23:27:32.253Z [INFO]  nomad.raft: entering follower state: follower="Node at 192.168.100.70:4647 [Follower]" leader-address= leader-id=
Sep 21 23:27:32 vitalia nomad[713179]:     2022-09-21T23:27:32.253Z [ERROR] nomad: failed to add raft peer: error="leadership lost while committing log"
Sep 21 23:27:32 vitalia nomad[713179]:     2022-09-21T23:27:32.253Z [ERROR] nomad: failed to reconcile member: member="{landra.global 192.168.100.50 4648 map[build:1.3.5 dc:mtl expect:3 id:56d28440-9cbc-b967-dc8d-178c9db3154d port:4647 raft_vsn:3 region:global role:nomad rpc_addr:192.168.100.50 vsn:1] alive 1 5 2 2 5 4}" error="leadership lost while committing log"
Sep 21 23:27:32 vitalia nomad[713179]:     2022-09-21T23:27:32.253Z [INFO]  nomad.raft: aborting pipeline replication: peer="{Voter 84fbb6ef-b020-7654-40cf-2842347c012e 192.168.100.30:4647}"
Sep 21 23:27:32 vitalia nomad[713179]:     2022-09-21T23:27:32.254Z [INFO]  nomad: cluster leadership lost
Sep 21 23:27:32 vitalia nomad[713179]:     2022-09-21T23:27:32.254Z [ERROR] worker: failed to dequeue evaluation: worker_id=908a1e55-5eed-d38e-e05f-2c2cc6a7a725 error="eval broker disabled"
Sep 21 23:27:32 vitalia nomad[713179]:     2022-09-21T23:27:32.254Z [ERROR] worker: failed to dequeue evaluation: worker_id=e35c5613-a3d4-9eaf-cb5a-2f4683082208 error="eval broker disabled"
Sep 21 23:27:32 vitalia nomad[713179]:     2022-09-21T23:27:32.254Z [ERROR] worker: failed to dequeue evaluation: worker_id=9efeeeb7-e3ad-eb46-4f22-6ed3d86c3435 error="eval broker disabled"
Sep 21 23:27:32 vitalia nomad[713179]:     2022-09-21T23:27:32.734Z [WARN]  nomad.raft: rejecting vote request since node is not a voter: from=192.168.100.50:4647
Sep 21 23:27:33 vitalia nomad[713179]:     2022-09-21T23:27:33.384Z [WARN]  nomad.raft: heartbeat timeout reached, starting election: last-leader-addr= last-leader-id=
Sep 21 23:27:33 vitalia nomad[713179]:     2022-09-21T23:27:33.384Z [INFO]  nomad.raft: entering candidate state: node="Node at 192.168.100.70:4647 [Candidate]" term=3441
Sep 21 23:27:33 vitalia nomad[713179]:     2022-09-21T23:27:33.406Z [INFO]  nomad.raft: election won: tally=2
Sep 21 23:27:33 vitalia nomad[713179]:     2022-09-21T23:27:33.406Z [INFO]  nomad.raft: entering leader state: leader="Node at 192.168.100.70:4647 [Leader]"
Sep 21 23:27:33 vitalia nomad[713179]:     2022-09-21T23:27:33.406Z [INFO]  nomad.raft: added peer, starting replication: peer=84fbb6ef-b020-7654-40cf-2842347c012e
Sep 21 23:27:33 vitalia nomad[713179]:     2022-09-21T23:27:33.406Z [INFO]  nomad.raft: added peer, starting replication: peer=56d28440-9cbc-b967-dc8d-178c9db3154d
Sep 21 23:27:33 vitalia nomad[713179]:     2022-09-21T23:27:33.406Z [INFO]  nomad: cluster leadership acquired
Sep 21 23:27:33 vitalia nomad[713179]:     2022-09-21T23:27:33.408Z [ERROR] nomad.raft: peer has newer term, stopping replication: peer="{Nonvoter 56d28440-9cbc-b967-dc8d-178c9db3154d 192.168.100.50:4647}"
Sep 21 23:27:33 vitalia nomad[713179]:     2022-09-21T23:27:33.408Z [INFO]  nomad.raft: pipelining replication: peer="{Voter 84fbb6ef-b020-7654-40cf-2842347c012e 192.168.100.30:4647}"
Sep 21 23:27:33 vitalia nomad[713179]:     2022-09-21T23:27:33.408Z [INFO]  nomad.raft: entering follower state: follower="Node at 192.168.100.70:4647 [Follower]" leader-address= leader-id=
Sep 21 23:27:33 vitalia nomad[713179]:     2022-09-21T23:27:33.408Z [ERROR] nomad: failed to wait for barrier: error="leadership lost while committing log"
Sep 21 23:27:33 vitalia nomad[713179]:     2022-09-21T23:27:33.409Z [INFO]  nomad: cluster leadership lost
Sep 21 23:27:33 vitalia nomad[713179]:     2022-09-21T23:27:33.408Z [INFO]  nomad.raft: aborting pipeline replication: peer="{Voter 84fbb6ef-b020-7654-40cf-2842347c012e 192.168.100.30:4647}"
Sep 21 23:27:34 vitalia nomad[713179]:     2022-09-21T23:27:34.380Z [WARN]  nomad.raft: rejecting vote request since node is not a voter: from=192.168.100.50:4647
Sep 21 23:27:35 vitalia nomad[713179]:     2022-09-21T23:27:35.632Z [WARN]  nomad.raft: rejecting vote request since node is not a voter: from=192.168.100.50:4647
Sep 21 23:27:37 vitalia nomad[713179]:     2022-09-21T23:27:37.397Z [WARN]  nomad.raft: rejecting vote request since node is not a voter: from=192.168.100.50:4647
Sep 21 23:27:37 vitalia nomad[713179]:     2022-09-21T23:27:37.479Z [ERROR] raft-net: failed to flush response: error="write tcp 192.168.100.70:4647->192.168.100.30:45518: write: broken pipe"
Sep 21 23:27:38 vitalia nomad[713179]:     2022-09-21T23:27:38.604Z [WARN]  nomad.raft: rejecting vote request since node is not a voter: from=192.168.100.50:4647
Sep 21 23:27:39 vitalia nomad[713179]:     2022-09-21T23:27:39.964Z [ERROR] http: request failed: method=GET path=/v1/agent/health?type=server error="{\"server\":{\"ok\":false,\"message\":\"rpc error: No cluster leader\"}}" code=500
Sep 21 23:27:39 vitalia nomad[713179]:     2022-09-21T23:27:39.965Z [ERROR] worker: failed to dequeue evaluation: worker_id=8edb7287-e6cb-61fa-17d4-47841e9616bd error="rpc error: No cluster leader"
Sep 21 23:27:40 vitalia nomad[713179]:     2022-09-21T23:27:40.024Z [ERROR] worker: failed to dequeue evaluation: worker_id=c122a0b3-da45-ab73-27a6-b0fe54c111c9 error="rpc error: No cluster leader"
Sep 21 23:27:40 vitalia nomad[713179]:     2022-09-21T23:27:40.035Z [ERROR] worker: failed to dequeue evaluation: worker_id=a6b74e8d-0e34-d429-4560-2a9e6aac1f00 error="rpc error: No cluster leader"
Sep 21 23:27:40 vitalia nomad[713179]:     2022-09-21T23:27:40.041Z [WARN]  nomad.raft: heartbeat timeout reached, starting election: last-leader-addr=192.168.100.30:4647 last-leader-id=84fbb6ef-b020-7654-40cf-2842347c012e
Sep 21 23:27:40 vitalia nomad[713179]:     2022-09-21T23:27:40.042Z [INFO]  nomad.raft: entering candidate state: node="Node at 192.168.100.70:4647 [Candidate]" term=3446
Sep 21 23:27:40 vitalia nomad[713179]:     2022-09-21T23:27:40.064Z [INFO]  nomad.raft: election won: tally=2
Sep 21 23:27:40 vitalia nomad[713179]:     2022-09-21T23:27:40.064Z [INFO]  nomad.raft: entering leader state: leader="Node at 192.168.100.70:4647 [Leader]"
Sep 21 23:27:40 vitalia nomad[713179]:     2022-09-21T23:27:40.064Z [INFO]  nomad.raft: added peer, starting replication: peer=84fbb6ef-b020-7654-40cf-2842347c012e
Sep 21 23:27:40 vitalia nomad[713179]:     2022-09-21T23:27:40.064Z [INFO]  nomad.raft: added peer, starting replication: peer=56d28440-9cbc-b967-dc8d-178c9db3154d
Sep 21 23:27:40 vitalia nomad[713179]:     2022-09-21T23:27:40.064Z [INFO]  nomad: cluster leadership acquired
Sep 21 23:27:40 vitalia nomad[713179]:     2022-09-21T23:27:40.066Z [ERROR] nomad.raft: peer has newer term, stopping replication: peer="{Nonvoter 56d28440-9cbc-b967-dc8d-178c9db3154d 192.168.100.50:4647}"
Sep 21 23:27:40 vitalia nomad[713179]:     2022-09-21T23:27:40.067Z [INFO]  nomad.raft: entering follower state: follower="Node at 192.168.100.70:4647 [Follower]" leader-address= leader-id=
Sep 21 23:27:40 vitalia nomad[713179]:     2022-09-21T23:27:40.067Z [ERROR] nomad: failed to wait for barrier: error="leadership lost while committing log"
Sep 21 23:27:40 vitalia nomad[713179]:     2022-09-21T23:27:40.067Z [INFO]  nomad: cluster leadership lost

Posting an update here for anyone else that runs into this issue.

Indeed Consul, independently, was behaving in the same manner. Through conversation on a related thread in the Consul category, the culprit seems to be the default autopilot setting to clean up dead servers.

Thus, one option that seems to avoid this behaviour is:
nomad operator autopilot set-config -cleanup-dead-servers=false
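
Before and after applying it, the effective settings can be checked with the autopilot get-config command. Since Consul was independently showing the same behaviour, the equivalent change presumably applies on the Consul side as well (I haven't verified that end to end):

nomad operator autopilot get-config
consul operator autopilot set-config -cleanup-dead-servers=false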

Further update for posterity: this sounds like it could be related to hashicorp/raft issue #524, “Unstable leadership when running server is demoted/removed without its participation”, which looks like it was addressed in today’s Nomad releases.
