I have a 3-server cluster with Consul (1.16.1) and Nomad (1.6.1) running on all nodes. If one of the nodes goes down, it has trouble re-joining the cluster when it comes back up. On the node that is trying to re-join, I'm seeing:
nomad.raft: Election timeout reached, restarting election
nomad.raft: entering candidate state: node="Node at 192.168.1.123:4647 [Candidate]" term=481
while in the logs on one of the remaining servers in the cluster I'm seeing:
nomad[11322]: 2023-09-08T13:53:16.993-0700 [INFO] nomad: memberlist: Marking node-that-went-down.global as failed, suspect timeout reached (0 peer confirmations)
nomad[11322]: 2023-09-08T13:53:16.994-0700 [INFO] nomad: serf: EventMemberFailed: node-that-went-down.global 192.168.1.123
nomad[11322]: 2023-09-08T13:53:16.995-0700 [INFO] nomad: removing server: server="node-that-went-down.global (Addr: 192.168.1.123:4647) (DC: dc1)"
nomad[11322]: 2023-09-08T13:53:16.996-0700 [INFO] nomad: memberlist: Suspect node-that-went-down.global has failed, no acks received
nomad[11322]: 2023-09-08T13:53:18.586-0700 [WARN] nomad.raft: rejecting vote request since we have a leader: from=192.168.1.123:4647 leader=100.104.245.24:4647 leader-id=f3e3cce9-8ecb-5d8b-3102-fcd71389ab44
nomad[11322]: 2023-09-08T13:53:20.053-0700 [WARN] nomad.raft: rejecting vote request since we have a leader: from=192.168.1.123:4647 leader=100.104.245.24:4647 leader-id=f3e3cce9-8ecb-5d8b-3102-fcd71389ab44
nomad[11322]: 2023-09-08T13:53:21.938-0700 [WARN] nomad.raft: rejecting vote request since we have a leader: from=192.168.1.123:4647 leader=100.104.245.24:4647 leader-id=f3e3cce9-8ecb-5d8b-3102-fcd71389ab44
nomad[11322]: 2023-09-08T13:53:23.367-0700 [WARN] nomad.raft: rejecting vote request since node is not in configuration: from=192.168.1.123:4647
nomad[11322]: 2023-09-08T13:53:23.373-0700 [INFO] nomad: serf: EventMemberLeave (forced): node-that-went-down.global 192.168.1.123
nomad[11322]: 2023-09-08T13:53:23.373-0700 [INFO] nomad: removing server: server="node-that-went-down.global (Addr: 192.168.1.123:4647) (DC: dc1)"
nomad[11322]: 2023-09-08T13:53:25.221-0700 [WARN] nomad.raft: rejecting vote request since node is not in configuration: from=192.168.1.123:4647
nomad[11322]: 2023-09-08T13:53:26.340-0700 [WARN] nomad.raft: rejecting vote request since node is not in configuration: from=192.168.1.123:4647
nomad[11322]: 2023-09-08T13:53:28.213-0700 [WARN] nomad.raft: rejecting vote request since node is not in configuration: from=192.168.1.123:4647
nomad[11322]: 2023-09-08T13:53:29.832-0700 [WARN] nomad.raft: rejecting vote request since node is not in configuration: from=192.168.1.123:4647
The node was only down for about 5 minutes. I looked this up and the issue appeared to have been fixed in an earlier version. How can I resolve this so that if a server node goes down, it automatically heals and re-joins the cluster? This only seems to be an issue with Nomad; the Consul cluster is fine.
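In case it's useful, these are the commands I can run against one of the healthy servers to compare the gossip view with the raft view (I'm assuming, from the "not in configuration" messages, that the re-joining node has been dropped from the raft peer set); happy to post their output:

# Serf/gossip view of the server members
nomad server members

# Raft view of the voting peers
nomad operator raft list-peers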
Here’s my nomad.hcl config:
advertise {
  http = "{{ GetInterfaceIP \"tailscale0\" }}"
  rpc  = "{{ GetInterfaceIP \"tailscale0\" }}"
  serf = "{{ GetInterfaceIP \"tailscale0\" }}"
}

limits {
  # Disable client connection rate limiting which was causing lots of requests to time out via Web UI
  http_max_conns_per_client = 0
  rpc_max_conns_per_client  = 0
}

client {
  enabled           = true
  network_interface = "tailscale0"
}

server {
  enabled          = true
  bootstrap_expect = 3

  default_scheduler_config {
    # Allows usage of memory_max
    memory_oversubscription_enabled = true
  }
}

ui {
  enabled = true
}

vault {
  enabled = true
  address = "http://vault.service.consul:8200"
}

consul {
  address = "127.0.0.1:8500"
}
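For what it's worth, I don't have an explicit server_join stanza, so joining relies on the Consul integration. Would adding something like the sketch below inside the server block make re-joining more robust, or does it not help with the raft "not in configuration" rejections? (The addresses are placeholders, not my real config.)

server_join {
  retry_join     = ["192.168.1.121", "192.168.1.122", "192.168.1.123"]  # placeholder server addresses
  retry_max      = 0      # 0 = retry indefinitely
  retry_interval = "15s"
}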
Thanks