I have a 3-server cluster with Consul (1.16.1) and Nomad (1.6.1) running on all nodes. If one of the nodes goes down, it has trouble re-joining the cluster when it comes back up. On the node that is trying to re-join, I'm seeing:
nomad.raft: Election timeout reached, restarting election
nomad.raft: entering candidate state: node="Node at 192.168.1.123:4647 [Candidate]" term=481
while in the logs on one of the remaining servers in the cluster I'm seeing:
nomad[11322]: 2023-09-08T13:53:16.993-0700 [INFO] nomad: memberlist: Marking node-that-went-down.global as failed, suspect timeout reached (0 peer confirmations)
nomad[11322]: 2023-09-08T13:53:16.994-0700 [INFO] nomad: serf: EventMemberFailed: node-that-went-down.global 192.168.1.123
nomad[11322]: 2023-09-08T13:53:16.995-0700 [INFO] nomad: removing server: server="node-that-went-down.global (Addr: 192.168.1.123:4647) (DC: dc1)"
nomad[11322]: 2023-09-08T13:53:16.996-0700 [INFO] nomad: memberlist: Suspect node-that-went-down.global has failed, no acks received
nomad[11322]: 2023-09-08T13:53:18.586-0700 [WARN] nomad.raft: rejecting vote request since we have a leader: from=192.168.1.123:4647 leader=100.104.245.24:4647 leader-id=f3e3cce9-8ecb-5d8b-3102-fcd71389ab44
nomad[11322]: 2023-09-08T13:53:20.053-0700 [WARN] nomad.raft: rejecting vote request since we have a leader: from=192.168.1.123:4647 leader=100.104.245.24:4647 leader-id=f3e3cce9-8ecb-5d8b-3102-fcd71389ab44
nomad[11322]: 2023-09-08T13:53:21.938-0700 [WARN] nomad.raft: rejecting vote request since we have a leader: from=192.168.1.123:4647 leader=100.104.245.24:4647 leader-id=f3e3cce9-8ecb-5d8b-3102-fcd71389ab44
nomad[11322]: 2023-09-08T13:53:23.367-0700 [WARN] nomad.raft: rejecting vote request since node is not in configuration: from=192.168.1.123:4647
nomad[11322]: 2023-09-08T13:53:23.373-0700 [INFO] nomad: serf: EventMemberLeave (forced): node-that-went-down.global 192.168.1.123
nomad[11322]: 2023-09-08T13:53:23.373-0700 [INFO] nomad: removing server: server="node-that-went-down.global (Addr: 192.168.1.123:4647) (DC: dc1)"
nomad[11322]: 2023-09-08T13:53:25.221-0700 [WARN] nomad.raft: rejecting vote request since node is not in configuration: from=192.168.1.123:4647
nomad[11322]: 2023-09-08T13:53:26.340-0700 [WARN] nomad.raft: rejecting vote request since node is not in configuration: from=192.168.1.123:4647
nomad[11322]: 2023-09-08T13:53:28.213-0700 [WARN] nomad.raft: rejecting vote request since node is not in configuration: from=192.168.1.123:4647
nomad[11322]: 2023-09-08T13:53:29.832-0700 [WARN] nomad.raft: rejecting vote request since node is not in configuration: from=192.168.1.123:4647
The node was only down for about 5 minutes. I looked this up and the issue appeared to have been fixed in an earlier version. How can I resolve this so that if a server node goes down, it automatically heals and re-joins the cluster? This only seems to be an issue with Nomad; the Consul cluster is fine.
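In case it's useful, these are the commands I can run against one of the healthy servers to compare the gossip view with the raft view (I'm assuming, from the "not in configuration" messages, that the re-joining node has been dropped from the raft peer set); happy to post their output:

# Serf/gossip view of the server members
nomad server members

# Raft view of the voting peers
nomad operator raft list-peers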
Here’s my nomad.hcl config:
advertise {
  http = "{{ GetInterfaceIP \"tailscale0\" }}"
  rpc  = "{{ GetInterfaceIP \"tailscale0\" }}"
  serf = "{{ GetInterfaceIP \"tailscale0\" }}"
}

limits {
  # Disable client connection rate limiting which was causing lots of requests to time out via Web UI
  http_max_conns_per_client = 0
  rpc_max_conns_per_client  = 0
}

client {
  enabled           = true
  network_interface = "tailscale0"
}

server {
  enabled          = true
  bootstrap_expect = 3

  default_scheduler_config {
    # Allows usage of memory_max
    memory_oversubscription_enabled = true
  }
}

ui {
  enabled = true
}

vault {
  enabled = true
  address = "http://vault.service.consul:8200"
}

consul {
  address = "127.0.0.1:8500"
}
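For what it's worth, I don't have an explicit server_join stanza, so joining relies on the Consul integration. Would adding something like the sketch below inside the server block make re-joining more robust, or does it not help with the raft "not in configuration" rejections? (The addresses are placeholders, not my real config.)

server_join {
  retry_join     = ["192.168.1.121", "192.168.1.122", "192.168.1.123"]  # placeholder server addresses
  retry_max      = 0      # 0 = retry indefinitely
  retry_interval = "15s"
}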
Thanks