Getting "rejecting vote request since we have a leader" if a server node goes down and tries to re-join

I have a 3-server cluster with Consul (1.16.1) and Nomad (1.6.1) running on all nodes. If one of the nodes goes down, it has trouble re-joining the cluster. On the node that is trying to re-join, I’m seeing

nomad.raft: Election timeout reached, restarting election
nomad.raft: entering candidate state: node="Node at 192.168.1.123:4647 [Candidate]" term=481

while in the logs on one of the servers in the cluster I’m seeing

nomad[11322]:     2023-09-08T13:53:16.993-0700 [INFO]  nomad: memberlist: Marking node-that-went-down.global as failed, suspect timeout reached (0 peer confirmations)
nomad[11322]:     2023-09-08T13:53:16.994-0700 [INFO]  nomad: serf: EventMemberFailed: node-that-went-down.global 192.168.1.123
nomad[11322]:     2023-09-08T13:53:16.995-0700 [INFO]  nomad: removing server: server="node-that-went-down.global (Addr: 192.168.1.123:4647) (DC: dc1)"
nomad[11322]:     2023-09-08T13:53:16.996-0700 [INFO]  nomad: memberlist: Suspect node-that-went-down.global has failed, no acks received
nomad[11322]:     2023-09-08T13:53:18.586-0700 [WARN]  nomad.raft: rejecting vote request since we have a leader: from=192.168.1.123:4647 leader=100.104.245.24:4647 leader-id=f3e3cce9-8ecb-5d8b-3102-fcd71389ab44
nomad[11322]:     2023-09-08T13:53:20.053-0700 [WARN]  nomad.raft: rejecting vote request since we have a leader: from=192.168.1.123:4647 leader=100.104.245.24:4647 leader-id=f3e3cce9-8ecb-5d8b-3102-fcd71389ab44
nomad[11322]:     2023-09-08T13:53:21.938-0700 [WARN]  nomad.raft: rejecting vote request since we have a leader: from=192.168.1.123:4647 leader=100.104.245.24:4647 leader-id=f3e3cce9-8ecb-5d8b-3102-fcd71389ab44
nomad[11322]:     2023-09-08T13:53:23.367-0700 [WARN]  nomad.raft: rejecting vote request since node is not in configuration: from=192.168.1.123:4647
nomad[11322]:     2023-09-08T13:53:23.373-0700 [INFO]  nomad: serf: EventMemberLeave (forced): node-that-went-down.global 192.168.1.123
nomad[11322]:     2023-09-08T13:53:23.373-0700 [INFO]  nomad: removing server: server="subspace-compute1.global (Addr: 192.168.1.123:4647) (DC: dc1)"
nomad[11322]:     2023-09-08T13:53:25.221-0700 [WARN]  nomad.raft: rejecting vote request since node is not in configuration: from=192.168.1.123:4647
nomad[11322]:     2023-09-08T13:53:26.340-0700 [WARN]  nomad.raft: rejecting vote request since node is not in configuration: from=192.168.1.123:4647
nomad[11322]:     2023-09-08T13:53:28.213-0700 [WARN]  nomad.raft: rejecting vote request since node is not in configuration: from=192.168.1.123:4647
nomad[11322]:     2023-09-08T13:53:29.832-0700 [WARN]  nomad.raft: rejecting vote request since node is not in configuration: from=192.168.1.123:4647

The node was only down for about 5 minutes. I looked up the issue and it appeared to have been resolved in an earlier version. How can I resolve this so that if a server node goes down, it can automatically heal and re-join the cluster? This only seems to be an issue with Nomad; the Consul cluster seems to be okay.

Here’s my nomad.hcl config

advertise {
  http = "{{ GetInterfaceIP \"tailscale0\" }}"
  rpc = "{{ GetInterfaceIP \"tailscale0\" }}"
  serf = "{{ GetInterfaceIP \"tailscale0\" }}"
}

limits {
  # Disable client connection rate limiting which was causing lots of requests to time out via Web UI
  http_max_conns_per_client = 0
  rpc_max_conns_per_client = 0
}

client {
  enabled = true
  network_interface = "tailscale0"
}

server {
  enabled = true
  bootstrap_expect = 3

  default_scheduler_config {
    # Allows usage of memory_max
    memory_oversubscription_enabled = true
  }
}

ui {
  enabled = true
}

vault {
  enabled = true
  address = "http://vault.service.consul:8200"
}

consul {
  address = "127.0.0.1:8500"
}

Thanks

Actually, I think the issue was that I forgot to add this to my config, so the node was using its internal IP to broadcast to the cluster instead:

addresses {
  http = "{{ GetInterfaceIP \"tailscale0\" }}"
  rpc = "{{ GetInterfaceIP \"tailscale0\" }}"
  serf = "{{ GetInterfaceIP \"tailscale0\" }}"
}

Hope that was the right solution.
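
For anyone landing here later, this is a minimal sketch of how the two blocks might sit together in nomad.hcl. It assumes the Tailscale interface is named tailscale0 on every server, as in the config above. The comments describe the general roles of `addresses` (bind) and `advertise` as documented for the Nomad agent; whether this combination is what resolved the re-join problem is only what this thread suggests, not something verified independently.

```hcl
# Sketch only: combines the two blocks posted in this thread.

# Bind the HTTP, RPC, and Serf listeners to the Tailscale interface
# rather than the default bind address.
addresses {
  http = "{{ GetInterfaceIP \"tailscale0\" }}"
  rpc  = "{{ GetInterfaceIP \"tailscale0\" }}"
  serf = "{{ GetInterfaceIP \"tailscale0\" }}"
}

# Advertise the same Tailscale address to the other servers, so that
# peers see one consistent address for this node.
advertise {
  http = "{{ GetInterfaceIP \"tailscale0\" }}"
  rpc  = "{{ GetInterfaceIP \"tailscale0\" }}"
  serf = "{{ GetInterfaceIP \"tailscale0\" }}"
}
```

In the logs above, the vote requests come from 192.168.1.123:4647 while the leader is listed at 100.104.245.24:4647, which is consistent with one node using its LAN address while the rest of the cluster used Tailscale addresses.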

Hi @axsuul,

Looking over your configuration and the logs you provided, I believe the additional config entry is what is required to fix the error you’re seeing.

Thanks,
jrasell and the Nomad team