No Cluster Leader when cluster node is down

Hello.
I had some problems with Kubernetes compatibility with the FUSE filesystem, and today I've installed Consul and Nomad on 2 hosts. So, I have 4 VMs:

  • master
  • node0
  • node1
  • node2

When I shut down node0 and then try to open the UI on node1, I get this error in my browser:

The cluster has no leader. Read about Outage Recovery.

I’ve installed Consul as a server on the master and as a client agent on node0, node1, and node2. Then I installed Nomad as both server and client on node0, node1, and node2.
Configuration (node0):

[root@prod-node0 nomad.d]# cat nomad.hcl
datacenter = "dc1"
data_dir = "/opt/nomad"
[root@prod-node0 nomad.d]# cat server.hcl
server {
  enabled = true
  bootstrap_expect = 3
}

consul {
  address             = "127.0.0.1:8500"
  server_service_name = "nomad"
  client_service_name = "nomad-client"
  auto_advertise      = true
  server_auto_join    = true
  client_auto_join    = true
}

bind_addr = "0.0.0.0" 

advertise {
  http = "172.18.0.13"
}

client {
  enabled = true
  servers = ["172.18.0.11", "172.18.0.13"]
}

On the other VMs I have the same configuration, with a different IP in the advertise block.

Hi @wusikijeronii ,

Thanks for using Nomad. Are you able to ssh into the nodes and grab the logs? What is your bootstrap expect set to?

It feels like you have some sort of split brain going on, and Raft can’t establish a leader. This seems possible since you are running an even number of server nodes. You should always run an odd number of server nodes (3, 5, or 7 are recommended), so that you can establish a quorum effectively. If you are not familiar with Raft or how Nomad uses it, here is a link to the documentation.
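
For reference, the quorum arithmetic behind that advice can be sketched in a few lines of Python. This is only an illustration of the majority rule Raft relies on; the function names are made up for this example and are not part of Nomad.

```python
def quorum(servers: int) -> int:
    """Smallest majority of a Raft peer set with `servers` voting members."""
    return servers // 2 + 1


def failure_tolerance(servers: int) -> int:
    """How many voting servers can be lost while a majority is still reachable."""
    return servers - quorum(servers)


# Print the table for small clusters.
for n in range(1, 8):
    print(f"servers={n} quorum={quorum(n)} tolerates={failure_tolerance(n)}")
```

Under this rule a 2-server peer set tolerates no failures, and a 4-server set tolerates no more than a 3-server set, which is why odd sizes are recommended.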

Thanks for the fast reply.
Logs from node2 (172.18.0.17; 172.18.0.13 is node0, which is switched off):

Nov 16 18:10:13 prod-node2 nomad[1006]:     2021-11-16T18:10:13.254+0300 [ERROR] nomad.raft: failed to make requestVote RPC: target="{Voter 172.18.0.13:4647 172.18.0.13:4647}" error="dial tcp 172.18.0.13:4647: connect: no route to host"
Nov 16 18:10:15 prod-node2 nomad[1006]:     2021-11-16T18:10:15.272+0300 [WARN]  nomad.raft: Election timeout reached, restarting election
Nov 16 18:10:15 prod-node2 nomad[1006]:     2021-11-16T18:10:15.272+0300 [INFO]  nomad.raft: entering candidate state: node="Node at 172.18.0.17:4647 [Candidate]" term=1362
Nov 16 18:10:15 prod-node2 nomad[1006]:     2021-11-16T18:10:15.784+0300 [ERROR] worker: failed to dequeue evaluation: error="No cluster leader"
Nov 16 18:10:16 prod-node2 nomad[1006]:     2021-11-16T18:10:16.006+0300 [ERROR] worker: failed to dequeue evaluation: error="No cluster leader"
Nov 16 18:10:16 prod-node2 nomad[1006]:     2021-11-16T18:10:16.038+0300 [ERROR] worker: failed to dequeue evaluation: error="No cluster leader"
Nov 16 18:10:16 prod-node2 nomad[1006]:     2021-11-16T18:10:16.186+0300 [ERROR] worker: failed to dequeue evaluation: error="No cluster leader"
Nov 16 18:10:16 prod-node2 nomad[1006]:     2021-11-16T18:10:16.326+0300 [ERROR] nomad.raft: failed to make requestVote RPC: target="{Voter 172.18.0.13:4647 172.18.0.13:4647}" error="dial tcp 172.18.0.13:4647: connect: no route to host"
Nov 16 18:10:18 prod-node2 nomad[1006]:     2021-11-16T18:10:18.315+0300 [WARN]  nomad.raft: Election timeout reached, restarting election
Nov 16 18:10:18 prod-node2 nomad[1006]:     2021-11-16T18:10:18.315+0300 [INFO]  nomad.raft: entering candidate state: node="Node at 172.18.0.17:4647 [Candidate]" term=1363
Nov 16 18:10:19 prod-node2 nomad[1006]:     2021-11-16T18:10:19.398+0300 [ERROR] nomad.raft: failed to make requestVote RPC: target="{Voter 172.18.0.13:4647 172.18.0.13:4647}" error="dial tcp 172.18.0.13:4647: connect: no route to host"
Nov 16 18:10:19 prod-node2 nomad[1006]:     2021-11-16T18:10:19.398+0300 [ERROR] nomad.raft: failed to make requestVote RPC: target="{Voter 172.18.0.13:4647 172.18.0.13:4647}" error="dial tcp 172.18.0.13:4647: connect: no route to host"
Nov 16 18:10:20 prod-node2 consul[734]: 2021-11-16T18:10:20.041+0300 [WARN]  agent: Check is now critical: check=_nomad-check-403d39544f6879ac4bb862ab3ff6b1762e491243
Nov 16 18:10:20 prod-node2 consul[734]: 2021-11-16T18:10:20.041+0300 [WARN]  agent: Check is now critical: check=_nomad-check-403d39544f6879ac4bb862ab3ff6b1762e491243
Nov 16 18:10:20 prod-node2 nomad[1006]:     2021-11-16T18:10:20.240+0300 [WARN]  nomad.raft: Election timeout reached, restarting election
Nov 16 18:10:20 prod-node2 nomad[1006]:     2021-11-16T18:10:20.240+0300 [INFO]  nomad.raft: entering candidate state: node="Node at 172.18.0.17:4647 [Candidate]" term=1364
Nov 16 18:10:20 prod-node2 nomad[1006]:     2021-11-16T18:10:20.308+0300 [ERROR] http: request failed: method=GET path=/v1/agent/health?type=server error="{"server":{"ok":false,"message":"No cluster leader"}}" code=500
Nov 16 18:10:21 prod-node2 nomad[1006]:     2021-11-16T18:10:21.514+0300 [ERROR] client.rpc: error performing RPC to server: error="rpc error: No cluster leader" rpc=Node.UpdateAlloc server=0.0.0.0:4647
Nov 16 18:10:21 prod-node2 nomad[1006]:     2021-11-16T18:10:21.515+0300 [ERROR] client.rpc: error performing RPC to server, deadline exceeded, cannot retry: error="rpc error: No cluster leader" rpc=Node.UpdateAlloc server=0.0.0.0:4647
Nov 16 18:10:21 prod-node2 nomad[1006]:     2021-11-16T18:10:21.515+0300 [ERROR] client: error updating allocations: error="rpc error: No cluster leader"
Nov 16 18:10:21 prod-node2 consul[734]: 2021-11-16T18:10:21.860+0300 [ERROR] agent.client: RPC failed to server: method=Coordinate.Update server=172.18.0.12:8300 error="rpc error making call: Permission denied"
Nov 16 18:10:21 prod-node2 consul[734]: 2021-11-16T18:10:21.860+0300 [WARN]  agent: Coordinate update blocked by ACLs: accessorID=
Nov 16 18:10:21 prod-node2 consul[734]: 2021-11-16T18:10:21.860+0300 [ERROR] agent.client: RPC failed to server: method=Coordinate.Update server=172.18.0.12:8300 error="rpc error making call: Permission denied"
Nov 16 18:10:21 prod-node2 consul[734]: 2021-11-16T18:10:21.860+0300 [WARN]  agent: Coordinate update blocked by ACLs: accessorID=
Nov 16 18:10:22 prod-node2 nomad[1006]:     2021-11-16T18:10:22.470+0300 [ERROR] nomad.raft: failed to make requestVote RPC: target="{Voter 172.18.0.13:4647 172.18.0.13:4647}" error="dial tcp 172.18.0.13:4647: connect: no route to host"
Nov 16 18:10:23 prod-node2 nomad[1006]:     2021-11-16T18:10:23.714+0300 [WARN]  nomad.raft: Election timeout reached, restarting election
Nov 16 18:10:23 prod-node2 nomad[1006]:     2021-11-16T18:10:23.714+0300 [INFO]  nomad.raft: entering candidate state: node="Node at 172.18.0.17:4647 [Candidate]" term=1365
Nov 16 18:10:26 prod-node2 nomad[1006]:     2021-11-16T18:10:26.869+0300 [WARN]  nomad.raft: Election timeout reached, restarting election
Nov 16 18:10:26 prod-node2 nomad[1006]:     2021-11-16T18:10:26.869+0300 [INFO]  nomad.raft: entering candidate state: node="Node at 172.18.0.17:4647 [Candidate]" term=1366
Nov 16 18:10:28 prod-node2 nomad[1006]:     2021-11-16T18:10:28.870+0300 [ERROR] nomad.raft: failed to make requestVote RPC: target="{Voter 172.18.0.13:4647 172.18.0.13:4647}" error="dial tcp 172.18.0.13:4647: connect: no route to host"
Nov 16 18:10:29 prod-node2 nomad[1006]:     2021-11-16T18:10:29.004+0300 [ERROR] http: request failed: method=GET path=/v1/status/leader?region=global error="No cluster leader" code=500
Nov 16 18:10:30 prod-node2 nomad[1006]:     2021-11-16T18:10:30.917+0300 [ERROR] worker: failed to dequeue evaluation: error="No cluster leader"
Nov 16 18:10:31 prod-node2 nomad[1006]:     2021-11-16T18:10:31.048+0300 [ERROR] worker: failed to dequeue evaluation: error="No cluster leader"
Nov 16 18:10:31 prod-node2 nomad[1006]:     2021-11-16T18:10:31.086+0300 [ERROR] worker: failed to dequeue evaluation: error="No cluster leader"
Nov 16 18:10:31 prod-node2 nomad[1006]:     2021-11-16T18:10:31.206+0300 [ERROR] worker: failed to dequeue evaluation: error="No cluster leader"
Nov 16 18:10:33 prod-node2 nomad[1006]:     2021-11-16T18:10:33.393+0300 [ERROR] client.rpc: error performing RPC to server: error="rpc error: No cluster leader" rpc=Node.GetClientAllocs server=172.18.0.11:4647
Nov 16 18:10:33 prod-node2 nomad[1006]:     2021-11-16T18:10:33.393+0300 [ERROR] client.rpc: error performing RPC to server, deadline exceeded, cannot retry: error="rpc error: No cluster leader" rpc=Node.GetClientAllocs server=172.18.0.11:4647
Nov 16 18:10:33 prod-node2 nomad[1006]:     2021-11-16T18:10:33.393+0300 [ERROR] client: error querying node allocations: error="rpc error: No cluster leader"
Nov 16 18:10:33 prod-node2 nomad[1006]:     2021-11-16T18:10:33.554+0300 [ERROR] client.rpc: error performing RPC to server: error="rpc error: No cluster leader" rpc=Node.UpdateAlloc server=172.18.0.11:4647
Nov 16 18:10:33 prod-node2 nomad[1006]:     2021-11-16T18:10:33.554+0300 [ERROR] client.rpc: error performing RPC to server, deadline exceeded, cannot retry: error="rpc error: No cluster leader" rpc=Node.UpdateAlloc server=172.18.0.11:4647
Nov 16 18:10:33 prod-node2 nomad[1006]:     2021-11-16T18:10:33.554+0300 [ERROR] client: error updating allocations: error="rpc error: No cluster leader"
Nov 16 18:10:34 prod-node2 nomad[1006]:     2021-11-16T18:10:34.144+0300 [INFO]  nomad.raft: entering follower state: follower="Node at 172.18.0.17:4647 [Follower]" leader=
Nov 16 18:10:34 prod-node2 sshd[1093]: Connection closed by 172.18.0.2 port 33338 [preauth]
Nov 16 18:10:35 prod-node2 consul[734]: 2021-11-16T18:10:35.042+0300 [WARN]  agent: Check is now critical: check=_nomad-check-403d39544f6879ac4bb862ab3ff6b1762e491243
Nov 16 18:10:35 prod-node2 consul[734]: 2021-11-16T18:10:35.042+0300 [WARN]  agent: Check is now critical: check=_nomad-check-403d39544f6879ac4bb862ab3ff6b1762e491243
Nov 16 18:10:35 prod-node2 nomad[1006]:     2021-11-16T18:10:35.166+0300 [ERROR] http: request failed: method=GET path=/v1/agent/health?type=server error="{"server":{"ok":false,"message":"No cluster leader"}}" code=500

bootstrap_expect is set to 3. For now, Nomad is installed on 3 servers (node0, node1, node2).
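
One way to summarize a log like the one above is to tally which peers the failed requestVote RPCs were aimed at. The following is a rough sketch, not an official tool: the regular expression is written against the line format shown in this thread, and the helper name is hypothetical. It only reveals peers that could not be reached; to see the full Raft peer set, `nomad operator raft list-peers` is the usual command (with `-stale` when there is no leader).

```python
import re
from collections import Counter

# Captures the address in target="{Voter <id> <address>}" on requestVote errors.
VOTE_TARGET = re.compile(
    r'failed to make requestVote RPC: target="\{Voter \S+ (\S+?)\}"'
)


def vote_targets(lines):
    """Count how often each peer address appears as a failed requestVote target."""
    counts = Counter()
    for line in lines:
        match = VOTE_TARGET.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Two lines copied from the node2 log in this thread.
sample = [
    'Nov 16 18:10:13 prod-node2 nomad[1006]:     2021-11-16T18:10:13.254+0300 [ERROR] nomad.raft: failed to make requestVote RPC: target="{Voter 172.18.0.13:4647 172.18.0.13:4647}" error="dial tcp 172.18.0.13:4647: connect: no route to host"',
    'Nov 16 18:10:15 prod-node2 nomad[1006]:     2021-11-16T18:10:15.272+0300 [WARN]  nomad.raft: Election timeout reached, restarting election',
]
print(vote_targets(sample))
```

In practice the input would come from something like the journal output for the Nomad unit, fed in line by line.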

These lines look really suspicious. It feels like they don’t all have the required/matching config in the server agent configs. Can you paste all the agent configs without secrets?

Hmm. It works if I disable ACLs in Consul. Do I need to pass an ACL token from Nomad to Consul? How?
My ACL config:

{
  "acl": {
    "enabled" : true,
    "default_policy" : "deny",
    "down_policy": "extend-cache",
    "enable_token_persistence" : true,
    "tokens" : {
      "agent": "a73f0b97-49d1-8e30-b334-a85331080579"
    }
  }
}

@DerekStrickland, I tried to follow this guide: Secure Nomad Jobs with Consul Service Mesh | Nomad - HashiCorp Learn. I tried:

  • Client policy
  • Server policy
  • Add token in the consul block on each server and restart Consul

UPD: After rebooting everything, it fails again, even without ACLs.
Now the leader is node2. After shutting node2 down:

Nov 16 19:43:04 prod-node0 nomad[756]:     2021-11-16T19:43:04.648+0300 [ERROR] client.rpc: error performing RPC to server: error="rpc error: No cluster leader" rpc=Node.UpdateAlloc server=172.18.0.11:4647
Nov 16 19:43:04 prod-node0 nomad[756]:     2021-11-16T19:43:04.648+0300 [ERROR] client.rpc: error performing RPC to server, deadline exceeded, cannot retry: error="rpc error: No cluster leader" rpc=Node.UpdateAlloc server=172.18.0.11:4647
Nov 16 19:43:04 prod-node0 nomad[756]:     2021-11-16T19:43:04.648+0300 [ERROR] client: error updating allocations: error="rpc error: No cluster leader"
Nov 16 19:43:04 prod-node0 nomad[756]:     2021-11-16T19:43:04.972+0300 [WARN]  nomad.raft: Election timeout reached, restarting election
Nov 16 19:43:04 prod-node0 nomad[756]:     2021-11-16T19:43:04.972+0300 [INFO]  nomad.raft: entering candidate state: node="Node at 172.18.0.13:4647 [Candidate]" term=3191
Nov 16 19:43:05 prod-node0 nomad[756]:     2021-11-16T19:43:05.503+0300 [ERROR] nomad.raft: failed to make requestVote RPC: target="{Voter 172.18.0.17:4647 172.18.0.17:4647}" error="dial tcp 172.18.0.17:4647: connect: no route to host"
Nov 16 19:43:05 prod-node0 nomad[756]:     2021-11-16T19:43:05.503+0300 [ERROR] nomad.raft: failed to make requestVote RPC: target="{Voter 172.18.0.17:4647 172.18.0.17:4647}" error="dial tcp 172.18.0.17:4647: connect: no route to host"
Nov 16 19:43:06 prod-node0 nomad[756]:     2021-11-16T19:43:06.219+0300 [WARN]  nomad.raft: Election timeout reached, restarting election
Nov 16 19:43:06 prod-node0 nomad[756]:     2021-11-16T19:43:06.219+0300 [INFO]  nomad.raft: entering candidate state: node="Node at 172.18.0.13:4647 [Candidate]" term=3192
Nov 16 19:43:07 prod-node0 nomad[756]:     2021-11-16T19:43:07.595+0300 [WARN]  nomad.raft: Election timeout reached, restarting election
Nov 16 19:43:07 prod-node0 nomad[756]:     2021-11-16T19:43:07.595+0300 [INFO]  nomad.raft: entering candidate state: node="Node at 172.18.0.13:4647 [Candidate]" term=3193
Nov 16 19:43:08 prod-node0 nomad[756]:     2021-11-16T19:43:08.575+0300 [ERROR] nomad.raft: failed to make requestVote RPC: target="{Voter 172.18.0.17:4647 172.18.0.17:4647}" error="dial tcp 172.18.0.17:4647: connect: no route to host"
Nov 16 19:43:09 prod-node0 nomad[756]:     2021-11-16T19:43:09.256+0300 [WARN]  nomad.raft: Election timeout reached, restarting election
Nov 16 19:43:09 prod-node0 nomad[756]:     2021-11-16T19:43:09.256+0300 [INFO]  nomad.raft: entering candidate state: node="Node at 172.18.0.13:4647 [Candidate]" term=3194
Nov 16 19:43:11 prod-node0 nomad[756]:     2021-11-16T19:43:11.177+0300 [WARN]  nomad.raft: Election timeout reached, restarting election
Nov 16 19:43:11 prod-node0 nomad[756]:     2021-11-16T19:43:11.177+0300 [INFO]  nomad.raft: entering candidate state: node="Node at 172.18.0.13:4647 [Candidate]" term=3195
Nov 16 19:43:11 prod-node0 nomad[756]:     2021-11-16T19:43:11.642+0300 [ERROR] client.rpc: error performing RPC to server: error="rpc error: No cluster leader" rpc=Node.Register server=172.18.0.13:4647
Nov 16 19:43:11 prod-node0 nomad[756]:     2021-11-16T19:43:11.642+0300 [ERROR] client.rpc: error performing RPC to server, deadline exceeded, cannot retry: error="rpc error: No cluster leader" rpc=Node.Register server=172.18.0.13:4647
Nov 16 19:43:11 prod-node0 nomad[756]:     2021-11-16T19:43:11.642+0300 [ERROR] client: error registering: error="rpc error: No cluster leader"
Nov 16 19:43:11 prod-node0 nomad[756]:     2021-11-16T19:43:11.647+0300 [ERROR] nomad.raft: failed to make requestVote RPC: target="{Voter 172.18.0.17:4647 172.18.0.17:4647}" error="dial tcp 172.18.0.17:4647: connect: no route to host"
Nov 16 19:43:11 prod-node0 nomad[756]:     2021-11-16T19:43:11.647+0300 [ERROR] nomad.raft: failed to make requestVote RPC: target="{Voter 172.18.0.17:4647 172.18.0.17:4647}" error="dial tcp 172.18.0.17:4647: connect: no route to host"
Nov 16 19:43:12 prod-node0 nomad[756]:     2021-11-16T19:43:12.371+0300 [WARN]  nomad.raft: Election timeout reached, restarting election
Nov 16 19:43:12 prod-node0 nomad[756]:     2021-11-16T19:43:12.371+0300 [INFO]  nomad.raft: entering candidate state: node="Node at 172.18.0.13:4647 [Candidate]" term=3196
Nov 16 19:43:13 prod-node0 nomad[756]:     2021-11-16T19:43:13.250+0300 [ERROR] http: request failed: method=GET path=/v1/status/leader?region=global error="No cluster leader" code=500
Nov 16 19:43:13 prod-node0 nomad[756]:     2021-11-16T19:43:13.658+0300 [WARN]  nomad.raft: Election timeout reached, restarting election
Nov 16 19:43:13 prod-node0 nomad[756]:     2021-11-16T19:43:13.658+0300 [INFO]  nomad.raft: entering candidate state: node="Node at 172.18.0.13:4647 [Candidate]" term=3197
Nov 16 19:43:14 prod-node0 nomad[756]:     2021-11-16T19:43:14.719+0300 [ERROR] nomad.raft: failed to make requestVote RPC: target="{Voter 172.18.0.17:4647 172.18.0.17:4647}" error="dial tcp 172.18.0.17:4647: connect: no route to host"
Nov 16 19:43:14 prod-node0 nomad[756]:     2021-11-16T19:43:14.720+0300 [ERROR] nomad.raft: failed to make requestVote RPC: target="{Voter 172.18.0.17:4647 172.18.0.17:4647}" error="dial tcp 172.18.0.17:4647: connect: no route to host"
Nov 16 19:43:15 prod-node0 nomad[756]:     2021-11-16T19:43:15.719+0300 [WARN]  nomad.raft: Election timeout reached, restarting election
Nov 16 19:43:15 prod-node0 nomad[756]:     2021-11-16T19:43:15.719+0300 [INFO]  nomad.raft: entering candidate state: node="Node at 172.18.0.13:4647 [Candidate]" term=3198
Nov 16 19:43:17 prod-node0 nomad[756]:     2021-11-16T19:43:17.265+0300 [ERROR] worker: failed to dequeue evaluation: error="No cluster leader"
Nov 16 19:43:17 prod-node0 consul[755]: 2021-11-16T19:43:17.291+0300 [WARN]  agent: Check is now critical: check=_nomad-check-a781316ac3ae3dbed74568df5cbdfb9ef62d4c9f
Nov 16 19:43:17 prod-node0 consul[755]: 2021-11-16T19:43:17.291+0300 [WARN]  agent: Check is now critical: check=_nomad-check-a781316ac3ae3dbed74568df5cbdfb9ef62d4c9f
Nov 16 19:43:17 prod-node0 nomad[756]:     2021-11-16T19:43:17.312+0300 [ERROR] http: request failed: method=GET path=/v1/agent/health?type=server error="{"server":{"ok":false,"message":"No cluster leader"}}" code=500
Nov 16 19:43:17 prod-node0 nomad[756]:     2021-11-16T19:43:17.467+0300 [ERROR] worker: failed to dequeue evaluation: error="No cluster leader"
Nov 16 19:43:17 prod-node0 nomad[756]:     2021-11-16T19:43:17.700+0300 [ERROR] worker: failed to dequeue evaluation: error="No cluster leader"
Nov 16 19:43:17 prod-node0 nomad[756]:     2021-11-16T19:43:17.791+0300 [ERROR] nomad.raft: failed to make requestVote RPC: target="{Voter 172.18.0.17:4647 172.18.0.17:4647}" error="dial tcp 172.18.0.17:4647: connect: no route to host"
Nov 16 19:43:17 prod-node0 nomad[756]:     2021-11-16T19:43:17.819+0300 [WARN]  nomad.raft: Election timeout reached, restarting election
Nov 16 19:43:17 prod-node0 nomad[756]:     2021-11-16T19:43:17.820+0300 [INFO]  nomad.raft: entering candidate state: node="Node at 172.18.0.13:4647 [Candidate]" term=3199
Nov 16 19:43:18 prod-node0 nomad[756]:     2021-11-16T19:43:18.027+0300 [ERROR] worker: failed to dequeue evaluation: error="No cluster leader"
Nov 16 19:43:18 prod-node0 nomad[756]:     2021-11-16T19:43:18.268+0300 [ERROR] client.rpc: error performing RPC to server: error="rpc error: No cluster leader" rpc=Node.UpdateAlloc server=0.0.0.0:4647
Nov 16 19:43:18 prod-node0 nomad[756]:     2021-11-16T19:43:18.268+0300 [ERROR] client.rpc: error performing RPC to server, deadline exceeded, cannot retry: error="rpc error: No cluster leader" rpc=Node.UpdateAlloc server=0.0.0.0:4647
Nov 16 19:43:18 prod-node0 nomad[756]:     2021-11-16T19:43:18.268+0300 [ERROR] client: error updating allocations: error="rpc error: No cluster leader"
Nov 16 19:43:19 prod-node0 nomad[756]:     2021-11-16T19:43:19.397+0300 [WARN]  nomad.raft: Election timeout reached, restarting election
Nov 16 19:43:19 prod-node0 nomad[756]:     2021-11-16T19:43:19.397+0300 [INFO]  nomad.raft: entering candidate state: node="Node at 172.18.0.13:4647 [Candidate]" term=3200
Nov 16 19:43:20 prod-node0 nomad[756]:     2021-11-16T19:43:20.864+0300 [ERROR] nomad.raft: failed to make requestVote RPC: target="{Voter 172.18.0.17:4647 172.18.0.17:4647}" error="dial tcp 172.18.0.17:4647: connect: no route to host"
Nov 16 19:43:20 prod-node0 nomad[756]:     2021-11-16T19:43:20.864+0300 [ERROR] nomad.raft: failed to make requestVote RPC: target="{Voter 172.18.0.17:4647 172.18.0.17:4647}" error="dial tcp 172.18.0.17:4647: connect: no route to host"
Nov 16 19:43:21 prod-node0 nomad[756]:     2021-11-16T19:43:21.076+0300 [WARN]  nomad.raft: Election timeout reached, restarting election
Nov 16 19:43:21 prod-node0 nomad[756]:     2021-11-16T19:43:21.076+0300 [INFO]  nomad.raft: entering candidate state: node="Node at 172.18.0.13:4647 [Candidate]" term=3201

Fixed with peers.json
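
The thread does not show the file that was used, so the following is only a sketch of what such a recovery file might look like. Nomad's outage recovery guide describes stopping the servers and placing a peers.json in the server/raft directory under data_dir (which would be /opt/nomad/server/raft/peers.json with the config above). For Raft protocol version 2 the file is a JSON array of "ip:port" strings; version 3 uses objects with id, address, and non_voter instead. The addresses below are assumptions: 172.18.0.13 and 172.18.0.17 come from the logs, while 172.18.0.11 for node1 is a guess based on the client servers list.

```python
import json

# Assumed server RPC addresses on port 4647 (see the caveats above).
SERVERS = ["172.18.0.11:4647", "172.18.0.13:4647", "172.18.0.17:4647"]


def peers_json(addresses):
    """Render a Raft protocol v2 style peers.json: a JSON array of "ip:port" strings."""
    # De-duplicate and sort so every server gets an identical file.
    return json.dumps(sorted(set(addresses)), indent=2)


print(peers_json(SERVERS))
```

The same file would go on every server before restarting them, since each one needs an identical view of the peer set.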

Glad to hear it! Thanks again for being part of the Nomad Community.