No cluster leader on single node cluster

I’m trying to set up Nomad as both the server and the client, just to run a couple of Windows applications on the same machine. No failover or anything like that is needed. We just need to make sure the services are running and that they get restarted if they crash for some reason.
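
For reference, the kind of job I want to run is just a raw_exec task pointing at a Windows executable, with a restart policy so it comes back if it crashes. Something roughly like this (the job name and paths below are made up):

job "my-windows-app" {
  datacenters = ["dc1"]
  type        = "service"

  group "app" {
    # Restart the task if it exits unexpectedly.
    restart {
      attempts = 3
      interval = "5m"
      delay    = "15s"
      mode     = "delay"
    }

    task "app" {
      driver = "raw_exec"

      config {
        # Made-up path to the Windows application.
        command = "C:\\apps\\my-app\\my-app.exe"
      }
    }
  }
}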

I can’t figure out how to get this working without using dev mode.

Here’s my client.hcl

data_dir  = "C:\\ProgramData\\nomad\\data\\"

log_file = "C:\\ProgramData\\nomad\\logs\\"
log_rotate_duration = "24h"
log_rotate_max_files = 30

bind_addr = "0.0.0.0" # the default

server {
  enabled          = true
  bootstrap_expect = 1
}

client {
  enabled       = true
}

plugin "raw_exec" {
  config {
    enabled = true
  }
}

And this is what I get when starting Nomad with the config above. I don’t understand why it can’t elect itself as the leader when it’s the only server.

PS C:\> nomad agent -config=C:\ProgramData\nomad\conf\client.hcl
==> WARNING: Bootstrap mode enabled! Potentially unsafe operation.
==> Loaded configuration from C:\ProgramData\nomad\conf\client.hcl
==> Starting Nomad agent...
==> Nomad agent configuration:

       Advertise Addrs: HTTP: 172.18.0.4:4646; RPC: 172.18.0.4:4647; Serf: 172.18.0.4:4648
            Bind Addrs: HTTP: 0.0.0.0:4646; RPC: 0.0.0.0:4647; Serf: 0.0.0.0:4648
                Client: true
             Log Level: INFO
                Region: global (DC: dc1)
                Server: true
               Version: 1.1.2

==> Nomad agent started! Log data will stream in below:

    2021-08-07T12:55:00.235Z [WARN]  agent.plugin_loader: skipping external plugins since plugin_dir doesn't exist: plugin_dir=C:\ProgramData\nomad\data\plugins
    2021-08-07T12:55:00.289Z [INFO]  agent: detected plugin: name=java type=driver plugin_version=0.1.0
    2021-08-07T12:55:00.289Z [INFO]  agent: detected plugin: name=docker type=driver plugin_version=0.1.0
    2021-08-07T12:55:00.289Z [INFO]  agent: detected plugin: name=raw_exec type=driver plugin_version=0.1.0
    2021-08-07T12:55:00.289Z [INFO]  agent: detected plugin: name=exec type=driver plugin_version=0.1.0
    2021-08-07T12:55:00.289Z [INFO]  agent: detected plugin: name=qemu type=driver plugin_version=0.1.0
    2021-08-07T12:55:00.319Z [INFO]  nomad.raft: initial configuration: index=1 servers="[{Suffrage:Voter ID:1.2.3.4:4647 Address:1.2.3.4:4647}]"
    2021-08-07T12:55:00.319Z [INFO]  nomad.raft: entering follower state: follower="Node at 172.18.0.4:4647 [Follower]" leader=
    2021-08-07T12:55:00.328Z [INFO]  nomad: serf: EventMemberJoin: site-test-1.global 172.18.0.4
    2021-08-07T12:55:00.328Z [INFO]  nomad: starting scheduling worker(s): num_workers=4 schedulers=[service, batch, system, _core]
    2021-08-07T12:55:00.328Z [WARN]  nomad: serf: Failed to re-join any previously known node
    2021-08-07T12:55:00.328Z [INFO]  nomad: adding server: server="site-test-1.global (Addr: 172.18.0.4:4647) (DC: dc1)"
    2021-08-07T12:55:00.328Z [INFO]  client: using state directory: state_dir=C:\ProgramData\nomad\data\client
    2021-08-07T12:55:00.330Z [INFO]  client: using alloc directory: alloc_dir=C:\ProgramData\nomad\data\alloc
    2021-08-07T12:55:01.718Z [WARN]  nomad.raft: heartbeat timeout reached, starting election: last-leader=
    2021-08-07T12:55:01.718Z [INFO]  nomad.raft: entering candidate state: node="Node at 172.18.0.4:4647 [Candidate]" term=1691
    2021-08-07T12:55:03.364Z [WARN]  nomad.raft: Election timeout reached, restarting election
    2021-08-07T12:55:03.365Z [INFO]  nomad.raft: entering candidate state: node="Node at 172.18.0.4:4647 [Candidate]" term=1692
    2021-08-07T12:55:04.572Z [WARN]  nomad.raft: Election timeout reached, restarting election
    2021-08-07T12:55:04.572Z [INFO]  nomad.raft: entering candidate state: node="Node at 172.18.0.4:4647 [Candidate]" term=1693
    2021-08-07T12:55:06.195Z [WARN]  client.fingerprint_mgr.network: couldn't split LinkSpeed output: output=
    2021-08-07T12:55:06.395Z [WARN]  nomad.raft: Election timeout reached, restarting election
    2021-08-07T12:55:06.395Z [INFO]  nomad.raft: entering candidate state: node="Node at 172.18.0.4:4647 [Candidate]" term=1694
    2021-08-07T12:55:08.068Z [WARN]  nomad.raft: Election timeout reached, restarting election
    2021-08-07T12:55:08.068Z [INFO]  nomad.raft: entering candidate state: node="Node at 172.18.0.4:4647 [Candidate]" term=1695
    2021-08-07T12:55:08.199Z [INFO]  client.plugin: starting plugin manager: plugin-type=csi
    2021-08-07T12:55:08.199Z [INFO]  client.plugin: starting plugin manager: plugin-type=driver
    2021-08-07T12:55:08.199Z [INFO]  client.plugin: starting plugin manager: plugin-type=device
    2021-08-07T12:55:08.254Z [WARN]  client.driver_mgr.docker: Docker is configured with Linux containers; switch to Windows Containers: driver=docker
    2021-08-07T12:55:08.254Z [INFO]  client: started client: node_id=1d590162-2d2e-d6ed-33b6-77717049bbe0
    2021-08-07T12:55:09.484Z [WARN]  nomad.raft: Election timeout reached, restarting election
    2021-08-07T12:55:09.485Z [INFO]  nomad.raft: entering candidate state: node="Node at 172.18.0.4:4647 [Candidate]" term=1696
    2021-08-07T12:55:10.453Z [ERROR] worker: failed to dequeue evaluation: error="No cluster leader"
    2021-08-07T12:55:10.460Z [ERROR] worker: failed to dequeue evaluation: error="No cluster leader"
    2021-08-07T12:55:10.632Z [ERROR] worker: failed to dequeue evaluation: error="No cluster leader"
    2021-08-07T12:55:10.645Z [WARN]  nomad.raft: Election timeout reached, restarting election
    2021-08-07T12:55:10.645Z [INFO]  nomad.raft: entering candidate state: node="Node at 172.18.0.4:4647 [Candidate]" term=1697
    2021-08-07T12:55:10.714Z [ERROR] worker: failed to dequeue evaluation: error="No cluster leader"
    2021-08-07T12:55:11.726Z [ERROR] nomad.raft: failed to make requestVote RPC: target="{Voter 1.2.3.4:4647 1.2.3.4:4647}" error="dial tcp 1.2.3.4:4647: i/o timeout"
    2021-08-07T12:55:12.296Z [WARN]  nomad.raft: Election timeout reached, restarting election
    2021-08-07T12:55:12.296Z [INFO]  nomad.raft: entering candidate state: node="Node at 172.18.0.4:4647 [Candidate]" term=1698
    2021-08-07T12:55:13.286Z [ERROR] client.rpc: error performing RPC to server: error="rpc error: No cluster leader" rpc=Node.Register server=172.18.0.4:4647
    2021-08-07T12:55:13.288Z [ERROR] client.rpc: error performing RPC to server, deadline exceeded, cannot retry: error="rpc error: No cluster leader" rpc=Node.Register server=172.18.0.4:4647
    2021-08-07T12:55:13.288Z [ERROR] client: error registering: error="rpc error: No cluster leader"
    2021-08-07T12:55:13.380Z [ERROR] nomad.raft: failed to make requestVote RPC: target="{Voter 1.2.3.4:4647 1.2.3.4:4647}" error="dial tcp 1.2.3.4:4647: i/o timeout"
    2021-08-07T12:55:13.894Z [WARN]  nomad.raft: Election timeout reached, restarting election
    2021-08-07T12:55:13.895Z [INFO]  nomad.raft: entering candidate state: node="Node at 172.18.0.4:4647 [Candidate]" term=1699
    2021-08-07T12:55:14.588Z [ERROR] nomad.raft: failed to make requestVote RPC: target="{Voter 1.2.3.4:4647 1.2.3.4:4647}" error="dial tcp 1.2.3.4:4647: i/o timeout"
    2021-08-07T12:55:15.310Z [WARN]  nomad.raft: Election timeout reached, restarting election
    2021-08-07T12:55:15.311Z [INFO]  nomad.raft: entering candidate state: node="Node at 172.18.0.4:4647 [Candidate]" term=1700
    2021-08-07T12:55:15.554Z [ERROR] worker: failed to dequeue evaluation: error="No cluster leader"
    2021-08-07T12:55:15.812Z [ERROR] worker: failed to dequeue evaluation: error="No cluster leader"
    2021-08-07T12:55:15.814Z [ERROR] worker: failed to dequeue evaluation: error="No cluster leader"
    2021-08-07T12:55:15.819Z [ERROR] worker: failed to dequeue evaluation: error="No cluster leader"

Where is that Voter 1.2.3.4:4647 coming from? It’s very confusing.

If I add the advertise block as follows:

advertise {
  http = "1.2.3.4"
  rpc  = "1.2.3.4"
  serf = "1.2.3.4"
}

The server boots properly, but the client is unable to heartbeat to it:

client: error heartbeating. retrying: error="failed to update status: rpc error: failed to get conn: dial tcp 1.2.3.4:4647: i/o timeout" period=1.724916696s
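
Presumably the client is now dialing the advertised 1.2.3.4 address, which this machine doesn’t own. If an advertise block is needed at all, my guess is it should point at the machine’s own address (172.18.0.4 in my case), something like:

advertise {
  http = "172.18.0.4"
  rpc  = "172.18.0.4"
  serf = "172.18.0.4"
}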

So running nomad agent -dev shows the following config:

       Advertise Addrs: HTTP: 127.0.0.1:4646; RPC: 127.0.0.1:4647; Serf: 127.0.0.1:4648
            Bind Addrs: HTTP: 127.0.0.1:4646; RPC: 127.0.0.1:4647; Serf: 127.0.0.1:4648
                Client: true
             Log Level: DEBUG
                Region: global (DC: dc1)
                Server: true
               Version: 1.1.3

Now, trying to use those values in the config does not work:

bind_addr = "127.0.0.1" # the default

advertise {
  # Defaults to the first private IP address.
  http = "127.0.0.1"
  rpc  = "127.0.0.1"
  serf = "127.0.0.1"
}

server {
  enabled          = true
  bootstrap_expect = 1
}

client {
  enabled       = true
  servers = ["127.0.0.1"]
}
And the log still shows the same 1.2.3.4 voter:

    2021-08-07T15:55:54.663+0200 [INFO]  nomad.raft: initial configuration: index=1 servers="[{Suffrage:Voter ID:1.2.3.4:4647 Address:1.2.3.4:4647}]"
    2021-08-07T15:55:54.663+0200 [INFO]  nomad.raft: entering follower state: follower="Node at 127.0.0.1:4647 [Follower]" leader=
    2021-08-07T15:55:54.664+0200 [INFO]  nomad: serf: EventMemberJoin: EMO-UBUNTU.global 127.0.0.1
    2021-08-07T15:55:54.664+0200 [INFO]  nomad: starting scheduling worker(s): num_workers=16 schedulers=[batch, system, service, _core]
    2021-08-07T15:55:54.665+0200 [WARN]  nomad: serf: Failed to re-join any previously known node
    2021-08-07T15:55:54.665+0200 [INFO]  nomad: adding server: server="EMO-UBUNTU.global (Addr: 127.0.0.1:4647) (DC: dc1)"
    2021-08-07T15:55:54.666+0200 [INFO]  client: using state directory: state_dir=/var/lib/nomad/client
    2021-08-07T15:55:54.666+0200 [INFO]  client: using alloc directory: alloc_dir=/var/lib/nomad/alloc
    2021-08-07T15:55:54.693+0200 [INFO]  client.cpuset_manager: initialized cpuset cgroup manager: parent=/nomad cpuset=0-15
    2021-08-07T15:55:54.722+0200 [INFO]  client.fingerprint_mgr.cgroup: cgroups are available
    2021-08-07T15:55:56.261+0200 [WARN]  nomad.raft: heartbeat timeout reached, starting election: last-leader=
    2021-08-07T15:55:56.261+0200 [INFO]  nomad.raft: entering candidate state: node="Node at 127.0.0.1:4647 [Candidate]" term=5
    2021-08-07T15:55:58.129+0200 [WARN]  nomad.raft: Election timeout reached, restarting election
    2021-08-07T15:55:58.129+0200 [INFO]  nomad.raft: entering candidate state: node="Node at 127.0.0.1:4647 [Candidate]" term=6
    2021-08-07T15:55:59.454+0200 [WARN]  nomad.raft: Election timeout reached, restarting election
    2021-08-07T15:55:59.455+0200 [INFO]  nomad.raft: entering candidate state: node="Node at 127.0.0.1:4647 [Candidate]" term=7
    2021-08-07T15:56:00.545+0200 [WARN]  nomad.raft: Election timeout reached, restarting election
    2021-08-07T15:56:00.545+0200 [INFO]  nomad.raft: entering candidate state: node="Node at 127.0.0.1:4647 [Candidate]" term=8
    2021-08-07T15:56:00.913+0200 [INFO]  client.plugin: starting plugin manager: plugin-type=csi
    2021-08-07T15:56:00.913+0200 [INFO]  client.plugin: starting plugin manager: plugin-type=driver
    2021-08-07T15:56:00.913+0200 [INFO]  client.plugin: starting plugin manager: plugin-type=device
    2021-08-07T15:56:00.939+0200 [INFO]  client: started client: node_id=af6b0316-4a4d-cf82-2edc-ac94dc70f149
    2021-08-07T15:56:01.619+0200 [WARN]  nomad.raft: Election timeout reached, restarting election
    2021-08-07T15:56:01.619+0200 [INFO]  nomad.raft: entering candidate state: node="Node at 127.0.0.1:4647 [Candidate]" term=9
    2021-08-07T15:56:02.701+0200 [WARN]  nomad.raft: Election timeout reached, restarting election
    2021-08-07T15:56:02.701+0200 [INFO]  nomad.raft: entering candidate state: node="Node at 127.0.0.1:4647 [Candidate]" term=10
    2021-08-07T15:56:04.197+0200 [WARN]  nomad.raft: Election timeout reached, restarting election
    2021-08-07T15:56:04.197+0200 [INFO]  nomad.raft: entering candidate state: node="Node at 127.0.0.1:4647 [Candidate]" term=11
    2021-08-07T15:56:04.772+0200 [ERROR] worker: failed to dequeue evaluation: error="No cluster leader"
    2021-08-07T15:56:04.828+0200 [ERROR] worker: failed to dequeue evaluation: error="No cluster leader"
    2021-08-07T15:56:04.860+0200 [ERROR] worker: failed to dequeue evaluation: error="No cluster leader"

Again, where’s that 1.2.3.4 coming from???

The fact that the IP 1.2.3.4 was added to the config itself may be what caused the problem.

Q: did the IP of the machine (172.18.0.4) ever change, i.e. was it different before?

A couple of things.

Could you add the server_join syntax? (I ask because you haven’t mentioned whether you also have Consul running.)
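
For a single node that only needs to join itself, that stanza would look something like this (just a sketch, not specific to your setup):

server {
  enabled          = true
  bootstrap_expect = 1

  server_join {
    # Join itself over loopback; default Serf port 4648 is used when no port is given.
    retry_join = ["127.0.0.1"]
  }
}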

Another, more aggressive approach would be to make sure no Nomad processes are running, delete the data_dir, and try starting again.

So you’re suggesting this might be a bug in Nomad itself. I thought I was doing something wrong.

No, the IP of the machine never changed.

Why do I need to add server_join if I only need a single machine to act as a server and a client?
No Consul needed in this case.

Going to try deleting the data_dir now

Okay, deleting the data_dir seems to have fixed it.

I still get an error every 30 seconds.

[ERROR] nomad.rpc: unrecognized RPC byte: byte=9

But the UI shows both the server and client healthy.

Btw, any change with the latest version, 1.1.3?

Also, what are the OS specifics? That may help the HashiCorp folks debug this better.

Hi @dg-eparizzi; I am glad you managed to solve your issue. Reading through the comments with @shantanugadgil (thanks for your help <3), it seems that you had stale address data within the data directory from a previous run of the Nomad agent. When an agent starts up, it reads the data dir for persisted state, which includes previously known server addresses, and attempts to connect to them.
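
For anyone who hits this later: after stopping the agent and clearing the stale state from the data_dir, a minimal single-node server-plus-client config along these lines should be able to elect itself leader (a sketch only; adjust the paths and the advertised address to your machine):

data_dir = "C:\\ProgramData\\nomad\\data\\"

bind_addr = "0.0.0.0"

# Advertise an address this node actually owns (its LAN IP),
# not a placeholder such as 1.2.3.4.
advertise {
  http = "172.18.0.4"
  rpc  = "172.18.0.4"
  serf = "172.18.0.4"
}

server {
  enabled          = true
  bootstrap_expect = 1
}

client {
  enabled = true
}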

Thanks,
jrasell and the Nomad team


I just had a very similar problem, and it took me quite some time to find the solution. I had copied bootstrap_expect = 3 from somewhere, but for a single-server installation that line should be removed or the 3 changed to a 1.
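
In other words, for a single server the stanza should read something like:

server {
  enabled = true

  # Must match the number of servers that will actually join;
  # with bootstrap_expect = 3, a lone server waits forever for two
  # more peers and keeps restarting elections.
  bootstrap_expect = 1
}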