Nomad client heartbeat failing after successful registration

Hi there,

I'm having a problem connecting Nomad clients to a Nomad server.

My setup: I have a single Nomad server with its services running on the machine's public IP. In the same subnet there are two other machines I want to use as Nomad clients, and both of them should expose their services on their respective public IPs as well.

According to the Nomad server logs, the server is running; I can issue commands on the CLI and access the web interface. When I start the client services, they connect to the server, and according to the logs and the Nomad web UI they are recognized correctly and able to communicate. But after maybe 20 seconds the clients become unavailable. The client logs show that the RPC heartbeat is failing because the clients try to reach the server on the local Docker interface instead of the server's public IP. I was not able to influence this address, and I am quite confused why the heartbeat fails after the initial connection to the server succeeded.

After that, the clients are shown as 'down' on the server.

Just for the record, the Nomad setup is connected to a Consul cluster; all of the Nomad machines have a Consul client running and connected to the Consul server. Judging from the Consul web UI, this seems to work flawlessly.

Server configuration:

datacenter = "dc1"
data_dir = "/opt/nomad/data"
bind_addr = "0.0.0.0"

server {
  enabled = true
  bootstrap_expect = 1
}

plugin "docker" {
  config {
    allow_privileged = true
  }
}

Client configuration:

datacenter = "dc1"
data_dir = "/opt/nomad/data"
bind_addr = "<public ip of the client>"

client {
  enabled = true
  servers = ["<public ip of the server>:4647"]

  server_join {
    retry_join = [ "<public ip of the server>:4647" ]
    retry_interval = "5s"
  }
}

Client logs:

==> Loaded configuration from /etc/nomad.d/docker.hcl, /etc/nomad.d/nomad.hcl
==> Starting Nomad agent...
==> Nomad agent configuration:
       Advertise Addrs: HTTP: <client public ip>:4646
            Bind Addrs: HTTP: [<client public ip>:4646]
                Client: true
             Log Level: INFO
                Region: global (DC: MUC1)
                Server: false
               Version: 1.2.6
==> Nomad agent started! Log data will stream in below:
    2022-03-29T10:00:44.830+0200 [WARN]  agent.plugin_loader: skipping external plugins since plugin_dir doesn't exist: plugin_dir=/opt/nomad/data/plugins
    2022-03-29T10:00:44.833+0200 [INFO]  agent: detected plugin: name=qemu type=driver plugin_version=0.1.0
    2022-03-29T10:00:44.833+0200 [INFO]  agent: detected plugin: name=java type=driver plugin_version=0.1.0
    2022-03-29T10:00:44.833+0200 [INFO]  agent: detected plugin: name=docker type=driver plugin_version=0.1.0
    2022-03-29T10:00:44.833+0200 [INFO]  agent: detected plugin: name=raw_exec type=driver plugin_version=0.1.0
    2022-03-29T10:00:44.833+0200 [INFO]  agent: detected plugin: name=exec type=driver plugin_version=0.1.0
    2022-03-29T10:00:44.833+0200 [INFO]  client: using state directory: state_dir=/opt/nomad/data/client
    2022-03-29T10:00:44.833+0200 [INFO]  client: using alloc directory: alloc_dir=/opt/nomad/data/alloc
    2022-03-29T10:00:44.833+0200 [INFO]  client: using dynamic ports: min=20000 max=32000 reserved=""
    2022-03-29T10:00:44.833+0200 [WARN]  client: could not initialize cpuset cgroup subsystem, cpuset management disabled: error="not implemented for cgroup v2 unified hierarchy"
    2022-03-29T10:00:44.911+0200 [INFO]  client.fingerprint_mgr.cgroup: cgroups are available
    2022-03-29T10:00:44.915+0200 [INFO]  client.fingerprint_mgr.consul: consul agent is available
    2022-03-29T10:00:44.916+0200 [WARN]  client.fingerprint_mgr.cpu: failed to detect set of reservable cores: error="not implemented for cgroup v2 unified hierarchy"
    2022-03-29T10:00:44.978+0200 [WARN]  client.fingerprint_mgr.network: unable to parse speed: path=/usr/sbin/ethtool device=ens1
    2022-03-29T10:00:44.980+0200 [WARN]  client.fingerprint_mgr.network: unable to parse speed: path=/usr/sbin/ethtool device=lo
    2022-03-29T10:00:44.989+0200 [WARN]  client.fingerprint_mgr.network: unable to parse speed: path=/usr/sbin/ethtool device=ens1
    2022-03-29T10:00:44.997+0200 [WARN]  client.fingerprint_mgr.network: unable to parse speed: path=/usr/sbin/ethtool device=docker0
    2022-03-29T10:00:45.034+0200 [INFO]  client.plugin: starting plugin manager: plugin-type=csi
    2022-03-29T10:00:45.034+0200 [INFO]  client.plugin: starting plugin manager: plugin-type=driver
    2022-03-29T10:00:45.034+0200 [INFO]  client.plugin: starting plugin manager: plugin-type=device
    2022-03-29T10:00:45.074+0200 [INFO]  client: started client: node_id=a0ea65b9-ac63-1bfe-529b-ba8237134123
    2022-03-29T10:00:45.078+0200 [INFO]  client: node registration complete
    2022-03-29T10:00:45.085+0200 [INFO]  agent.joiner: starting retry join: servers=<server public ip>:4647
    2022-03-29T10:00:45.086+0200 [INFO]  agent.joiner: retry join completed: initial_servers=1 agent_mode=client
    2022-03-29T10:00:53.307+0200 [ERROR] client.rpc: error performing RPC to server: error="rpc error: failed to get conn: dial tcp 172.17.0.1:4647: connect: connection refused" rpc=Node.Register server=172.17.0.1:4647
    2022-03-29T10:00:53.307+0200 [ERROR] client.rpc: error performing RPC to server which is not safe to automatically retry: error="rpc error: failed to get conn: dial tcp 172.17.0.1:4647: connect: connection refused" rpc=Node.Register server=172.17.0.1:4647
    2022-03-29T10:00:53.307+0200 [ERROR] client: error registering: error="rpc error: failed to get conn: dial tcp 172.17.0.1:4647: connect: connection refused"
    2022-03-29T10:01:09.574+0200 [ERROR] client.rpc: error performing RPC to server: error="rpc error: failed to get conn: dial tcp 172.17.0.1:4647: connect: connection refused" rpc=Node.UpdateStatus server=172.17.0.1:4647
    2022-03-29T10:01:09.574+0200 [ERROR] client.rpc: error performing RPC to server which is not safe to automatically retry: error="rpc error: failed to get conn: dial tcp 172.17.0.1:4647: connect: connection refused" rpc=Node.UpdateStatus server=172.17.0.1:4647
    2022-03-29T10:01:09.574+0200 [ERROR] client: error heartbeating. retrying: error="failed to update status: rpc error: failed to get conn: dial tcp 172.17.0.1:4647: connect: connection refused" period=1.900742498s
    2022-03-29T10:01:09.578+0200 [ERROR] client: error discovering nomad servers:
  error=
  | 1 error occurred:
  |         * rpc error: failed to get conn: dial tcp 172.17.0.1:4647: connect: connection refused
  |
  
    2022-03-29T10:01:10.584+0200 [ERROR] client.rpc: error performing RPC to server: error="rpc error: failed to get conn: dial tcp 172.17.0.1:4647: connect: connection refused" rpc=Node.Register server=172.17.0.1:4647
    2022-03-29T10:01:10.584+0200 [ERROR] client.rpc: error performing RPC to server which is not safe to automatically retry: error="rpc error: failed to get conn: dial tcp 172.17.0.1:4647: connect: connection refused" rpc=Node.Register server=172.17.0.1:4647
    2022-03-29T10:01:10.584+0200 [ERROR] client: error registering: error="rpc error: failed to get conn: dial tcp 172.17.0.1:4647: connect: connection refused"
    2022-03-29T10:01:11.476+0200 [ERROR] client.rpc: error performing RPC to server: error="rpc error: failed to get conn: dial tcp 172.17.0.1:4647: connect: connection refused" rpc=Node.UpdateStatus server=172.17.0.1:4647
    2022-03-29T10:01:11.476+0200 [ERROR] client.rpc: error performing RPC to server which is not safe to automatically retry: error="rpc error: failed to get conn: dial tcp 172.17.0.1:4647: connect: connection refused" rpc=Node.UpdateStatus server=172.17.0.1:4647
    2022-03-29T10:01:11.476+0200 [ERROR] client: error heartbeating. retrying: error="failed to update status: rpc error: failed to get conn: dial tcp 172.17.0.1:4647: connect: connection refused" period=1.664362838s
    2022-03-29T10:01:11.480+0200 [ERROR] client: error discovering nomad servers:
  error=
  | 1 error occurred:
  |         * rpc error: failed to get conn: dial tcp 172.17.0.1:4647: connect: connection refused
  |

Thanks in advance

Did you ever figure this out? What was the resolution? From the logs it looks like your client config points at the server's public IP on port 4647, but the client ends up trying to reach the server on the Docker bridge address (172.17.0.1:4647) instead.

Hi, I have a similar problem. I was just testing a cluster with two Windows machines:

  1. Server + client on a machine with a public IP
  2. Client on a machine without a public IP

The solution for me was to edit the nomad.hcl config on the server machine with the public IP and add an advertise block with the server's public IP:

advertise {
  http = "<public ip of the server>"
  rpc  = "<public ip of the server>"
}

(replace with the public IP of your server)
Maybe this helps.

P.S. Don't forget to open ports 4646-4648 in the firewall.
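
For reference, here is roughly what the full server nomad.hcl could look like with the advertise block added. The serf line and the port comments are my own addition (serf follows the same pattern and uses 4648 by default); treat this as a sketch and adapt it to your setup:

datacenter = "dc1"
data_dir   = "/opt/nomad/data"
bind_addr  = "0.0.0.0"

# Advertise the server's public IP so clients heartbeat to it
# instead of a local interface such as docker0
advertise {
  http = "<public ip of the server>"   # default port 4646
  rpc  = "<public ip of the server>"   # default port 4647
  serf = "<public ip of the server>"   # default port 4648
}

server {
  enabled          = true
  bootstrap_expect = 1
}

plugin "docker" {
  config {
    allow_privileged = true
  }
}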
