Raft using wrong network interface?

I have a setup with one Nomad server and one client, connected over a VPC on 10.10.10.0/24. A Consul server and client also run on the same machines.

The Nomad client is failing to connect to the server because, for some reason, it tries to connect over the 10.18.0.8 network.

Here’s the Nomad server’s initial output:

==> Nomad agent configuration:

       Advertise Addrs: HTTP: 0.0.0.0:4646; RPC: 10.10.10.2:4647; Serf: 10.10.10.2:4648
            Bind Addrs: HTTP: [0.0.0.0:4646]; RPC: 10.10.10.2:4647; Serf: 10.10.10.2:4648
                Client: false
             Log Level: INFO
                Region: global (DC: dc1)
                Server: true
               Version: 1.4.3

==> Nomad agent started! Log data will stream in below:

    2022-12-15T07:11:56.912Z [WARN]  agent.plugin_loader: skipping external plugins since plugin_dir doesn't exist: plugin_dir=/opt/nomad/plugins
    2022-12-15T07:11:56.913Z [INFO]  agent: detected plugin: name=exec type=driver plugin_version=0.1.0
    2022-12-15T07:11:56.913Z [INFO]  agent: detected plugin: name=qemu type=driver plugin_version=0.1.0
    2022-12-15T07:11:56.913Z [INFO]  agent: detected plugin: name=java type=driver plugin_version=0.1.0
    2022-12-15T07:11:56.913Z [INFO]  agent: detected plugin: name=docker type=driver plugin_version=0.1.0
    2022-12-15T07:11:56.913Z [INFO]  agent: detected plugin: name=raw_exec type=driver plugin_version=0.1.0
    2022-12-15T07:11:56.924Z [INFO]  nomad: setting up raft bolt store: no_freelist_sync=false
    2022-12-15T07:11:56.926Z [INFO]  nomad.raft: initial configuration: index=1 servers="[{Suffrage:Voter ID:5fe02441-9379-3ad9-94b2-d95980990db3 Address:10.18.0.8:4647}]"

For some reason Raft is using that network, even though it isn’t configured anywhere.
Any idea how to fix it?

Hi @ilibar-zpt,

Only the Nomad servers connect and share state via Raft, and the log output you attached indicates that the server has found configuration for another server to connect with. Is it possible that the Nomad server data directory contains stale data from a previous configuration, and that the server is using it to find other servers to talk to? If so, I would suggest removing the entire Nomad server data directory and starting the agent again.
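
A quick way to check for this before deleting anything is to compare the addresses in the `nomad.raft: initial configuration` log line with the RPC address the agent currently advertises. The following is only a minimal sketch: the helper names `raft_addresses` and `stale_peers` are hypothetical, and it assumes the log line keeps the `Address:host:port` shape shown in the output above.

```python
import re


def raft_addresses(log_line: str) -> list:
    """Return every host:port that follows 'Address:' in a raft configuration log line."""
    # Stop at whitespace or the closing '}' / ']' of the servers list.
    return re.findall(r"Address:([^\s}\]]+)", log_line)


def stale_peers(log_line: str, advertise_rpc: str) -> list:
    """Return raft peer addresses that differ from the currently advertised RPC address."""
    return [addr for addr in raft_addresses(log_line) if addr != advertise_rpc]


# Log line and advertise address copied from the server output in this thread.
line = (
    'nomad.raft: initial configuration: index=1 '
    'servers="[{Suffrage:Voter ID:5fe02441-9379-3ad9-94b2-d95980990db3 '
    'Address:10.18.0.8:4647}]"'
)
print(stale_peers(line, "10.10.10.2:4647"))
```

A non-empty result would suggest that the Raft state on disk was written under a different address than the one the agent advertises now, which is consistent with stale data in the data directory.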

“Nomad client is failing to connect to server”

The log output you have included comes only from the Nomad server. Do you have logs from the client that show a failure to connect to the server? If you’re able to share the configuration for both the server and the client, that would also help identify any potential problems.

Thanks,
jrasell and the Nomad team

@jrasell thanks for the reply.
Here’s a sample of the client’s log:

==> Nomad agent configuration:

       Advertise Addrs: HTTP: 10.10.10.3:4646
            Bind Addrs: HTTP: [10.10.10.3:4646]
                Client: true
             Log Level: INFO
                Region: global (DC: dc1)
                Server: false
               Version: 1.4.3

==> Nomad agent started! Log data will stream in below:

    2022-12-15T07:10:57.040Z [WARN]  agent.plugin_loader: skipping external plugins since plugin_dir doesn't exist: plugin_dir=/opt/nomad/plugins
    2022-12-15T07:10:57.042Z [INFO]  agent: detected plugin: name=exec type=driver plugin_version=0.1.0
    2022-12-15T07:10:57.042Z [INFO]  agent: detected plugin: name=qemu type=driver plugin_version=0.1.0
    2022-12-15T07:10:57.042Z [INFO]  agent: detected plugin: name=java type=driver plugin_version=0.1.0
    2022-12-15T07:10:57.042Z [INFO]  agent: detected plugin: name=docker type=driver plugin_version=0.1.0
    2022-12-15T07:10:57.042Z [INFO]  agent: detected plugin: name=raw_exec type=driver plugin_version=0.1.0
    2022-12-15T07:10:57.043Z [INFO]  client: using state directory: state_dir=/opt/nomad/client
    2022-12-15T07:10:57.043Z [INFO]  client: using alloc directory: alloc_dir=/opt/nomad/alloc
    2022-12-15T07:10:57.043Z [INFO]  client: using dynamic ports: min=20000 max=32000 reserved=""
    2022-12-15T07:10:57.053Z [INFO]  client.fingerprint_mgr.cgroup: cgroups are available
    2022-12-15T07:10:57.057Z [INFO]  client.fingerprint_mgr.consul: consul agent is available
    2022-12-15T07:10:57.061Z [WARN]  client.fingerprint_mgr.network: unable to parse speed: path=/usr/sbin/ethtool device=eth0
    2022-12-15T07:10:57.062Z [WARN]  client.fingerprint_mgr.network: unable to parse speed: path=/usr/sbin/ethtool device=lo
    2022-12-15T07:10:57.066Z [WARN]  client.fingerprint_mgr.network: unable to parse speed: path=/usr/sbin/ethtool device=eth0
    2022-12-15T07:10:57.072Z [WARN]  client.fingerprint_mgr.network: unable to parse speed: path=/usr/sbin/ethtool device=eth1
    2022-12-15T07:10:57.114Z [WARN]  client.fingerprint_mgr.env_digitalocean: failed to read attribute: attribute=private-ipv6 error="error reading attribute interfaces/private/0/ipv6/address. digitalocean metadata api returned an error: resp_code: 404, resp_body: not found"
    2022-12-15T07:10:57.124Z [WARN]  client.fingerprint_mgr.env_digitalocean: failed to read attribute: attribute=public-ipv6 error="error reading attribute interfaces/public/0/ipv6/address. digitalocean metadata api returned an error: resp_code: 404, resp_body: not found"
    2022-12-15T07:10:57.202Z [INFO]  client.plugin: starting plugin manager: plugin-type=csi
    2022-12-15T07:10:57.202Z [INFO]  client.plugin: starting plugin manager: plugin-type=driver
    2022-12-15T07:10:57.202Z [INFO]  client.plugin: starting plugin manager: plugin-type=device
    2022-12-15T07:10:57.204Z [INFO]  client: started client: node_id=f38ca24e-f289-f8f0-2001-8baf4a5cb27f
    2022-12-15T07:10:57.204Z [WARN]  client.server_mgr: no servers available
    2022-12-15T07:10:57.204Z [WARN]  client.server_mgr: no servers available
    2022-12-15T07:10:57.212Z [INFO]  client.consul: discovered following servers: servers=[10.18.0.8:4647]
    2022-12-15T07:11:00.261Z [ERROR] client.rpc: error performing RPC to server: error="rpc error: failed to get conn: rpc error: lead thread didn't get connection" rpc=Node.GetClientAllocs server=10.18.0.8:4647
    2022-12-15T07:11:00.261Z [ERROR] client.rpc: error performing RPC to server which is not safe to automatically retry: error="rpc error: failed to get conn: rpc error: lead thread didn't get connection" rpc=Node.GetClientAllocs server=10.18.0.8:4647

I just cannot understand how it keeps advertising 10.18.0.8 when that isn’t even the default and was never configured for Nomad. It feels like it picks a network interface on its own, but I don’t know.
Here are my routes:

# ip r
default via 206.189.0.1 dev eth0 proto static 
10.10.10.0/24 dev eth1 proto kernel scope link src 10.10.10.2 
10.18.0.0/16 dev eth0 proto kernel scope link src 10.18.0.8 
206.189.0.0/20 dev eth0 proto kernel scope link src 206.189.8.128 

It seems it was stale data. I’ll update if that resolves it.
