Node heartbeat missed

Hello, some time ago periodic issues started occurring on our Consul/Nomad cluster that lead to job restarts, which in our case is undesirable. After reviewing the cluster and its logs, we found that the cluster clients temporarily lose their connection and then reconnect after a short period of time, but this causes the jobs in the cluster to restart.

Our hypothesis is that this is related to the client reporting the status “Node heartbeat missed”; after a short time it returns to normal (“Node reregistered by heartbeat”). Logs from the client that lost connection:

[ERROR] consul.sync: still unable to update services in Consul: failures=10 error="failed to query Consul services: Get \"http://127.0.0.1:8500/v1/agent/services\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"
[WARN]  client: missed heartbeat: req_latency=1m2.564541959s heartbeat_ttl=15.037957473s since_last_heartbeat=1m17.602999075s
[WARN]  consul.sync: failed to update services in Consul: error="failed to query Consul services: Get \"http://127.0.0.1:8500/v1/agent/services\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"
[ERROR] client.rpc: error performing RPC to server: error="rpc error: EOF" rpc=Node.GetClientAllocs server=*masked*:4647
[ERROR] client: yamux: keepalive failed: i/o deadline reached
[ERROR] client: error discovering nomad servers: error="client.consul: unable to query Consul datacenters: Get \"http://127.0.0.1:8500/v1/catalog/datacenters\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"
[ERROR] client: error heartbeating. retrying: error="failed to update status: rpc error: EOF" period=1.903301363s
[ERROR] client.rpc: error performing RPC to server which is not safe to automatically retry: error="rpc error: EOF" rpc=Node.UpdateStatus server=*masked*:4647
[ERROR] client.rpc: error performing RPC to server: error="rpc error: EOF" rpc=Node.UpdateStatus server=*masked*:4647
[ERROR] client.rpc: error performing RPC to server: error="rpc error: EOF" rpc=Node.GetClientAllocs server=*masked*:4647
[ERROR] client: yamux: keepalive failed: i/o deadline reached
[WARN]  consul.sync: failed to update services in Consul: error="failed to query Consul services: Get \"http://127.0.0.1:8500/v1/agent/services\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"
[ERROR] client.rpc: error performing RPC to server: error="rpc error: EOF" rpc=Node.GetClientAllocs server=*masked*:4647
[ERROR] client: error heartbeating. retrying: error="failed to update status: rpc error: EOF" period=1.357147952s
[ERROR] client.rpc: error performing RPC to server which is not safe to automatically retry: error="rpc error: EOF" rpc=Node.UpdateStatus server=*masked*:4647
[ERROR] client.rpc: error performing RPC to server: error="rpc error: EOF" rpc=Node.UpdateStatus server=*masked*:4647
[ERROR] client: yamux: keepalive failed: i/o deadline reached

Nomad version: v1.7.6

Hi. So what is the issue? How do you want to solve it?

There was no connection to the node, so the jobs running there were restarted. Bottom line: your nodes seem to have connection issues, which is what you should consider fixing.

You can increase the heartbeat timeout and the grace period before a node is assumed dead, or adjust the restart or reschedule blocks of your jobs, for example as sketched below. What have you tried?
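A sketch of what tuning those blocks could look like at the group level (the job, group, and task names and all values here are illustrative, not tested recommendations; tune them to how long your nodes actually drop out):

```hcl
job "example" {
  group "app" {
    # Retry the task in place instead of failing the allocation outright
    # if it dies during a brief disconnect.
    restart {
      attempts = 3
      interval = "10m"
      delay    = "30s"
      mode     = "delay"
    }

    # Back off rescheduling so a short outage does not immediately move
    # the allocation to another node.
    reschedule {
      delay          = "1m"
      delay_function = "exponential"
      max_delay      = "10m"
      unlimited      = true
    }

    task "web" {
      driver = "docker"
      config {
        image = "nginx:alpine"
      }
    }
  }
}
```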

First, we tried increasing heartbeat_grace, but that didn’t help. As for network issues: the cluster runs in AWS, within a single VPC, so it’s unlikely that AWS itself is causing this. It’s possible that the nodes are under heavy load, which might be causing the issues; we’re still investigating. I thought maybe someone had encountered a similar problem, so I decided to ask.

Hi. Yes, under high load heartbeats get missed, because Nomad isn’t getting any CPU time. I ended up with heartbeat_grace = “5m”.
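For reference, heartbeat_grace lives in the server stanza of the Nomad server agent configuration. A sketch, where the file path and the min_heartbeat_ttl value are illustrative additions:

```hcl
# e.g. /etc/nomad.d/server.hcl (path is illustrative). This applies to the
# Nomad servers, not the clients, and needs a server agent restart.
server {
  enabled = true

  # Extra grace before a node that misses its heartbeat is marked down.
  heartbeat_grace = "5m"

  # Optionally raise the minimum heartbeat TTL handed out to clients
  # (illustrative value; the logs above show a ~15s TTL).
  min_heartbeat_ttl = "30s"
}
```

Since it is server-side, it has to be set on all the Nomad servers, not on the affected clients.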

Ok, thanks, we will try it. Maybe it will help in our case.