Node heartbeat missed

Hello, some time ago periodic issues started occurring on our Consul/Nomad cluster that lead to job restarts, which in our case is undesirable. After reviewing the cluster and its logs, we found that the cluster clients temporarily lose their connection and then reconnect after a short period of time, but this causes the jobs in the cluster to restart.

Our hypothesis is that this is related to the client reporting the status “Node heartbeat missed”; after a short time it returns to normal (“Node reregistered by heartbeat”). Logs from the client that lost connection:

[ERROR] consul.sync: still unable to update services in Consul: failures=10 error="failed to query Consul services: Get \"http://127.0.0.1:8500/v1/agent/services\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"
[WARN]  client: missed heartbeat: req_latency=1m2.564541959s heartbeat_ttl=15.037957473s since_last_heartbeat=1m17.602999075s
[WARN]  consul.sync: failed to update services in Consul: error="failed to query Consul services: Get \"http://127.0.0.1:8500/v1/agent/services\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"
[ERROR] client.rpc: error performing RPC to server: error="rpc error: EOF" rpc=Node.GetClientAllocs server=*masked*:4647
[ERROR] client: yamux: keepalive failed: i/o deadline reached
[ERROR] client: error discovering nomad servers: error="client.consul: unable to query Consul datacenters: Get \"http://127.0.0.1:8500/v1/catalog/datacenters\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"
[ERROR] client: error heartbeating. retrying: error="failed to update status: rpc error: EOF" period=1.903301363s
[ERROR] client.rpc: error performing RPC to server which is not safe to automatically retry: error="rpc error: EOF" rpc=Node.UpdateStatus server=*masked*:4647
[ERROR] client.rpc: error performing RPC to server: error="rpc error: EOF" rpc=Node.UpdateStatus server=*masked*:4647
[ERROR] client.rpc: error performing RPC to server: error="rpc error: EOF" rpc=Node.GetClientAllocs server=*masked*:4647
[ERROR] client: yamux: keepalive failed: i/o deadline reached
[WARN]  consul.sync: failed to update services in Consul: error="failed to query Consul services: Get \"http://127.0.0.1:8500/v1/agent/services\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"
[ERROR] client.rpc: error performing RPC to server: error="rpc error: EOF" rpc=Node.GetClientAllocs server=*masked*:4647
[ERROR] client: error heartbeating. retrying: error="failed to update status: rpc error: EOF" period=1.357147952s
[ERROR] client.rpc: error performing RPC to server which is not safe to automatically retry: error="rpc error: EOF" rpc=Node.UpdateStatus server=*masked*:4647
[ERROR] client.rpc: error performing RPC to server: error="rpc error: EOF" rpc=Node.UpdateStatus server=*masked*:4647
[ERROR] client: yamux: keepalive failed: i/o deadline reached

Nomad version: v1.7.6

Hi. So what is the issue? How do you want to solve it?

There was no connection to the node, so the jobs running there were restarted. Bottom line: your nodes seem to have connection issues, which is what you should consider fixing.

You can increase the heartbeat timeout and the grace period before a node is assumed dead, or adjust the restart or reschedule blocks of your jobs, for example as sketched below. What have you tried?
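A sketch of what tuning those blocks could look like at the group level (the job, group, and task names and all values here are illustrative, not tested recommendations; tune them to how long your nodes actually drop out):

```hcl
job "example" {
  group "app" {
    # Retry the task in place instead of failing the allocation outright
    # if it dies during a brief disconnect.
    restart {
      attempts = 3
      interval = "10m"
      delay    = "30s"
      mode     = "delay"
    }

    # Back off rescheduling so a short outage does not immediately move
    # the allocation to another node.
    reschedule {
      delay          = "1m"
      delay_function = "exponential"
      max_delay      = "10m"
      unlimited      = true
    }

    task "web" {
      driver = "docker"
      config {
        image = "nginx:alpine"
      }
    }
  }
}
```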

First, we tried increasing heartbeat_grace, but that didn’t help. As for network issues: the cluster runs in AWS, within a single VPC, so it’s unlikely that AWS itself is causing this. It’s possible that the nodes are under heavy load, which might be causing the issues; we’re still investigating. I thought maybe someone had encountered a similar problem, so I decided to ask.

Hi. Yes, under high load heartbeats get missed, because Nomad isn’t getting any CPU time. I ended up with heartbeat_grace = “5m”.
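For reference, heartbeat_grace lives in the server stanza of the Nomad server agent configuration. A sketch, where the file path and the min_heartbeat_ttl value are illustrative additions:

```hcl
# e.g. /etc/nomad.d/server.hcl (path is illustrative). This applies to the
# Nomad servers, not the clients, and needs a server agent restart.
server {
  enabled = true

  # Extra grace before a node that misses its heartbeat is marked down.
  heartbeat_grace = "5m"

  # Optionally raise the minimum heartbeat TTL handed out to clients
  # (illustrative value; the logs above show a ~15s TTL).
  min_heartbeat_ttl = "30s"
}
```

Since it is server-side, it has to be set on all the Nomad servers, not on the affected clients.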

Ok, thanks, we will try it. Maybe it will help in our case.