Hello. Some time ago, we started seeing intermittent problems on our Consul/Nomad cluster that cause jobs to restart, which is undesirable in our case. After reviewing the cluster and its logs, we found that the cluster clients temporarily lose their connection and reconnect shortly afterwards, and this is what causes the jobs in the cluster to restart.
Our hypothesis is that this is related to the client node status changing to “Node heartbeat missed” and, a short time later, returning to normal (“Node reregistered by heartbeat”). Logs from a client that lost its connection:
[ERROR] consul.sync: still unable to update services in Consul: failures=10 error="failed to query Consul services: Get \"http://127.0.0.1:8500/v1/agent/services\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"
[WARN] client: missed heartbeat: req_latency=1m2.564541959s heartbeat_ttl=15.037957473s since_last_heartbeat=1m17.602999075s
[WARN] consul.sync: failed to update services in Consul: error="failed to query Consul services: Get \"http://127.0.0.1:8500/v1/agent/services\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"
[ERROR] client.rpc: error performing RPC to server: error="rpc error: EOF" rpc=Node.GetClientAllocs server=*masked*:4647
[ERROR] client: yamux: keepalive failed: i/o deadline reached
[ERROR] client: error discovering nomad servers: error="client.consul: unable to query Consul datacenters: Get \"http://127.0.0.1:8500/v1/catalog/datacenters\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"
[ERROR] client: error heartbeating. retrying: error="failed to update status: rpc error: EOF" period=1.903301363s
[ERROR] client.rpc: error performing RPC to server which is not safe to automatically retry: error="rpc error: EOF" rpc=Node.UpdateStatus server=*masked*:4647
[ERROR] client.rpc: error performing RPC to server: error="rpc error: EOF" rpc=Node.UpdateStatus server=*masked*:4647
[ERROR] client.rpc: error performing RPC to server: error="rpc error: EOF" rpc=Node.GetClientAllocs server=*masked*:4647
[ERROR] client: yamux: keepalive failed: i/o deadline reached
[WARN] consul.sync: failed to update services in Consul: error="failed to query Consul services: Get \"http://127.0.0.1:8500/v1/agent/services\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"
[ERROR] client.rpc: error performing RPC to server: error="rpc error: EOF" rpc=Node.GetClientAllocs server=*masked*:4647
[ERROR] client: error heartbeating. retrying: error="failed to update status: rpc error: EOF" period=1.357147952s
[ERROR] client.rpc: error performing RPC to server which is not safe to automatically retry: error="rpc error: EOF" rpc=Node.UpdateStatus server=*masked*:4647
[ERROR] client.rpc: error performing RPC to server: error="rpc error: EOF" rpc=Node.UpdateStatus server=*masked*:4647
[ERROR] client: yamux: keepalive failed: i/o deadline reached
Nomad version: v1.7.6
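
To make the timing in the "missed heartbeat" warning easier to compare across clients, here is a rough sketch of a helper that pulls the three duration fields out of such a line and works out how far past the TTL the heartbeat was. The field names (req_latency, heartbeat_ttl, since_last_heartbeat) come from the log above; the function names parse_go_duration and heartbeat_timings are made up for this example and are not part of Nomad. It assumes the durations are in Go's usual format (for example 1m2.5s).

```python
import re

# One unit-suffixed component of a Go-style duration, e.g. "1m" or "2.5s".
# "ms" is listed before "m" and "s" so that milliseconds are matched whole.
_COMPONENT = re.compile(r"(\d+(?:\.\d+)?)(ns|us|µs|ms|s|m|h)")
_UNIT_SECONDS = {
    "ns": 1e-9, "us": 1e-6, "µs": 1e-6, "ms": 1e-3,
    "s": 1.0, "m": 60.0, "h": 3600.0,
}

# The three timing fields that appear in the "missed heartbeat" warning.
_FIELD = re.compile(r"(req_latency|heartbeat_ttl|since_last_heartbeat)=(\S+)")


def parse_go_duration(text):
    """Convert a Go-style duration string such as '1m2.5s' to seconds."""
    pos = 0
    total = 0.0
    for match in _COMPONENT.finditer(text):
        if match.start() != pos:
            raise ValueError(f"unexpected text in duration: {text!r}")
        total += float(match.group(1)) * _UNIT_SECONDS[match.group(2)]
        pos = match.end()
    if pos == 0 or pos != len(text):
        raise ValueError(f"not a duration: {text!r}")
    return total


def heartbeat_timings(line):
    """Return the timing fields of a 'missed heartbeat' line, in seconds.

    Adds 'overdue_by': how long after the TTL expired the heartbeat arrived.
    """
    fields = {name: parse_go_duration(value) for name, value in _FIELD.findall(line)}
    fields["overdue_by"] = fields["since_last_heartbeat"] - fields["heartbeat_ttl"]
    return fields


if __name__ == "__main__":
    sample = (
        "[WARN] client: missed heartbeat: req_latency=1m2.564541959s "
        "heartbeat_ttl=15.037957473s since_last_heartbeat=1m17.602999075s"
    )
    for name, seconds in heartbeat_timings(sample).items():
        print(f"{name}: {seconds:.1f}s")
```

For the warning in the logs above, this shows the heartbeat arriving roughly 62.6 s after a TTL of about 15 s had expired, and almost all of that delay is the request latency itself (about 62.6 s).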