Our Nomad client "randomly" starts returning 500 Internal Server Error for /v1/agent/health, or the requests time out, which causes the VM health checks to fail and the VM to restart. The VM restarts only after the client fails 10 health checks at 15-second intervals.
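For context, the probe can be reproduced against the local agent with something like the sketch below (an assumption on my part that the agent listens on 127.0.0.1:4646 over plain HTTP with no ACL token; adjust for your setup):

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"time"
)

func main() {
	// Give each probe a per-request budget, similar to a single health check attempt.
	client := &http.Client{Timeout: 5 * time.Second}

	// Assumed address; change if the agent binds elsewhere or uses TLS/ACLs.
	resp, err := client.Get("http://127.0.0.1:4646/v1/agent/health")
	if err != nil {
		fmt.Println("probe failed (timeout or connection error):", err)
		return
	}
	defer resp.Body.Close()

	body, _ := io.ReadAll(resp.Body)
	fmt.Printf("status=%d body=%s\n", resp.StatusCode, body)
}
```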
We had two incidents on this client (at 19:33 and 20:17 UTC). Everything looks fine in the logs until the restart. Once we enabled debug logging, we started seeing a mass of ^@ characters in the logs right before/after the restart.
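To be precise about what we mean by ^@: these appear to be runs of literal NUL (0x00) bytes in the file. A minimal sketch for locating them (the log path is just an example, not our actual path):

```go
package main

import (
	"fmt"
	"os"
)

func main() {
	// Example path; substitute the actual client log file.
	data, err := os.ReadFile("/var/log/nomad/client.log")
	if err != nil {
		fmt.Println("read failed:", err)
		return
	}

	// Report each contiguous run of NUL bytes with its offset and length.
	start := -1
	for i, b := range data {
		if b == 0 && start < 0 {
			start = i
		} else if b != 0 && start >= 0 {
			fmt.Printf("NUL run at offset %d, length %d\n", start, i-start)
			start = -1
		}
	}
	if start >= 0 {
		fmt.Printf("NUL run at offset %d, length %d\n", start, len(data)-start)
	}
}
```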
We run a mix of dockerized apps and plain binaries under Nomad. We use some custom task drivers, but we are rewriting those so their work is handled by services running as system jobs.
Nomad itself runs as a supervisord service.
These are the logs for the incident at 19:33 UTC.
client.txt (13.1 KB)
server.txt (11.2 KB)
Is the null character ever expected in the logs? Do you have any idea what could be causing this, or what additional logs we could inspect to get a better picture of what is happening?
(Related issue I created: Nomad client health checks randomly starts failing · Issue #20422 · hashicorp/nomad · GitHub)