Nomad client health checks randomly start failing

The Nomad client “randomly” starts returning 500 Internal Server Error for /v1/agent/health, or the request times out entirely, which causes the VM health checks to fail and the VM to restart. The client must fail 10 health checks at 15-second intervals before the VM is restarted.
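For reference, the external VM health check behaves roughly like the sketch below. This is only an illustration of the policy, not our actual monitoring code; the 5-second probe timeout and the assumption that the failures are counted consecutively are mine.

```go
package main

import (
	"log"
	"net/http"
	"time"
)

func main() {
	// Default Nomad HTTP address; adjust if the agent binds elsewhere.
	const endpoint = "http://127.0.0.1:4646/v1/agent/health"
	const interval = 15 * time.Second
	const maxFailures = 10

	// The 5s probe timeout is an assumption for illustration.
	client := &http.Client{Timeout: 5 * time.Second}
	failures := 0

	for range time.Tick(interval) {
		resp, err := client.Get(endpoint)
		if err == nil {
			resp.Body.Close()
		}
		// Count a failure on timeout/connection error or any non-200 response.
		if err != nil || resp.StatusCode != http.StatusOK {
			failures++
			log.Printf("health check failed (%d/%d): err=%v", failures, maxFailures, err)
		} else {
			failures = 0
		}
		if failures >= maxFailures {
			log.Println("threshold reached: the VM would be restarted here")
			return
		}
	}
}
```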

The following logs are from two incidents on this client (19:33 and 20:17 UTC).

Everything looks fine until the restart. After we enabled debug logs, we started seeing a mass of ^@ (null characters) in the logs right before/after the restart.

We are running a mix of Dockerized apps and plain binaries with Nomad. We also use some custom task drivers, but we are rewriting those so they are handled by services running as system jobs.

We are running Nomad as a supervisord service.
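For context, the supervisord program definition looks roughly like this (paths, log locations, and the stop signal are illustrative placeholders, not our exact config):

```ini
[program:nomad]
; Approximate supervisord program definition; paths and options are illustrative.
command=/usr/local/bin/nomad agent -config=/etc/nomad.d
autostart=true
autorestart=true
; SIGINT asks the Nomad agent to shut down.
stopsignal=INT
stdout_logfile=/var/log/nomad/nomad.stdout.log
stderr_logfile=/var/log/nomad/nomad.stderr.log
```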

These are the logs for the incident at 19:33 UTC:
client.txt (13.1 KB)
server.txt (11.2 KB)

Are null characters ever expected in the logs? Do you have any idea what could be causing this, or which additional logs we could inspect to get a better picture of what is happening?

(Related issue I created — Nomad client health checks randomly starts failing · Issue #20422 · hashicorp/nomad · GitHub)
