Nomad server failure caused reallocation

kkbe · April 27, 2022, 12:00pm

Hi, we found some surprising behaviour on our 3-server Nomad cluster:

We misconfigured the security groups on a single server.
This caused that server to be marked as “left” (I would have expected “failed”).
This in turn caused ~1/3 of all allocations to be reallocated.
I have been trying to understand this behaviour. On each node that had reallocations, there are log lines indicating heartbeat failures, followed by:

client: error discovering nomad servers: error="no Nomad Servers advertising service \"nomad\" in Consul datacenters: [\"x\"]"

This is strange, because the other two Nomad servers were definitely still working. I can’t find information about this in the HashiCorp docs, but a Stack Overflow post indicated that every Nomad client picks a server (at random?) and then always sends its heartbeats and updates to that same server.
If true, does this really mean that a temporary outage of a single server will cause 1/3 of all allocations to be rescheduled? Are there any changes we can make to the clients so that, in the event of a server failure, they send heartbeats to the other working servers?
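
To make the arithmetic behind that question concrete, here is a minimal sketch of the hypothesis only. It is not Nomad's actual server-selection code. It assumes each client picks one server uniformly at random and never switches, and the function name, client count, and seed are made up for illustration.

```python
import random


def fraction_affected(num_clients, num_servers, failed_server, seed=0):
    """Fraction of clients pinned to the failed server.

    Assumption (unconfirmed, taken from the hypothesis in this post):
    every client picks one server uniformly at random and keeps
    heartbeating to that same server, with no failover.
    """
    rng = random.Random(seed)
    # Each client independently picks a server index once.
    chosen = [rng.randrange(num_servers) for _ in range(num_clients)]
    # Clients pinned to the failed server would miss their heartbeats.
    affected = sum(1 for server in chosen if server == failed_server)
    return affected / num_clients


if __name__ == "__main__":
    # Hypothetical cluster: 3 servers, 9000 clients, server 0 unreachable.
    print(fraction_affected(9000, 3, failed_server=0))
```

Under that assumption the result sits close to 1/3 for a 3-server cluster, which matches the share of allocations we saw rescheduled. That is consistent with the hypothesis but does not prove it.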
Thanks
