Hi, we found some surprising behaviour on our 3-server Nomad cluster:
- we misconfigured the security groups on a single server
- this caused the server to be marked as “left” (I would have expected “failed”)
- this caused ~1/3 of all allocations to be reallocated
I have been trying to understand this behaviour. On each client node whose allocations were rescheduled, there are log lines indicating heartbeat failures, followed by:
client: error discovering nomad servers: error="no Nomad Servers advertising service \"nomad\" in Consul datacenters: [\"x\"]
This is strange, because the other two Nomad servers were definitely still working. I can’t find any information about this in the HashiCorp docs, but a Stack Overflow post indicated that every Nomad client picks a server (at random?) and then always sends its heartbeats and updates to that same server.
If true, does this really mean that a temporary outage of a single server will cause ~1/3 of all allocations to be rescheduled? Are there any changes we can make to the clients so that, in the event of a server failure, they send heartbeats to the other working servers?