I have a Nomad cluster running with 3 servers and 3 clients. A few jobs are running, most of which have a task group with `count = 2` and a `spread` block on the AWS availability zone attribute to prefer placing allocations on clients in specific AZs. I used this to manually balance the load across those 3 clients.
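For reference, the relevant part of my job spec looks roughly like this (the job name, AZ names, and percentages are illustrative, not my exact values):

```hcl
job "my-job" {
  datacenters = ["dc1"]

  group "app" {
    count = 2

    # Prefer spreading allocations across the specified AZs.
    spread {
      attribute = "${attr.platform.aws.placement.availability-zone}"

      target "us-east-1a" {
        percent = 50
      }

      target "us-east-1b" {
        percent = 50
      }
    }
  }
}
```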
Last night one of the clients failed for unknown reasons. The auto scaling group terminated that instance and launched a new one, which came up healthy. In the meantime, Nomad detected the client failure and created replacement allocations on the two surviving clients. Everything is now running with the correct number of instances, but two clients bear all of the load while the third sits idle. I'd like recovery from a failed client to automatically rebalance the load, but I'm not sure how to go about it.
I think this issue is relevant, but it's unclear to me whether automatic rebalancing after a client failure simply isn't supported yet, or whether I just don't understand how to use the allocation lifecycle APIs to make it work. Can anyone clarify this for me?
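In case it helps frame the question: my current workaround idea is to manually stop the doubled-up allocations one at a time, on the assumption that the scheduler will re-apply the spread preferences when placing the replacements. Something like this (the job name and alloc ID are placeholders):

```shell
# List the job's allocations to find the ones doubled up on one client.
nomad job status my-job

# Stop one allocation on the overloaded client; Nomad should reschedule it,
# re-evaluating the spread block during placement.
nomad alloc stop -detach c3b8e6f2
```

Is there a way to get this behavior automatically after a node failure, instead of running it by hand?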
Also, the failed client's allocations are still in the `running` state hours after the failure. So it seems the failure triggered an evaluation and the creation of new allocations, but not the removal of the old ones. Confused by this, I ran `nomad system gc && nomad system reconcile summaries` to see if it would clean things up, but those allocations still show as running. Will they enter the failed state after some period of time? And what would happen if the failed client came back up?
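For context, this is roughly what I ran while investigating (the job name and node ID are placeholders for mine):

```shell
# Attempt to clean up stale state.
nomad system gc && nomad system reconcile summaries

# The replacement instance registered fine and shows as ready.
nomad node status

# But the old allocations from the terminated node are still listed as running.
nomad job status my-job
nomad node status -verbose 9c1d2e3f
```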