Recovery from failed client

I have a Nomad cluster running with 3 servers and 3 clients. There are a few jobs running, most of which have a task group with count = 2 and use a spread block on the AWS availability zone attribute to prefer allocating to clients in specific AZs. I used this to manually balance the load across those 3 clients.
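For reference, the spread setup is along these lines (a rough sketch; the job, group, and AZ names here are placeholders rather than my actual config):

```hcl
job "example" {
  datacenters = ["dc1"]

  group "app" {
    # Two instances per task group, as described above.
    count = 2

    # Spread allocations across AWS availability zones using Nomad's
    # fingerprinted AWS placement attribute. Target percentages steer
    # placement toward specific AZs.
    spread {
      attribute = "${attr.platform.aws.placement.availability-zone}"
      weight    = 100

      target "us-east-1a" {
        percent = 50
      }

      target "us-east-1b" {
        percent = 50
      }
    }

    # ... tasks omitted ...
  }
}
```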

Last night one of the clients failed for unknown reasons. The auto scaling group terminated that instance and launched a new one, which came up healthy. In the meantime, Nomad detected the client failure and created new allocations on the two other clients. Everything is now running with the correct number of instances, but two clients are bearing all of the load and one is idle. I’d like to set things up so that recovery from a failed client automatically rebalances the load, but I’m not sure how to go about it.

I think the “Ability to rebalance allocation placements” issue (hashicorp/nomad#1635) is relevant, but it’s unclear to me whether automatic rebalancing after a client failure simply isn’t supported yet or whether I just don’t understand how to use the allocation lifecycle APIs to make this work. Can anyone clarify this for me?

Also, the failed client’s allocations are still in the running state hours after the failure. So it seems the failure triggered an evaluation and the creation of new allocations, but not the removal of the old allocations. Confused by this, I ran nomad system gc && nomad system reconcile summaries to see whether that would clean things up, but those allocations still show as running. Will they enter the failed state after some period of time? And what would happen if the failed client came back up again?

-Jesse

Hi @Jesse_S! You’re right that “Ability to rebalance allocation placements” (hashicorp/nomad#1635 on GitHub) is about automatic rebalancing. The “lifecycle API” referred to there does exist, so you should be able to rebalance manually via nomad alloc stop $alloc_id. That stops the allocation you’ve picked, and the scheduler then runs a new evaluation, which should result in the replacement allocation being placed as expected (depending on current resources, binpacking vs. spread blocks, etc.).
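A rough sketch of that workflow from the CLI (the job name and allocation ID are placeholders):

```shell
# List the job's allocations and note which client node each one landed on
nomad job status example

# Inspect a specific allocation to confirm its node, if needed
nomad alloc status <alloc_id>

# Stop an allocation on one of the overloaded clients; the scheduler runs
# a new evaluation and places a replacement, which can land on the idle
# client depending on spread weights, binpacking, and free resources
nomad alloc stop <alloc_id>
```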

Thanks for the quick response, @tgross. Do I also need to stop the failed client’s allocations manually, or will they get cleaned up through some other mechanism? I don’t understand why they’re still in the running state after the evaluation that led to the new allocations being created.

Never mind! Those allocations don’t show up from the command line, and they disappeared from my browser when I reloaded the page. So it seems to have been a stale view in the web UI, or just user error.