Bear with me while I set the scene for my conundrum.
I am trying to work out an upgrade process for my Nomad and Consul “control plane” servers.
In my test environment I currently use Terraform to spin up 3 server nodes running the Consul and Nomad servers, plus 4 further client nodes. I can add a job to be distributed across the client nodes, and it works wonderfully.
I can then spin up another full stack as before, but running the next version of Consul and Nomad. These nodes join the existing clusters, so we now have 6 servers and 8 clients running a mix of versions.
So here is the challenge: when I remove the ‘older’ version VMs one at a time, all the jobs are migrated, new leaders are elected, and the upgrade is seamless.
However, when I use Terraform to remove the ‘old’ VMs it wipes them all out at once, and it appears that losing 3 Nomad server nodes (including the elected leader) simultaneously is enough to cause a Nomad crash that requires manual intervention to recover, which has a service-affecting impact. I do have an on-destroy block which sends a service-down message for Nomad and then Consul in all cases, and this works perfectly when done in a staggered, one-at-a-time fashion.
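For context, the on-destroy block I mentioned is along these lines. This is a rough sketch rather than my exact config: the resource type, connection details, and exact service commands are placeholders and will vary by provider and setup.

```hcl
resource "aws_instance" "server" {
  count = 3
  # ... provider-specific arguments elided ...

  # Destroy-time provisioner: drain workloads and leave the cluster
  # gracefully before Terraform deletes the VM.
  provisioner "remote-exec" {
    when = destroy
    inline = [
      "nomad node drain -self -enable -yes", # migrate allocations off this node
      "sudo systemctl stop nomad",
      "consul leave",                        # gracefully leave the Consul cluster
      "sudo systemctl stop consul",
    ]
  }
}
```

This works as expected when nodes are destroyed one at a time; the problem is purely that a full `terraform destroy` runs it against all three servers in parallel.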
I wonder if there is an easy way to get Terraform to destroy one VM at a time, leaving the others intact until each destruction has completed?
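The only workaround I can think of is to script it outside Terraform with `-target`, along these lines (the resource address is a hypothetical example, and the fixed sleep is a crude stand-in for properly polling cluster health):

```shell
# Destroy the old servers one at a time, letting the cluster
# stabilise between each removal (resource address is an example).
for i in 0 1 2; do
  terraform destroy -auto-approve -target="aws_instance.old_server[${i}]"
  sleep 60  # crude wait; ideally poll `nomad server members` instead
done
```

But that feels like fighting the tool, hence the question about a native option.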
I have not seen anything, but I might be looking in the wrong places.
There are a lot of historic GitHub issues, since closed, asking for this sort of functionality.
Many thanks in advance to anyone who has any ideas.