Hi all,
Bear with me while I set the scene of my conundrum.
I am trying to work through an upgrade process for my Nomad and Consul “control plane” servers.
In my test I currently use Terraform to spin up 3 server nodes, which run the Consul and Nomad servers, and I also spin up 4 further client nodes. I can add a job to be distributed over the client nodes. It works wonderfully.
I can then spin up another full stack as before, but running the next version up of Consul and Nomad. These join the clusters and we now have 6 servers and 8 clients running a mix of versions.
So here is the challenge: when I remove the ‘older’ version VMs one at a time, all the jobs are migrated, new leaders are elected, and the upgrade is seamless.
However, when I use Terraform to remove the ‘old’ VMs it wipes them all out at once. It appears that losing 3 Nomad server nodes at once, including the elected leader, is enough to cause a Nomad crash that requires manual intervention to recover, and this does have a service-affecting impact. I do have an on-destroy block which sends a service-down message for Nomad and then Consul in all cases (which works perfectly when done in a staggered, one-at-a-time fashion).
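My guess at why the all-at-once removal hurts is plain Raft majority arithmetic. Here is a rough sketch in Python, assuming all 6 servers are voting peers and that the 3 old ones vanish before the cluster has processed their leave requests (both are assumptions on my part, and the function names are just illustrative):

```python
def quorum(voters: int) -> int:
    """Majority needed by Raft for a given number of voting servers."""
    return voters // 2 + 1


def survives_abrupt_loss(voters: int, lost: int) -> bool:
    """True if the servers left over still form a majority of the old peer set."""
    return voters - lost >= quorum(voters)


print(quorum(6))                   # 4
print(survives_abrupt_loss(6, 3))  # False: 3 remaining servers < quorum of 4
print(survives_abrupt_loss(6, 1))  # True: 5 remaining servers >= quorum of 4
```

If that is right, it would also explain why the staggered removal works: each server leaves cleanly, the peer set shrinks by one, and a majority is always still present for the next step.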
I wonder if there is an easy way to get Terraform to destroy one VM at a time, leaving the others intact until each destruction has completed?
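To make it concrete, this is roughly what I end up doing by hand today, written as a small Python wrapper around `terraform destroy -target=...`. It is only a sketch: the resource addresses are hypothetical placeholders (the real ones would come from `terraform state list`), and it does not yet wait for the cluster to settle between steps.

```python
import subprocess

# Hypothetical addresses for the old server VMs; substitute your own.
OLD_SERVERS = [
    "example_vm.old_server[0]",
    "example_vm.old_server[1]",
    "example_vm.old_server[2]",
]


def destroy_commands(addresses):
    """Build one targeted 'terraform destroy' command per resource address."""
    return [
        ["terraform", "destroy", "-auto-approve", f"-target={address}"]
        for address in addresses
    ]


def destroy_one_at_a_time(addresses, dry_run=True):
    """Run the destroy commands sequentially, stopping at the first failure."""
    for cmd in destroy_commands(addresses):
        if dry_run:
            print(" ".join(cmd))  # only show what would be run
        else:
            subprocess.run(cmd, check=True)
            # A real version would wait here until a leader is elected
            # and jobs have migrated before touching the next VM.


destroy_one_at_a_time(OLD_SERVERS)  # dry run by default
```

It works, but it sits outside the normal Terraform workflow, which is why I am hoping there is something built in.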
I have not seen anything, but I might be looking in the wrong places.
There are a lot of historic git issues asking for this sort of functionality that have since been closed.
Many thanks in advance to anyone who has any ideas.
A.