I’m trying to get my head around reducing a group count (say from 3 to 2) in a 3-node cluster so that I can drain a single node prior to OS patching / restart.
On a Vagrant 3-node test cluster, I have a simple job with a single task group at count = 3, and the job is healthy and stable.
My assumption would be that I could:
- Mark one node as ineligible
- Reduce the job group count from 3 to 2
- The scheduler would identify that it ‘should’ kill the allocation on the ineligible node
- I could then drain the node and patch / restart (rough command sequence sketched after this list)
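For reference, this is roughly the sequence I’m running; the node ID and job file name below are placeholders for my setup:

```shell
# Mark the target node as ineligible so no new allocations are scheduled on it
nomad node eligibility -disable 1f3e0c5b   # placeholder node ID

# Edit the job file to drop the group count from 3 to 2, then submit it
nomad job plan example.nomad   # placeholder job spec
nomad job run example.nomad

# Once the count is reduced, drain the node before patching / restarting the OS
nomad node drain -enable -deadline 5m 1f3e0c5b
```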
The issue is that the scheduler kills a seemingly random allocation, so when I go to drain the node, the allocation still running on it is migrated back to another node (where an allocation was effectively just killed as part of the reduced count).
Is there a more efficient way to reduce the group count and control which client node the allocation is killed on (i.e. have the scheduler prefer ineligible nodes with running allocations)?
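By “examine ineligible nodes with running allocations”, I mean something along the lines of this manual check (placeholder node ID again):

```shell
# Show eligibility plus a count of running allocations per node
nomad node status -allocs

# List the allocations still running on the ineligible node
nomad node status 1f3e0c5b
```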