Is there a way to trigger a reboot of client nodes?
I’m looking for a way to drain a node, upgrade system packages on the host, reboot it, and then re-add or re-enable the same jobs. In my case there are usually no spare empty nodes, so migrating the jobs elsewhere is not an option.
I’ve found a related CloudFlare blog post and an open GitHub issue.
There are also feature requests related to Consul maintenance mode, but ideally this would also work in Nomad without Consul integration.
The steps could be:
- Mark a node for maintenance.
- Drain the node and wait.
- Do the maintenance, e.g. update packages on the host.
- Reboot the node.
- Mark the node as eligible again so the jobs come back (rough CLI sketch below).
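Roughly, the manual version of those steps with the current CLI would be something like this (the node ID and the apt commands are placeholders for whatever fits your hosts):

```shell
# Drain the node with a deadline; this marks it ineligible for new placements
# and migrates its allocations to other nodes (the jobs' migrate stanza
# controls how service allocations are moved)
nomad node drain -enable -deadline 1h c3a6f96e

# Check that no allocations are left on the node
nomad node status c3a6f96e

# Do the maintenance on the host, then reboot
sudo apt-get update && sudo apt-get -y upgrade
sudo reboot

# After the agent re-registers, allow scheduling on the node again
nomad node eligibility -enable c3a6f96e
```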
I’m also interested in this. We haven’t faced this issue yet as our cluster is not that old, but it will come to us sooner rather than later.
I would appreciate some input on this too.
My guess would be:
- Add a new client to the cluster (already updated).
- Migrate the jobs to this new client.
- Destroy the old client.
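In CLI terms that would look roughly like this (the node ID is a placeholder):

```shell
# Confirm the new, already-updated client has registered with the cluster
nomad node status

# Push the work off the old client; the jobs' migrate stanza controls how
# allocations are moved over to the other nodes
nomad node drain -enable -no-deadline 0d39b7c2

# Once the old node is empty, stop the Nomad agent on that host and
# decommission it; the down node is eventually garbage-collected
```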
If you are not using a cloud provider and cannot spin up an extra node like this, you might have to accept some downtime.
I would shrink the jobs so they fit on the remaining clients somehow (lower the count, for example, or reduce resource limits), remove the client from the cluster, update it, and then add it back. This can be repeated for each client node. Once done, you can restore the job counts and their resources to normal.
I’m sure there is a fancier way to do this, but this is my approach.
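If your Nomad version has it, `nomad job scale` is a handy way to make that temporary count change without editing the job file (the job and group names here are made up):

```shell
# Temporarily shrink the "web" group of the "frontend" job so everything
# fits on the remaining clients
nomad job scale frontend web 2

# ...do the client maintenance...

# Restore the original count afterwards
nomad job scale frontend web 4
```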
Edit:
Instead of editing the job counts, it could be useful to set priorities between jobs, so the more important ones get scheduled before the less important ones.
You have to enable preemption to allow Nomad to evict running allocations in order to schedule higher-priority jobs.
Test it first! I haven’t tried this myself.
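On recent Nomad versions, preemption for service jobs is toggled through the scheduler configuration, and job priority is just the priority value in the job stanza (default 50). Something like this, though I haven’t battle-tested it either:

```shell
# Show the current scheduler configuration, including preemption settings
nomad operator scheduler get-config

# Allow the service scheduler to preempt lower-priority allocations
nomad operator scheduler set-config -preempt-service-scheduler=true
```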
Sounds like you want a Nomad operator…
Using the API you can mark a node for draining.
If the jobs on that node have a migrate stanza, they will be migrated off the node while it is in the draining state.
Once there are no allocations left on the node, it can leave the cluster for maintenance and then re-join when the maintenance is done.
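For example, something like this against the HTTP API should enable a drain with a deadline (the node ID and address are placeholders, the deadline is in nanoseconds; double-check the node drain endpoint docs for your Nomad version):

```shell
# Enable draining on a node with a one-hour deadline
curl -X POST \
  --data '{"DrainSpec": {"Deadline": 3600000000000, "IgnoreSystemJobs": false}}' \
  http://127.0.0.1:4646/v1/node/c3a6f96e-9d56-47c1-a5c8-2c8d1f1a6e3b/drain
```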
If we imagine the operator doing all of this, I would do something like:
- Create a job for the operator.
- The job runs periodic probes on the nodes to see whether they are due for maintenance.
- Nodes which are due for maintenance get put into a list.
- For each node in the list:
  - Drain the node.
  - When the node is empty, run the update playbook against it.
  - Re-join the node to the cluster.
  - Pop the node off the list.
One could imagine a small application which keeps watch over the nodes in this way, based on probes which check their compliance status. Of course, the application itself (the operator) would also be migrated when the node it is running on is set for maintenance.
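As a very rough sketch of that loop (everything here is hypothetical: the probe, the node IDs, the playbook name and the hostname lookup are stand-ins for your own tooling):

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical probe: replace with whatever compliance check decides which
# client nodes are due for maintenance; it should print Nomad node IDs.
nodes_needing_maintenance() {
  echo "c3a6f96e 9f1b2d4a"
}

# Hypothetical helper: map a Nomad node ID to a hostname Ansible can reach.
resolve_node_address() {
  nomad node status "$1" | awk '$1 == "Name" {print $3; exit}'
}

for node_id in $(nodes_needing_maintenance); do
  # Drain the node; the CLI monitors the drain until the node is empty
  nomad node drain -enable -deadline 1h "$node_id"

  # Run the update/reboot playbook against the now-empty host
  # (the playbook name is a placeholder)
  ansible-playbook -l "$(resolve_node_address "$node_id")" update-and-reboot.yml

  # Once the agent has re-registered, let the node take work again
  nomad node eligibility -enable "$node_id"
done
```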
The key thing here, it seems, is the migrate {} stanza.
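For reference, a minimal migrate stanza in the job spec looks like this (the values shown are the documented defaults):

```hcl
migrate {
  max_parallel     = 1        # how many allocations to migrate at a time
  health_check     = "checks" # wait for health checks before continuing
  min_healthy_time = "10s"
  healthy_deadline = "5m"
}
```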