What is the best approach for updating long-draining jobs?

I have a Nomad cluster running and want to deploy a job that holds open TCP connections for long periods of time. When I update the job with a new version, I want the now “old” jobs to receive a signal to shut down (SIGHUP or something like that) so the service knows it can safely exit once all of its TCP connections are gone. I do not want Nomad to kill any of those old jobs, no matter how long it takes for their connections to drain naturally. It could take a week for all connections to drain and I am OK with that. If the node running the old jobs suddenly shuts down, I would not want Nomad to restart those old jobs, only the new ones.

I could not easily find how to fit an update strategy like this into a job’s “update” configuration. It seems like everything is looking for actual timelines on updates, and the new deployment won’t be marked healthy until the old deployment is fully shut down. Is my understanding accurate, or is there a way to fit the scheme I have above into the current job specification? If so, I would appreciate a sample. The only alternative I came up with was to create a brand new job for each deployment and effectively never update a job, relying on some templating and external tools to create job names and such. That idea felt somewhat gross, but it’d make me feel better if that were the expected pattern for a workload like this.
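For context, here is a minimal sketch of the kind of timeline-driven update block I mean; the job name and the specific values are just placeholders I pulled together from the docs, not something I expect to fit a week-long drain:

```hcl
job "example" {
  group "service" {
    # Every knob here assumes the deployment converges within minutes,
    # which is exactly the part that does not fit connections that may
    # take a week to drain.
    update {
      max_parallel      = 1
      health_check      = "checks"
      min_healthy_time  = "30s"
      healthy_deadline  = "5m"
      progress_deadline = "10m"
      auto_revert       = true
    }
  }
}
```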

Thanks!


@deployer_guy it sounds like you probably want to set the client config max_kill_timeout and each task’s kill_timeout to effectively infinite values, then set kill_signal to a signal your task can trap to start its shutdown procedure. Nomad would then put the tasks into the stopping state but wait for them to actually exit before doing anything else.
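A rough sketch of what that could look like, assuming the Docker driver; the job name, image, and durations are only placeholders:

```hcl
# Nomad client agent configuration on each node. max_kill_timeout caps
# how large a task's kill_timeout may be (the default cap is 30s), so
# it has to be raised for very long drains.
client {
  enabled          = true
  max_kill_timeout = "720h"
}
```

```hcl
job "long-drain" {
  group "service" {
    task "server" {
      driver = "docker"

      config {
        image = "example/long-drain-service:latest"
      }

      # Signal the task traps to stop accepting new work and begin
      # draining its open TCP connections.
      kill_signal = "SIGHUP"

      # How long Nomad waits after sending kill_signal before force
      # killing the task; must fit within the client's max_kill_timeout.
      kill_timeout = "720h"
    }
  }
}
```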

@seth.hoenig How does that interact with rolling out new versions? My understanding is that in a rolling release configuration, Nomad will wait until each running instance of a job has finished (not just entered the stopping state) before moving on to the next instance. I don’t want that behavior: I want Nomad to send the signal, move on to update the next instance, and then leave each old instance alone until it actually exits. I also want to be able to use Consul’s DNS to resolve the old instances by name until they actually exit. Is this setup possible?


Did you work out a solution to this? I have a similar scenario: applications that can take hours or days to deploy as they work through processing large amounts of data. In the meantime, the previous versions need to continue running and only be terminated once the new version is successfully deployed.

We have tried different iterations of the update stanza with mixed and unsatisfactory results. For now, we have settled on the new-job-name-per-deployment hack (roughly sketched below). It works, but it is not as efficient a solution as I think it could be.
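For what it’s worth, the hack looks roughly like this: our deploy tooling renders a fresh job file per release with a versioned name, so each deployment registers a brand-new job and the old one keeps running until we stop it ourselves. The names and the version scheme here are just illustrative:

```hcl
# Rendered per deployment by an external templating step; "v42" is
# replaced with the new release identifier before `nomad job run`.
job "data-processor-v42" {
  group "workers" {
    task "process" {
      driver = "docker"

      config {
        image = "example/data-processor:v42"
      }
    }
  }
}
```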