Restart policy of successful batch jobs

tobiasmuehl · January 25, 2021, 11:04am

Hi, I'm not using Nomad yet, but it sounds highly promising. Our application consists of many individual batch jobs that should run continuously. Our current manual deployment, prior to Nomad, is to start a tmux session on a server and run while :; do ./batch-job.sh ; sleep 5m ; done in bash. Regular cron jobs are possible, although we only want to run one instance of a job at a time. Docker Compose implements this with the restart: always directive.

Can we use Nomad in this “infinite loop of short-lived batch jobs” fashion using the reschedule stanza? The docs only mention task failure, not plain task completion.

shantanugadgil · January 25, 2021, 11:34am

Hi, in my opinion the comparison is not equivalent (and I think there is a way to achieve what you want).

In Nomad, cron-style (i.e. periodic) jobs have an option to prevent overlap, so possibly that can help?
In docker-compose, I think (I could be wrong) restart: always restarts a failed container, i.e. the expectation is that the container is of type service (in Nomad speak): if external forces don’t cause the container to exit, it would run forever.

But I think you don’t want multiple invocations (i.e. allocations) from the periodic job, as the “name” of each allocation is different, and wrapper code would be needed to track the “current allocation”.

That said, could you try making the job of type service and setting the restart and reschedule stanzas appropriately to get effectively the same result?

The default strategy is to back off, which can sometimes give the impression that the service job is not restarting.
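
A rough sketch of what the service-type variant might look like, assuming (as far as I understand the restart tracker) that a task in a service job is restarted even when it exits with code 0. The job name, group name, datacenter, script path and the raw_exec driver are all placeholders, not something from this thread, and raw_exec has to be enabled on the client:

```hcl
# Hypothetical job: names, path and driver are assumptions for illustration.
job "batch-loop" {
  datacenters = ["dc1"]
  type        = "service" # service tasks are restarted after they exit

  group "worker" {
    count = 1 # only one instance at a time

    restart {
      delay    = "5m"    # roughly the "sleep 5m" of the bash loop
      interval = "1h"
      attempts = 20      # more than can occur in 1h with a 5m delay
      mode     = "delay" # if the limit is ever hit, wait instead of failing
    }

    task "batch-job" {
      driver = "raw_exec"

      config {
        command = "/usr/local/bin/batch-job.sh"
      }
    }
  }
}
```

The gap between runs will only be approximately five minutes, since Nomad adds some jitter to the restart delay.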

tobiasmuehl · January 25, 2021, 11:51am

Periodic batch jobs with prohibit_overlap = true should work for our use case. The timing is not exactly the same as with the docker-compose or bash loop solution, but I’m not fussed about it, since our batch jobs aren’t time-sensitive.
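
For anyone finding this later, a minimal sketch of such a job, assuming a five-minute schedule. The job name, group and task names, datacenter, script path and the raw_exec driver are placeholders chosen for illustration (raw_exec must be enabled on the client):

```hcl
# Hypothetical job: names, path and driver are assumptions for illustration.
job "batch-job" {
  datacenters = ["dc1"]
  type        = "batch"

  periodic {
    cron             = "*/5 * * * *" # launch every five minutes
    prohibit_overlap = true          # wait for the previous instance to complete
  }

  group "batch" {
    task "run" {
      driver = "raw_exec"

      config {
        command = "/usr/local/bin/batch-job.sh"
      }
    }
  }
}
```

Unlike the bash loop, the interval here is measured between launch times rather than from the end of one run to the start of the next.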

Thanks for the quick response!
