Timeout job placement failures

spaulg · April 29, 2021, 10:40am

I am running into scenarios while developing an integration with Nomad where a job cannot be placed due to a lack of resources (memory). Nomad cannot place the job within the cluster, but rather than failing the job immediately it lets the job wait until resources become available.

Instead, I’d like Nomad to treat the placement as a failure after a timeout, which could be short enough to mean immediately. Is this possible?

jrasell · May 14, 2021, 10:41am

Hi @spaulg,

I do not believe this is currently possible. Nomad will put the job into a blocked state with the hope that it will eventually be unblocked due to cluster scaling, preemption or other work finishing.

I’d be curious to understand the use case you have to require the job to fail, rather than become blocked?

Thanks,
jrasell and the Nomad team

spaulg · May 14, 2021, 1:24pm

Hi @jrasell

The app I’m developing has an administrative UI used to launch applications. Each application requires a number of steps for installation, configuration, etc., which run as batch jobs, followed by a service job for the final web server and backend application.

The problem I’m facing is that the batch jobs are passed to Nomad but never start because of resource constraints. Instead of the job simply failing and the UI reporting the failure, I’m waiting for the UI to pick up the failure using a timeout.

For my use case, if the cluster has no more resources, then it’s unlikely that will change, as resource usage does not fluctuate all that much.

Therefore, it makes little sense to wait 5 minutes before timing out the job, rather than regarding the job as failed immediately because it never started.

Alternatively, can I detect through the API that a job is blocked due to resource constraints?

Thanks
spaulg
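
On the question of detecting blocked jobs through the API, here is a minimal sketch, not a confirmed recommendation from the Nomad team. It assumes the job's evaluations are read from `GET /v1/job/<job_id>/evaluations`, that a blocked evaluation reports `Status` of `"blocked"`, and that placement failure details appear under `FailedTGAllocs` (with `DimensionExhausted` and `ConstraintFiltered` counts). The agent address, the absence of ACL tokens, and the helper names are assumptions made for illustration.

```python
import json
import urllib.request

# Assumption: a local Nomad agent on the default HTTP port, no ACL token needed.
NOMAD_ADDR = "http://127.0.0.1:4646"


def fetch_evaluations(job_id, addr=NOMAD_ADDR):
    """Return the parsed evaluation list for a job."""
    with urllib.request.urlopen(f"{addr}/v1/job/{job_id}/evaluations") as resp:
        return json.load(resp)


def is_blocked(evaluations):
    """True if any evaluation for the job is waiting in the blocked state."""
    return any(ev.get("Status") == "blocked" for ev in evaluations)


def placement_failures(evaluations):
    """Map task group name -> reasons the scheduler could not place it.

    Evaluations are walked in ModifyIndex order so that the most recently
    updated evaluation wins when several report the same task group.
    """
    failures = {}
    for ev in sorted(evaluations, key=lambda e: e.get("ModifyIndex", 0)):
        for group, metrics in (ev.get("FailedTGAllocs") or {}).items():
            failures[group] = {
                "exhausted": metrics.get("DimensionExhausted") or {},
                "constraints": metrics.get("ConstraintFiltered") or {},
            }
    return failures
```

A UI backend could call `fetch_evaluations` shortly after submitting the job and report a failure as soon as `is_blocked` returns true, using `placement_failures` to show whether memory or a constraint was the cause.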

tomqwpl · October 13, 2021, 8:41am

I also have this question. I’m trying to work out whether there is some kind of policy like “restart policy” or “reschedule policy” to describe what to do in the case of a placement failure.
In my case a placement failure is likely to be permanent, due to something like a constraint on the kernel type, or a constraint on a version of software being installed on the node. I’m currently just detecting and analysing placement failures externally and deregistering the job, but it would be nice if there was a policy you could give. I can imagine scenarios where you want to say “unless you can run this job within 30 seconds, don’t bother and fail it instead”.
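
The external "run within 30 seconds or fail" approach described above could be sketched roughly as follows. This is an illustration under assumptions, not a built-in Nomad policy: it polls `GET /v1/job/<job_id>/evaluations`, treats a `"blocked"` or `"pending"` evaluation status as "not placed yet", and deregisters the job with `DELETE /v1/job/<job_id>` once the deadline passes. The agent address, the lack of ACL handling, and all function names are hypothetical.

```python
import json
import time
import urllib.request

# Assumption: a local Nomad agent on the default HTTP port, no ACL token needed.
NOMAD_ADDR = "http://127.0.0.1:4646"


def fetch_evaluations(job_id, addr=NOMAD_ADDR):
    """Return the parsed evaluation list for a job."""
    with urllib.request.urlopen(f"{addr}/v1/job/{job_id}/evaluations") as resp:
        return json.load(resp)


def deregister(job_id, addr=NOMAD_ADDR):
    """Stop the job by deregistering it."""
    req = urllib.request.Request(f"{addr}/v1/job/{job_id}", method="DELETE")
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def placement_state(evaluations):
    """Classify a job as 'blocked', 'pending', or 'done' from its evaluations."""
    statuses = {ev.get("Status") for ev in evaluations}
    if "blocked" in statuses:
        return "blocked"
    if not statuses or "pending" in statuses:
        return "pending"  # scheduler has not finished (or not started) yet
    return "done"


def fail_if_unplaced(job_id, timeout_s=30.0, poll_s=2.0,
                     fetch=fetch_evaluations, stop=deregister,
                     clock=time.monotonic, sleep=time.sleep):
    """Return True once scheduling is done; deregister and return False on timeout."""
    deadline = clock() + timeout_s
    while True:
        if placement_state(fetch(job_id)) == "done":
            return True
        if clock() >= deadline:
            stop(job_id)  # give up: the job never got past placement
            return False
        sleep(poll_s)
```

The timeout counts from the moment `fail_if_unplaced` is called, so a very small `timeout_s` may give up while the first evaluation is still pending. The `fetch`, `stop`, `clock`, and `sleep` parameters exist only so the loop can be exercised without a running cluster.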
