Timeout job placement failures

I am running into scenarios whilst developing an integration with Nomad where a job cannot be placed due to a lack of resources (memory). Nomad cannot immediately place the job within the cluster, but it does not fail the job either; it lets the job wait until resources become available.

Instead, I’d like Nomad to regard the placement as a failure after a timeout, which could be zero (that is, fail immediately). Is this possible?

Hi @spaulg,

I do not believe this is currently possible. Nomad will put the job into a blocked state with the hope that it will eventually be unblocked due to cluster scaling, preemption or other work finishing.

I’d be curious to understand your use case: why do you need the job to fail rather than become blocked?

Thanks,
jrasell and the Nomad team

Hi @jrasell

The app I’m developing has an administrative UI used to launch applications. Creating an application requires a number of steps (install, configuration, etc.), which run as batch jobs, followed by a service job for the final web server and backend application.

The problem I’m facing is that the batch jobs are submitted to Nomad but never start because of resource constraints. Rather than the job failing straight away so the UI can report it, the UI only notices the failure when its own timeout expires.

For my use case, if the cluster has no more resources, that is unlikely to change, as resource usage does not fluctuate all that much.

Therefore, it makes little sense to wait 5 minutes before timing out the job, rather than regarding the job as failed immediately because it never started.

Alternatively, can I detect through the API that a job has been blocked due to resource constraints?

Thanks
spaulg
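
On detecting a blocked job through the API: a minimal Python sketch, assuming the job's evaluations are read from `GET /v1/job/<job_id>/evaluations` and that each evaluation carries a `Status` field and, when placement failed, a `FailedTGAllocs` map whose entries include `DimensionExhausted` and `ConstraintFiltered`. The agent address, the function names and the shape of the returned summary are illustrative choices, not anything from this thread. Namespaces and ACL tokens are left out.

```python
import json
import urllib.parse
import urllib.request

NOMAD_ADDR = "http://127.0.0.1:4646"  # assumption: a local agent on the default port


def fetch_evaluations(job_id, addr=NOMAD_ADDR):
    """Return the decoded JSON list from GET /v1/job/<job_id>/evaluations."""
    url = f"{addr}/v1/job/{urllib.parse.quote(job_id, safe='')}/evaluations"
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)


def placement_status(evaluations):
    """Summarise a job's evaluations as (blocked, reasons).

    blocked is True if any evaluation has Status "blocked".
    reasons maps each task group that failed placement to the exhausted
    resource dimensions and filtered constraints reported for it.
    """
    blocked = any(ev.get("Status") == "blocked" for ev in evaluations)
    reasons = {}
    for ev in evaluations:
        # FailedTGAllocs may be absent or null when placement succeeded.
        for group, metric in (ev.get("FailedTGAllocs") or {}).items():
            reasons[group] = {
                "exhausted": metric.get("DimensionExhausted") or {},
                "constraints": metric.get("ConstraintFiltered") or {},
            }
    return blocked, reasons
```

A UI could call `placement_status(fetch_evaluations("my-job"))` shortly after submitting the job (the job ID here is a placeholder) and report a failure as soon as `blocked` is true, using `reasons` to tell the user whether memory or a constraint was the cause.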

I also have this question. I’m trying to work out whether there is some kind of policy like “restart policy” or “reschedule policy” to describe what to do in the case of a placement failure.
In my case a placement failure is likely to be permanent, due to something like a constraint on the kernel type, or a constraint on a version of software being installed on the node. I’m currently just detecting and analysing placement failures externally and deregistering the job, but it would be nice if there was a policy you could give. I can imagine scenarios where you want to say “unless you can run this job within 30 seconds, don’t bother and fail it instead”.
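
The external approach described above (detect the placement failure, then deregister) can be sketched as a client-side timeout. This is only an illustration in Python, under the assumptions that evaluations come from `GET /v1/job/<job_id>/evaluations`, that a job counts as unplaced while any evaluation is `blocked` or `pending`, and that `DELETE /v1/job/<job_id>` stops the job. The name `place_or_fail` and its parameters are hypothetical, and the 30-second default simply mirrors the figure in the post.

```python
import json
import time
import urllib.parse
import urllib.request

NOMAD_ADDR = "http://127.0.0.1:4646"  # assumption: a local agent on the default port


def _job_url(job_id, suffix=""):
    return f"{NOMAD_ADDR}/v1/job/{urllib.parse.quote(job_id, safe='')}{suffix}"


def fetch_evaluations(job_id):
    """GET the job's evaluations as a decoded JSON list."""
    with urllib.request.urlopen(_job_url(job_id, "/evaluations")) as resp:
        return json.load(resp)


def deregister(job_id):
    """DELETE the job, which asks Nomad to stop it."""
    req = urllib.request.Request(_job_url(job_id), method="DELETE")
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def place_or_fail(job_id, timeout=30.0, interval=1.0,
                  fetch=fetch_evaluations, stop=deregister,
                  clock=time.monotonic, sleep=time.sleep):
    """Return True once the job is placed, or stop it and return False
    if it is still blocked or pending when the timeout expires."""
    deadline = clock() + timeout
    while True:
        statuses = [ev.get("Status") for ev in fetch(job_id)]
        waiting = any(s in ("blocked", "pending") for s in statuses)
        if statuses and not waiting:
            return True
        if clock() >= deadline:
            stop(job_id)
            return False
        sleep(interval)
```

Passing `timeout=0` gives the "fail immediately" behaviour asked for at the top of the thread: the evaluations are checked once, and the job is deregistered if it has not been placed. The `fetch`, `stop`, `clock` and `sleep` parameters exist only so the loop can be exercised without a running cluster.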