I accidentally flooded a staging server with a bunch of redundant job dispatch calls. It quickly consumed the available resources on the single client node I had set up and the whole thing fell over. I was able to stop all of the allocations and clean things up, but what constraints can I put into the job definition to prevent this problem in the future?
There is not currently a way to limit the number of job dispatch calls. In a future version of Nomad, we do expect to have rate limiting available on all API endpoints, which would help in this situation.
I presume, then, that the scaling stanza wouldn't apply to a parameterized batch job?
Is there no way for the Nomad server to simply queue or refuse job dispatch calls when there are no available client resources? If so, it seems strange to me that Nomad would let a client be exhausted of resources. But maybe I'm just thinking about this the wrong way?
Chiming in here after more than a year, since I’ve hit the same behaviour. In my case, I have constraints on a job, which gets around the issue of overloading the agent running it. Other dispatches end up in “failed” while there are not enough resources, but the scheduler still knows about them and keeps them in the queue.
As jobs complete, resources are freed and the previously failed allocations are placed.
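For anyone landing here later, here is a minimal sketch of that approach for a parameterized batch job. The job name, node class, Docker image, and resource figures are placeholders, not recommendations; the point is that with explicit sizing and a placement constraint, excess dispatches simply fail placement and wait rather than overrunning the client:

```hcl
job "report" {
  datacenters = ["dc1"]
  type        = "batch"

  # Children are created with `nomad job dispatch report -meta report_id=...`
  parameterized {
    meta_required = ["report_id"]
  }

  group "worker" {
    # Only place children on clients with node_class = "staging" (example constraint).
    constraint {
      attribute = "${node.class}"
      value     = "staging"
    }

    task "run" {
      driver = "docker"

      config {
        image = "example/report-worker:latest"
      }

      # Explicit sizing so the scheduler, not the client, decides when a node is full.
      resources {
        cpu    = 500 # MHz
        memory = 256 # MB
      }
    }
  }
}
```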
To contain a flood of dispatch calls from the job definition itself, these are the knobs Nomad actually gives you; anything beyond them has to live in whatever is doing the dispatching:
Rate Limiting: there is no rate-limit setting in the job spec, and (as noted above) no API rate limiting yet, so throttling has to happen in the caller or in a proxy in front of the Nomad API.
Concurrency Limit: there is also no setting that caps how many dispatched children run at once; the practical equivalent is explicit resources plus placement constraints, as in the example above, so the scheduler stops placing children once the nodes are full.
Retry Behavior: the restart and reschedule stanzas control how many times, and how quickly, failed allocations are retried, which keeps failures from turning into a rapid retry loop; see the sketch after this list.
Queue Length: there is no cap on pending dispatches in the job spec; blocked placements simply queue up, so any hard limit on outstanding children has to be enforced by the dispatching code.
Timeouts: kill_timeout bounds how long a task gets to shut down, but as far as I know there is no built-in limit on a batch task's total runtime, so a runaway child has to be bounded inside the task or by outside tooling.
Health Checks: service checks and alerting on client CPU/memory won't prevent the flood, but they surface it before the node falls over.
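For the retry and timeout points, the relevant stanzas slot into the group and task of a job shaped like the earlier sketch; all the numbers here are placeholders:

```hcl
  group "worker" {
    # At most one in-place restart, then mark the allocation failed.
    restart {
      attempts = 1
      interval = "10m"
      delay    = "15s"
      mode     = "fail"
    }

    # A couple of reschedule attempts with a delay, then give up for good.
    reschedule {
      attempts  = 2
      interval  = "1h"
      delay     = "30s"
      unlimited = false
    }

    task "run" {
      # Time allowed between SIGTERM and SIGKILL on shutdown;
      # this does not cap total runtime.
      kill_timeout = "30s"
      # ... driver, config, and resources as in the earlier example ...
    }
  }
```

Setting restart mode to "fail" rather than "delay" is what keeps a misbehaving child from restarting in place indefinitely; the reschedule stanza then decides whether it gets tried again somewhere else.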