I accidentally flooded a staging server with a bunch of redundant job dispatch calls. It quickly consumed the available resources on the single client node I had set up and the whole thing fell over. I was able to stop all of the allocations and clean things up, but what constraints can I put into the job definition to prevent this problem in the future?
There is not currently a way to limit the number of job dispatch calls. In a future version of Nomad, we do expect to have rate limiting available on all API endpoints, which would help in this situation.
I presume, then, that the scaling stanza wouldn't apply to a parameterized batch job?
Is there no way for the Nomad server to simply queue or refuse job dispatch calls when there are no available client resources? If so, it seems strange to me that Nomad would let a client be exhausted of resources. But maybe I'm just thinking about this the wrong way?
Chiming in here after more than a year, since I’ve hit the same behaviour. In my case, I have constraints on a job, which gets around the issue of overloading the agent running it. Other dispatches end up in “failed” while there are not enough resources, but the scheduler still knows about them and keeps them in the queue.
As jobs complete, resources are freed and the previously failed allocations are placed.
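For anyone landing here later, here is a minimal sketch of that approach for a parameterized batch job. The job name, node class, Docker image, and resource figures are placeholders, not recommendations; the point is that with explicit sizing and a placement constraint, excess dispatches simply fail placement and wait rather than overrunning the client:

```hcl
job "report" {
  datacenters = ["dc1"]
  type        = "batch"

  # Children are created with `nomad job dispatch report -meta report_id=...`
  parameterized {
    meta_required = ["report_id"]
  }

  group "worker" {
    # Only place children on clients with node_class = "staging" (example constraint).
    constraint {
      attribute = "${node.class}"
      value     = "staging"
    }

    task "run" {
      driver = "docker"

      config {
        image = "example/report-worker:latest"
      }

      # Explicit sizing so the scheduler, not the client, decides when a node is full.
      resources {
        cpu    = 500 # MHz
        memory = 256 # MB
      }
    }
  }
}
```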
To contain a flood of dispatch calls from the job definition itself, these are the knobs Nomad actually gives you; anything beyond them has to live in whatever is doing the dispatching:
Rate Limiting: there is no rate-limit setting in the job spec, and (as noted above) no API rate limiting yet, so throttling has to happen in the caller or in a proxy in front of the Nomad API.
Concurrency Limit: there is also no setting that caps how many dispatched children run at once; the practical equivalent is explicit resources plus placement constraints, as in the example above, so the scheduler stops placing children once the nodes are full.
Retry Behavior: the restart and reschedule stanzas control how many times, and how quickly, failed allocations are retried, which keeps failures from turning into a rapid retry loop; see the sketch after this list.
Queue Length: there is no cap on pending dispatches in the job spec; blocked placements simply queue up, so any hard limit on outstanding children has to be enforced by the dispatching code.
Timeouts: kill_timeout bounds how long a task gets to shut down, but as far as I know there is no built-in limit on a batch task's total runtime, so a runaway child has to be bounded inside the task or by outside tooling.
Health Checks: service checks and alerting on client CPU/memory won't prevent the flood, but they surface it before the node falls over.
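For the retry and timeout points, the relevant stanzas slot into the group and task of a job shaped like the earlier sketch; all the numbers here are placeholders:

```hcl
  group "worker" {
    # At most one in-place restart, then mark the allocation failed.
    restart {
      attempts = 1
      interval = "10m"
      delay    = "15s"
      mode     = "fail"
    }

    # A couple of reschedule attempts with a delay, then give up for good.
    reschedule {
      attempts  = 2
      interval  = "1h"
      delay     = "30s"
      unlimited = false
    }

    task "run" {
      # Time allowed between SIGTERM and SIGKILL on shutdown;
      # this does not cap total runtime.
      kill_timeout = "30s"
      # ... driver, config, and resources as in the earlier example ...
    }
  }
```

Setting restart mode to "fail" rather than "delay" is what keeps a misbehaving child from restarting in place indefinitely; the reschedule stanza then decides whether it gets tried again somewhere else.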