Detecting Resource Exhaustion / Placement Failure

Is there a datadog or prometheus metric emitted by either the
Nomad server a/o client for detecting and alerting on when a deployment fails due to resource exhaustion?

The statsD metrics don’t really seem to capture a running job that is failing to deploy:

Datadog - Nomad

(gauge) Number of allocations blocked for a client
*Shown as job*
(gauge) Number of allocations pending for a client
*Shown as job*
(gauge) Number of allocations running for a client
*Shown as job*
(gauge) Number of allocations terminated for a client
*Shown as job*
1 Like

Same here! I have Datadog monitoring any pending jobs and for some reason all I can see is this:

Nomad is not reporting the total at all times… I had 2 jobs in pending mode for several hours during that period and I was expecting the count to be 2, not 0.

We found that we want to track the pending status of the allocation not the running job. That has shown to be a pretty reliable indicator of job placement failure as the running job will fail if it has not healthy and running allocations after all the restart attempts or placement failures have been exhausted.

1 Like

Can you explain a little more about how you are doing that?