Detecting Resource Exhaustion / Placement Failure

Is there a datadog or prometheus metric emitted by either the
Nomad server a/o client for detecting and alerting on when a deployment fails due to resource exhaustion?

The statsD metrics don’t really seem to capture a running job that is failing to deploy:

Datadog - Nomad

**nomad.client.allocations.blocked**
(gauge) Number of allocations blocked for a client
*Shown as job*
**nomad.client.allocations.pending**
(gauge) Number of allocations pending for a client
*Shown as job*
**nomad.client.allocations.running**
(gauge) Number of allocations running for a client
*Shown as job*
**nomad.client.allocations.terminal**
(gauge) Number of allocations terminated for a client
*Shown as job*
1 Like

Same here! I have Datadog monitoring any pending jobs and for some reason all I can see is this:

Nomad is not reporting the total at all times… I had 2 jobs in pending mode for several hours during that period and I was expecting the count to be 2, not 0.