Hello!
My goal is to send a notification when a job dies. To do this, I scrape the Prometheus-formatted metrics from the Nomad metrics endpoint and ingest them into my monitoring software (New Relic).
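For context, this is roughly how I pull the metrics today. It is only a minimal Python sketch: the agent address is a placeholder for my setup, and it assumes the agent has prometheus_metrics enabled in its telemetry stanza.

```python
import requests

# Placeholder address for my Nomad agent (no ACL token on this test setup).
NOMAD_ADDR = "http://127.0.0.1:4646"

# Nomad serves Prometheus-formatted metrics from /v1/metrics?format=prometheus.
resp = requests.get(f"{NOMAD_ADDR}/v1/metrics",
                    params={"format": "prometheus"}, timeout=5)
resp.raise_for_status()

# Print only the job-related series so I can see which labels each one carries.
for line in resp.text.splitlines():
    if line.startswith("nomad_nomad_job_"):
        print(line)
```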
Looking at the server metrics, I found the metric called nomad_nomad_job_status_dead, but it doesn't say which job is dead, so I discarded it: my test jobs would be mixed in with the production ones and my alerts would fire false positives.
I also see the following metrics:
nomad_nomad_job_summary_complete
nomad_nomad_job_summary_failed
nomad_nomad_job_summary_lost
nomad_nomad_job_summary_queued
nomad_nomad_job_summary_running
nomad_nomad_job_summary_unknown
All of them accumulate a count for these states and carry labels for the namespace, task group, etc., but there is no equivalent for dead jobs.
I tried focusing on nomad_nomad_job_summary_failed to count crashes, but since it is a historical accumulation it doesn't help me generate an alert in real time, and the counter never seems to reset.
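The only workaround I can think of is to compute the increase between two scrapes myself, roughly like the sketch below (the address and the 60-second interval are just assumptions for illustration):

```python
import time
import requests

NOMAD_ADDR = "http://127.0.0.1:4646"  # placeholder address for my agent
METRIC = "nomad_nomad_job_summary_failed"

def scrape_failed():
    """Return {labels: value} for every nomad_nomad_job_summary_failed series."""
    text = requests.get(f"{NOMAD_ADDR}/v1/metrics",
                        params={"format": "prometheus"}, timeout=5).text
    series = {}
    for line in text.splitlines():
        if line.startswith(METRIC + "{"):
            labels, value = line.rsplit(" ", 1)
            series[labels] = float(value)
    return series

previous = scrape_failed()
while True:
    time.sleep(60)  # assumed scrape interval, not what New Relic actually uses
    current = scrape_failed()
    for labels, value in current.items():
        # Only the increase since the previous scrape is useful for a
        # real-time alert; the raw value keeps growing and never resets.
        if value > previous.get(labels, 0.0):
            print(f"failed count increased for {labels}")
    previous = current
```

But this feels like reimplementing what the monitoring side should already do, so I would rather find a metric that identifies dead jobs directly.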
Which metric should I base my alert on to detect a dead job and also identify it (nomad_nomad_job_status_dead has no namespace, task group, or other identifying labels)?
Thanks a lot!