How can I notice a failed job from the metrics?

lmontesoro · November 7, 2022, 9:39pm

Hello!

My goal is to send a notification when a job goes down. To do this, I pull the Prometheus-formatted metrics from the Nomad endpoint and then consume them in my monitoring software (New Relic).

When I looked at the server metrics, I saw a metric called ‘nomad_nomad_job_status_dead’, but it doesn’t say which job is dead. I discarded it because test jobs would be mixed in with the production ones and I would get false positives in my alerts.
I also see the following metrics:
nomad_nomad_job_summary_complete
nomad_nomad_job_summary_failed
nomad_nomad_job_summary_lost
nomad_nomad_job_summary_queued
nomad_nomad_job_summary_running
nomad_nomad_job_summary_unknown

Each of them carries a count for its state, along with labels such as namespace and task group, but none of them covers dead jobs.

I tried to focus on nomad_nomad_job_summary_failed to count the crashes, but since it is a historical accumulation, it does not help me generate an alert in real time, and the counter does not seem to be reset.
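One way to get a real-time signal out of an ever-growing count is to compare two consecutive scrapes and alert only on the difference. The sketch below is a minimal illustration of that idea, not something taken from Nomad or New Relic: `new_failures` and the `(job, task_group)` keys are hypothetical names, and it assumes you have already extracted the nomad_nomad_job_summary_failed values from each scrape into a dict.

```python
def new_failures(previous, current):
    """Return how much each failed count grew between two scrapes.

    previous, current: dicts mapping a (job, task_group) tuple to the
    cumulative value of nomad_nomad_job_summary_failed at that scrape.
    Both the function name and the key shape are assumptions.
    """
    deltas = {}
    for key, value in current.items():
        before = previous.get(key)
        if before is None:
            # First time this series is seen: no baseline to compare with.
            continue
        if value < before:
            # The count went down, so assume it was reset and treat the
            # current value as the growth since the reset.
            delta = value
        else:
            delta = value - before
        if delta > 0:
            deltas[key] = delta
    return deltas


# Hypothetical values from two consecutive scrapes.
earlier = {("api", "web"): 7, ("batch", "worker"): 2}
later = {("api", "web"): 9, ("batch", "worker"): 2}
alerts = new_failures(earlier, later)  # only ("api", "web") grew
```

If the metrics pass through a Prometheus server first, its `increase()` function over a time window expresses the same idea in a query.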

Which metric should I rely on to detect a dead job and also identify it? (nomad_nomad_job_status_dead has no namespace or task group information, etc.)

Thanks a lot!

SunSparc · November 10, 2022, 2:33am

Your question intrigued me, so I started looking at the metrics as well. While browsing through them I stopped on the metric named nomad_nomad_job_summary_running. It has a label identifying the specific job, and its value appears to be the number of running allocations, so it reads 1 for a job with a single allocation. You could perhaps create an alert that tracks this metric for the job you want to watch.

nomad_nomad_job_summary_running{exported_job="my-important-job"}

If the metric value falls below 1 at any time, perhaps that would indicate a failure and then could trigger a notification.
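If you consume the Prometheus-formatted text directly rather than querying it, the same check could be sketched roughly as below. This is only an illustration: `parse_samples`, `running_by_group` and `is_down` are hypothetical helpers, the regex parser handles simple sample lines only, and the label name is a parameter because `exported_job` is what I see after Prometheus scrapes the endpoint, while the raw endpoint output may use a different label name.

```python
import re

# Minimal matcher for lines such as: metric_name{label="value",...} 1
SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)'
)
LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def parse_samples(text):
    """Yield (name, labels, value) for each sample line, skipping comments."""
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = SAMPLE_RE.match(line)
        if not match:
            continue
        labels = dict(LABEL_RE.findall(match.group("labels") or ""))
        yield match.group("name"), labels, float(match.group("value"))


def running_by_group(text, job, job_label="exported_job"):
    """Map task_group -> running value for one job (empty if no series)."""
    groups = {}
    for name, labels, value in parse_samples(text):
        if name == "nomad_nomad_job_summary_running" and labels.get(job_label) == job:
            groups[labels.get("task_group", "")] = value
    return groups


def is_down(text, job, job_label="exported_job"):
    """True if the job has no series at all or any task group is below 1."""
    groups = running_by_group(text, job, job_label)
    return not groups or any(value < 1 for value in groups.values())
```

Treating a missing series as "down" is a design choice in this sketch; whether that is right depends on how your jobs are registered and removed.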

lmontesoro · November 15, 2022, 7:13pm

Hi @SunSparc! It’s a good approach, thanks! But if the job has more than one allocation, I need to know the expected count ahead of time and store it in order to detect a crash. Ideally, the metric ‘nomad_nomad_job_status_dead’ would carry the same level of detail that ‘nomad_nomad_job_summary_running’ does.
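To make the bookkeeping concrete, here is a rough sketch of what I mean by storing the expected count. The job names, the `EXPECTED` table and the `under_replicated` function are all hypothetical; the expected numbers would have to be maintained by hand or derived from each job's group `count`.

```python
# Hypothetical table of how many running allocations each job should have.
EXPECTED = {
    "my-important-job": 3,
    "billing": 1,
}


def under_replicated(running, expected):
    """Return {job: (running, wanted)} for jobs below their expected count.

    running: dict mapping job name to the current value of
    nomad_nomad_job_summary_running. Jobs missing from it count as 0.
    """
    short = {}
    for job, wanted in expected.items():
        have = running.get(job, 0)
        if have < wanted:
            short[job] = (have, wanted)
    return short


# Example with made-up values: one allocation of my-important-job is gone.
current = {"my-important-job": 2, "billing": 1}
problems = under_replicated(current, EXPECTED)
```

The drawback is the one described above: the table has to be kept in sync with the job definitions, which a more detailed dead-job metric would make unnecessary.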
