Nomad periodic job metrics

vsoloviov · November 27, 2020, 1:15pm

Hello,
I’m trying to configure Prometheus alerts for Nomad jobs. I’m fine with generic jobs, but I’m stuck with periodic jobs.

I would like to configure an alert when a periodic job is running longer than expected. For example, it could hang, so someone should be notified to check it.

I’ve seen something similar (https://github.com/sepulworld/deadman-check), but maybe there is a way to configure it natively, without third-party tools?

It would be awesome to have a metric for job run duration. Would it be a good idea to open an enhancement proposal at https://github.com/hashicorp/nomad/issues for this?

Thank you!

lgfa29 · November 27, 2020, 2:34pm

Hi @vsoloviov,

I think you would be able to get the value you are looking for with a query like this:

delta(timestamp(nomad_nomad_job_summary_running{exported_job=~"example/.*"} > 0)[1d:])

Here’s an example of the plotted output:

I used this simple parameterized job as an example:

job "example" {
  datacenters = ["dc1"]
  type        = "batch"

  parameterized {
    meta_required = ["name"]
  }

  group "greeter" {
    task "greeter" {
      driver = "exec"

      config {
        command = "/bin/bash"
        args    = ["-c", "echo 'Hello ${NOMAD_META_name}!' && sleep 120"]
      }
    }
  }
}

In the plot, notice that the first two runs used sleep 120, and for the third run I increased that to sleep 300. If I expected my job to finish in 120s, I think this query would trigger an alert.

The only caveat is that you will need to set a range for the subquery (I used [1d:] in my example). Since you mentioned this is a periodic job, I hope this won’t be a problem for you.
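To make the pieces of the query easier to follow, here is the same expression built up step by step. This is only an explanatory sketch; the result is approximate, because delta() extrapolates toward the edges of the range and the subquery is evaluated at the default resolution.

```promql
# 1. Keep only the samples where the job has running allocations.
nomad_nomad_job_summary_running{exported_job=~"example/.*"} > 0

# 2. Replace each remaining sample value with its timestamp (Unix seconds).
timestamp(nomad_nomad_job_summary_running{exported_job=~"example/.*"} > 0)

# 3. Evaluate step 2 as a subquery over the last day and take the difference
#    between the last and the first timestamp: roughly how many seconds the
#    job has been in the running state within that window.
delta(timestamp(nomad_nomad_job_summary_running{exported_job=~"example/.*"} > 0)[1d:])
```

The [1d:] range has to be at least as long as the longest run you want to measure, since anything older than the window is not counted.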

That being said, a metric to track the duration of a batch job does sound useful, so we would be happy if you could open an issue for this feature request.

vsoloviov · November 27, 2020, 3:45pm

Awesome, so by multiplying by the same metric, I can get the duration of currently running jobs, like this:

delta(timestamp(nomad_nomad_job_summary_running{periodic_id=~".+"} > 0)[1d:]) * nomad_nomad_job_summary_running{periodic_id=~".+"} > 0

(I assume nomad_nomad_job_summary_running for periodic jobs can’t be more than 1, so it is always 1 or 0.)
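As a rough sketch, the query above could be wired into a Prometheus alerting rule like the one below. The group name, alert name, severity label, and annotation text are hypothetical, and the 120-second threshold is just the expected runtime from the example job; it should be adjusted per job. It also relies on the assumption above that the metric is 0 or 1 for a periodic job, otherwise the product would overstate the duration.

```yaml
groups:
  - name: nomad-periodic-jobs                  # hypothetical group name
    rules:
      - alert: NomadPeriodicJobRunningTooLong  # hypothetical alert name
        # Duration (in seconds) of currently running periodic jobs,
        # compared against the expected runtime of 120s.
        expr: |
          delta(timestamp(nomad_nomad_job_summary_running{periodic_id=~".+"} > 0)[1d:])
            * nomad_nomad_job_summary_running{periodic_id=~".+"}
            > 120
        labels:
          severity: warning                    # hypothetical label
        annotations:
          summary: "Periodic job {{ $labels.periodic_id }} has been running for more than 120s"
```

Multiplying by the current value of the metric keeps the alert from firing for runs that have already finished but still fall inside the one-day subquery window.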

Thank you so much!
