Nomad periodic job metrics

Hello,
I’m trying to configure Prometheus alerts for Nomad jobs. I’m fine with generic jobs, but I’m stuck with periodic jobs.

I would like to configure an alert when a periodic job is running longer than expected. For example, it could hang, so someone should be notified to check it.

I’ve seen something similar (https://github.com/sepulworld/deadman-check), but maybe there is a way to configure this natively, without third-party tools?

It would be awesome to have a metric for job run duration. Would it be a good idea to open an enhancement proposal at https://github.com/hashicorp/nomad/issues?

Thank you!

Hi @vsoloviov,

I think you would be able to get the value you are looking for with a query like this:

delta(timestamp(nomad_nomad_job_summary_running{exported_job=~"example/.*"} > 0)[1d:])

Here’s an example of the plotted output:

I used this simple parameterized job as an example:

job "example" {
  datacenters = ["dc1"]
  type        = "batch"

  parameterized {
    meta_required = ["name"]
  }

  group "greeter" {
    task "greeter" {
      driver = "exec"

      config {
        command = "/bin/bash"
        args    = ["-c", "echo 'Hello ${NOMAD_META_name}!' && sleep 120"]
      }
    }
  }
}

Notice in the plot that the first two runs had sleep 120, and for the third run I increased that to sleep 300. If I expected my job to finish in 120s, I think an alert on this query with a threshold of 120 would fire for the third run.

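The only caveat is that you will need to set a range for the subquery (I used [1d:] in my example), and the range has to be at least as long as the longest run you want to measure, because the result can never exceed the window. Since you mentioned this is a periodic job, I hope this won’t be a problem for you.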

That being said, a metric to track the duration of a batch job does sound useful, so we would be happy if you could open an issue for this feature request 😄


Awesome! So by multiplying by the same metric, I can get the duration of only the jobs that are currently running, like this:

delta(timestamp(nomad_nomad_job_summary_running{periodic_id=~".+"} > 0)[1d:]) * nomad_nomad_job_summary_running{periodic_id=~".+"} > 0

(I assume nomad_nomad_job_summary_running for a periodic job can’t be more than 1, so it is always 0 or 1.)
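For reference, here is a minimal sketch of how the query above might be wrapped in a Prometheus alerting rule. The group name, alert name, severity label, the 1m "for" duration and the 120-second threshold are all placeholders I picked for illustration (120 matches the sleep 120 example earlier in the thread); only the metric name and the periodic_id matcher come from the queries above. It also relies on the same assumption that the running value is 0 or 1.

```yaml
groups:
  - name: nomad-periodic-jobs            # hypothetical group name
    rules:
      - alert: NomadPeriodicJobRunningTooLong   # hypothetical alert name
        # Same expression as above, but with the final "> 0" replaced by the
        # expected maximum duration in seconds. "*" binds tighter than ">",
        # so this reads as (duration * running) > 120. Finished jobs have
        # running == 0, so their product is 0 and they are filtered out.
        expr: |
          delta(timestamp(nomad_nomad_job_summary_running{periodic_id=~".+"} > 0)[1d:])
            * nomad_nomad_job_summary_running{periodic_id=~".+"} > 120
        for: 1m                          # assumed grace period before firing
        labels:
          severity: warning              # assumed label, adjust to your routing
        annotations:
          summary: "A Nomad periodic job has been running for {{ $value | humanizeDuration }}"
```

If different periodic jobs have different expected durations, one rule per job with a narrower label matcher and its own threshold is probably the simplest option.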

Thank you so much!
