Hello,
I’m trying to configure Prometheus alerts for Nomad jobs. I’m fine with generic jobs, but I’m stuck with periodic jobs.
I would like to configure an alert when a periodic job is running longer than expected. For example, it could hang, so someone should be notified to check it.
Notice that the first two runs used sleep 120, and for the third run I increased that to sleep 300. If I expected my job to finish in 120s, I think this query would trigger an alert.
The only caveat is that you will need to set a range for the subquery (I used [1d:] in my example). Since you mentioned this is a periodic job, I hope this won’t be a problem for you.
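
To make the idea concrete, here is a rough Python sketch of the logic such an alert relies on. It is not the query itself and does not use any real Nomad metric names. It assumes a hypothetical series of 0/1 "job is running" samples taken at a fixed step over the lookback window, and it fires when the current unbroken run is longer than expected. The function names, the 30s step, and the sample data are all made up for illustration.

```python
from typing import Sequence


def current_run_seconds(running: Sequence[int], step_seconds: int) -> int:
    """Return how long the job has been running without a break,
    measured backwards from the most recent sample in the window."""
    count = 0
    for value in reversed(running):
        if value <= 0:
            break  # the job was not running at this sample, so the run starts after it
        count += 1
    return count * step_seconds


def should_alert(running: Sequence[int], step_seconds: int, expected_seconds: int) -> bool:
    """True when the current run has lasted longer than the expected duration."""
    return current_run_seconds(running, step_seconds) > expected_seconds


# Hypothetical samples, one every 30 seconds (assumed step).
finished_run = [0, 1, 1, 1, 1, 0]   # ran for about 120s, then stopped
long_run = [0] + [1] * 10           # still running after about 300s

print(should_alert(finished_run, 30, 120))  # False
print(should_alert(long_run, 30, 120))      # True
```

The window has the same limitation as the subquery range: it only needs to be longer than the longest run you want to detect, which is why a fixed range is usually acceptable for a periodic job.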
That being said, a metric to track the duration of a batch job does sound useful, so we would be happy if you could open an issue for this feature request.