We’re having some real difficulties with Nomad. Our service jobs seem fine, and seem to self-heal, but periodic jobs? It seems really difficult to determine after the fact whether they actually ran. We have a number of jobs that run overnight, and sometimes two instances have run simultaneously (even with prohibit_overlap set to true), some don’t show any allocations, and some run only every other day (even when the task ends after ~1 hour…). What do other people do to monitor their periodic jobs? I can’t control all of the jobs themselves, so changing the jobs to call a webhook when they complete doesn’t work. We do have centralized logging, but I’m not going to dig through that every day for every job to see if it thinks it ran successfully (and some of those jobs don’t have much in the way of logging anyway). Do we need to just dump the current status every few minutes and write a program to analyse it and determine whether a job didn’t run on time? Is there a way to do this in the GUI that I’m not seeing? Is there a historical job-run log I don’t see? Even an API or CLI would do: I can wrap it in a program called from the system crontab (!) to get the info out of Nomad.
I’ve also started looking into this recently. Any insight would be helpful as a starting point for my investigation.
Hi. I posted about this in “Nomad Job Launches UI empty / no history of periodic jobs - #2 by Kamilcuk”.
Basically, run `nomad operator api /v1/event/stream > file` in the background and then parse the file with some Python script. The event stream will give you all the information, and if some information is missing, you can detect that situation and send an alert. As an example stack, Grafana Loki has an `absent_over_time` function and support for JSON parsing, and Grafana can send alerts based on Grafana Loki queries.
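To give an idea of what the parsing script could look like, here is a minimal sketch, not a tested production tool. It assumes the file contains one JSON object per line with an `Events` list, that job events have `Topic` set to `Job` and the job ID in `Key`, and that each launch of a periodic job is registered as a child job named `<parent>/periodic-<unix timestamp>`. Please check those assumptions against your own stream output. The function names and the `expected` mapping are made up for illustration.

```python
import json
from typing import Dict, Iterable, List

# Assumption: child jobs launched by a periodic job are named
# "<parent id>/periodic-<unix timestamp>".
PERIODIC_MARKER = "/periodic-"


def last_launches(lines: Iterable[str]) -> Dict[str, int]:
    """Return the newest launch timestamp seen for each periodic parent job."""
    latest: Dict[str, int] = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            batch = json.loads(line)
        except json.JSONDecodeError:
            continue  # e.g. a line that is still being written
        if not isinstance(batch, dict):
            continue
        # Heartbeat lines are empty objects, so "Events" may be missing.
        for event in batch.get("Events") or []:
            if event.get("Topic") != "Job":
                continue
            parent, sep, suffix = str(event.get("Key", "")).rpartition(PERIODIC_MARKER)
            if not sep or not suffix.isdigit():
                continue  # not a periodic child job
            latest[parent] = max(latest.get(parent, 0), int(suffix))
    return latest


def overdue(latest: Dict[str, int], expected: Dict[str, int], now: int) -> List[str]:
    """List jobs whose last launch is older than their allowed age in seconds.

    Jobs that never appear in the stream count as overdue.
    """
    return sorted(
        job for job, max_age in expected.items()
        if now - latest.get(job, 0) > max_age
    )
```

You could call it from cron with something like `overdue(last_launches(open("file")), {"backup": 90000}, int(time.time()))` and send an alert if the returned list is not empty. This only tells you that a launch was registered, not that the task succeeded; for that you would also have to look at the allocation events in the same stream.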