Monitor time taken for Job scheduling

I have a question about Nomad telemetry metrics. Occasionally a job takes a long time to get a running allocation, and I would like to monitor the time from when a job is submitted to when an allocation for that job is running. After going through the scheduling documentation, I understood the following:

  • An evaluation is created when a job is registered.
    The evaluation goes to the Eval Broker, which resides on the leader node, and is put in a queue.
  • The Eval Broker sends that evaluation to the correct scheduler.
  • The scheduler generates an allocation plan. It also does feasibility checking and ranking to decide which node the allocation should go to, and then adds that placement to the allocation plan.
  • Once the plan is complete, the scheduler submits it to the plan queue on the leader, where it is eventually processed.

With telemetry enabled, I see many metrics that report the time taken by individual scheduling calls. However, it is not straightforward to tell which metric or metrics should be used, as there are quite a few and, from the documentation, some of them seem to overlap. I believe more than one metric has to be summed to get the total duration.

What is also confusing is that most of the metrics don't have values. For example, nomad_nomad_job_status_running shows that we had 300 running jobs. I assume that since we had 300 running jobs in the past day, these jobs had to be registered. However, when I look at nomad_nomad_job_register I get an empty graph with no values.

Any feedback would be greatly appreciated and thanks in advance.

Hi @kfenech1! Thanks for using Nomad!

According to the documentation, “We do not currently surface metrics for job and task/allocation status, although we will consider adding metrics where it makes sense.”

That said, I’m wondering if monitoring the event stream for your specific job will get you the information you are looking for. Does this fit your use case, and is it a viable option for you?
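
If it helps, here is a rough sketch of how the event stream could be used to measure this. It reads the newline-delimited JSON that /v1/event/stream returns and computes the time between a JobRegistered event and the first allocation event for that job with a "running" client status. The function names (first_running_latency, stream_lines) are hypothetical, and I am assuming that the job payload carries SubmitTime and the allocation payload carries ModifyTime, both in nanoseconds, so please check the field names against the payloads your cluster actually emits.

```python
import json
import urllib.request
from typing import Iterable, Iterator, Optional

NANOS_PER_SECOND = 1_000_000_000


def first_running_latency(lines: Iterable[str], job_id: str) -> Optional[float]:
    """Seconds from job submission to the first running allocation, or None.

    Assumes (not confirmed in this thread) that Job.SubmitTime and
    Allocation.ModifyTime are Unix timestamps in nanoseconds.
    """
    submit_ns = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        batch = json.loads(line)
        # Heartbeat frames are empty objects, so "Events" may be missing.
        for event in batch.get("Events") or []:
            payload = event.get("Payload") or {}
            if event.get("Topic") == "Job" and event.get("Type") == "JobRegistered":
                job = payload.get("Job") or {}
                if job.get("ID") == job_id:
                    submit_ns = job.get("SubmitTime")
            elif event.get("Topic") == "Allocation" and submit_ns is not None:
                alloc = payload.get("Allocation") or {}
                if alloc.get("JobID") == job_id and alloc.get("ClientStatus") == "running":
                    return (alloc["ModifyTime"] - submit_ns) / NANOS_PER_SECOND
    return None


def stream_lines(url: str) -> Iterator[str]:
    """Yield decoded lines from a long-lived HTTP response (hypothetical helper)."""
    with urllib.request.urlopen(url) as resp:
        for raw in resp:
            yield raw.decode("utf-8")
```

Against a live agent this could be called as first_running_latency(stream_lines("http://127.0.0.1:4646/v1/event/stream?topic=Job&topic=Allocation"), "my-job"), where the address and the job name are placeholders. The stream has to be open before the job is submitted, otherwise the JobRegistered event is missed. The result could then be pushed to your metrics system as a gauge or histogram.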