Nomad job troubleshooting


Today I had an issue in which a long-running job became stuck at “queued” and “dead”. This issue clearly showed me that I don’t have much experience troubleshooting Nomad. I thought I would simply learn as I go, but failures in Nomad are so rare that I haven’t even looked at my jobs in a long time (a year?). The fix for today’s issue (which is how I always fix issues) was to use Terraform to taint the job and re-apply it. (I believe this is the equivalent of stopping and starting the job.)

Since failures are so few and far between, I think I need to put together a generic troubleshooting document. What exactly should I be checking for? In today’s example, when I ran “nomad status” for the job, the output looked like this:

ID            = shipping-rates
Name          = shipping-rates
Submit Date   = 11/11/19 21:00:08 UTC
Type          = service
Priority      = 50
Datacenters   = my-data-center
Status        = dead
Periodic      = false
Parameterized = false

Task Group      Queued  Starting  Running  Failed  Complete  Lost
shipping-rates  2       0         0        0       0         0

No allocations placed
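
The summary table is worth reading closely: a task group with a nonzero Queued count and nothing Starting or Running means the scheduler wants allocations but has not placed any. A minimal Python sketch of that check, assuming the plain-text table layout shown above and task group names without spaces (the function name `stuck_groups` is hypothetical, not part of any Nomad tooling):

```python
def stuck_groups(status_output):
    """Return {task_group: queued} for groups with queued work but nothing starting or running."""
    stuck = {}
    in_table = False
    for line in status_output.splitlines():
        if line.startswith("Task Group"):
            in_table = True  # the rows after this header are the per-group counts
            continue
        if not in_table:
            continue
        fields = line.split()
        if not fields:
            break  # a blank line ends the summary table
        name, counts = fields[0], fields[1:]
        # Only the first three columns (Queued, Starting, Running) are needed here.
        if len(counts) < 3 or not all(c.isdigit() for c in counts[:3]):
            continue
        queued, starting, running = (int(c) for c in counts[:3])
        if queued > 0 and starting == 0 and running == 0:
            stuck[name] = queued
    return stuck
```

Fed the output above, this would flag shipping-rates with 2 queued allocations, which is the cue to look at evaluations and placement next.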

This is where I got stuck. I did not know what to check next. I did some googling and found suggestions involving commands that take eval and alloc IDs, which I did not have. I think I need to find the following information:

  • Why did Nomad change anything? The service was running, Nomad decided some action needed to be taken, and now the service is not running. So question 1 is: why did Nomad decide to change anything?
  • When did the problem begin? I assume this is directly tied to the first question: the moment Nomad decided to make a change is probably when the issue occurred.
  • Why was the change unsuccessful? In today’s example I assume the jobs were stuck at queued until some timeout, which led to the dead status.

At this time we have a very small number of jobs running under Nomad. I also have a complete dev environment for trying things out. I look forward to seeing your suggestions on how you troubleshoot your own weird issues.


With your Google research did you come across the following links?

Thank you, those are helpful.

I’m still struggling to find out what happened. So far I have determined that evaluations are probably what I’m looking for. If I run nomad job status -evals, I get a list of eval IDs. When I run nomad eval-status for those IDs, I see some information but no dates, so I can’t tell whether any of these evals are recent. They all say “TriggeredBy node-update”, so that sounds like a start. I tried googling that term but could not find out what it means.
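
For what it’s worth, “node-update” as a trigger means the evaluation was created because the status of a client node changed (for example, a node went down or was drained), not because the job itself was changed. To put the evaluations in order, one option is to read them from the HTTP API (GET /v1/job/<job>/evaluations) instead of the CLI. A minimal Python sketch, assuming the response has been saved as JSON and that each evaluation carries ID, TriggeredBy, Status, and CreateIndex. The CreateTime field (nanoseconds since the epoch) is an assumption that may only hold on newer Nomad versions, so the code treats it as optional. The function name `eval_timeline` is hypothetical:

```python
import json
from datetime import datetime, timezone

def eval_timeline(evals_json):
    """Return one summary line per evaluation, oldest first."""
    evals = json.loads(evals_json)
    # CreateIndex is the Raft index at creation, so it orders evals even without timestamps.
    evals.sort(key=lambda e: e.get("CreateIndex", 0))
    rows = []
    for e in evals:
        ns = e.get("CreateTime")  # assumed field: nanoseconds since the epoch, may be absent
        if ns:
            when = datetime.fromtimestamp(ns // 1_000_000_000, tz=timezone.utc)
            when = when.strftime("%Y-%m-%d %H:%M:%S UTC")
        else:
            when = "unknown time"
        rows.append(
            f'{when}  {e["ID"][:8]}  {e.get("TriggeredBy", "?")}  {e.get("Status", "?")}'
        )
    return rows
```

Even when no timestamps are available, the ordering by index shows which trigger came last before the job stopped running.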

I have had a second job fail today in exactly the same way. As the output of nomad job status below shows, there are no evals or allocations. I don’t know how to get more information without them.

nomad job status -evals -verbose projects-overview
ID            = projects-overview
Name          = projects-overview
Submit Date   = 10/21/19 13:44:48 UTC
Type          = service
Priority      = 50
Datacenters   = mydc
Status        = dead
Periodic      = false
Parameterized = false

Task Group         Queued  Starting  Running  Failed  Complete  Lost
projects-overview  1       0         0        0       0         0

ID  Priority  Triggered By  Status  Placement Failures

No allocations placed

I have had another incident today that affected every job on the server. In the Nomad logs I found several of these:

worker: dequeued evaluation cb9fb2ae-870c-446e-a98d-b5d27fe421f4

[DEBUG] worker: nack for evaluation 6a8e77d4-f097-584c-06cb-760fdad3cf88

When I run nomad eval-status for these IDs, I get “No evaluation(s) with prefix or id”.

I really need to figure out what’s going on here.
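
One way to see whether the same evaluations keep being handed back is to tally the worker messages per evaluation ID. A minimal Python sketch, assuming log lines shaped like the two quoted above (the function name `tally_evals` is hypothetical, and other log formats would need a different pattern):

```python
import re
from collections import Counter

# Matches the two worker messages quoted above and captures the evaluation UUID.
EVAL_RE = re.compile(
    r"worker: (dequeued evaluation|nack for evaluation) ([0-9a-f-]{36})"
)

def tally_evals(log_lines):
    """Count dequeue and nack events per evaluation ID."""
    dequeued, nacked = Counter(), Counter()
    for line in log_lines:
        m = EVAL_RE.search(line)
        if not m:
            continue  # not a worker dequeue/nack line
        kind, eval_id = m.groups()
        if kind.startswith("nack"):
            nacked[eval_id] += 1
        else:
            dequeued[eval_id] += 1
    return dequeued, nacked
```

An ID that shows up with repeated nacks was most likely returned to the queue without being processed successfully, so the log lines just before each nack are the place to look for the underlying error.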