Nomad job troubleshooting

Hello,

Today I had an issue in which a long-running job became stuck with its allocations “queued” and its status “dead”. This clearly showed me that I don’t have much experience troubleshooting Nomad. I thought I would simply learn as I go, but failures in Nomad are so rare that I haven’t even looked at my jobs in a long time (a year?). The fix for today’s issue (which is how I always fix issues) was to use Terraform to taint the job and re-apply it. (I believe this is the equivalent of stopping and starting the job.)

Since failures are so few and far between, I think I need to put together a generic troubleshooting document. What exactly should I be checking for? In today’s example, when I ran “nomad status ” for the job, the output looked like this:

ID            = shipping-rates
Name          = shipping-rates
Submit Date   = 11/11/19 21:00:08 UTC
Type          = service
Priority      = 50
Datacenters   = my-data-center
Status        = dead
Periodic      = false
Parameterized = false

Summary
Task Group      Queued  Starting  Running  Failed  Complete  Lost
shipping-rates  2       0         0        0       0         0

Allocations
No allocations placed
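
For a troubleshooting document, one possible first check is to flag any task group that has queued work but nothing running. The following is only a rough sketch: it assumes the plain-text layout shown above (a "Summary" line, a header row, then one row per task group), and the function names `parse_summary` and `stuck_groups` are made up for illustration.

```python
def parse_summary(status_text):
    """Return {task_group: {column: count}} from the Summary table of `nomad status` output.

    Assumes the layout pasted above; other Nomad versions may print it differently.
    """
    lines = [line.strip() for line in status_text.splitlines()]
    if "Summary" not in lines:
        return {}
    start = lines.index("Summary")
    # The header begins with "Task Group"; the last six words are the count columns.
    header = lines[start + 1].split()[-6:]
    groups = {}
    for row in lines[start + 2:]:
        if not row:
            break  # a blank line ends the table
        parts = row.rsplit(None, len(header))
        name, counts = parts[0], parts[1:]
        groups[name] = dict(zip(header, (int(c) for c in counts)))
    return groups


def stuck_groups(groups):
    """Task groups with queued allocations and nothing starting or running."""
    return [
        name
        for name, c in groups.items()
        if c["Queued"] > 0 and c["Starting"] == 0 and c["Running"] == 0
    ]


sample = """Status        = dead

Summary
Task Group      Queued  Starting  Running  Failed  Complete  Lost
shipping-rates  2       0         0        0       0         0

Allocations
No allocations placed
"""

print(stuck_groups(parse_summary(sample)))
```

A group that shows up here with no allocations at all points at the scheduler (evaluations) rather than at the task itself, which is where the next checks would go.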

This is where I got stuck. I did not know what to check after that. I did some googling and found suggestions with commands that take eval and alloc IDs, which I did not have. I think I need to find the following information:

  • Why did Nomad change anything? The service was running, Nomad decided some action needed to be taken, and now the service is not running. So question 1: why did Nomad decide to change anything?
  • When did the problem begin? I assume this is directly tied to the first question. The moment Nomad decided to make a change is probably when the issue occurred.
  • Why was the change unsuccessful? In today’s example I assume the allocations were stuck at queued until some timeout, which led to the dead status.
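
One way to approach these three questions is to look at the job's evaluations in the order they were created. The sketch below assumes you have already fetched the evaluation list as JSON (for example from the HTTP API endpoint /v1/job/<job>/evaluations) and decoded it into a Python list. The field names (`ID`, `CreateIndex`, `TriggeredBy`, `Status`, `StatusDescription`, `FailedTGAllocs`) are the ones I believe the API returns, so treat them as an assumption to verify against your Nomad version. The sample data is invented.

```python
def summarize_evals(evals):
    """Order evaluations oldest-first and keep the fields that explain each decision.

    CreateIndex is used for ordering because it always increases,
    even when the output carries no timestamps.
    """
    rows = []
    for ev in sorted(evals, key=lambda e: e.get("CreateIndex", 0)):
        failed = ev.get("FailedTGAllocs") or {}
        rows.append({
            "id": ev.get("ID", "")[:8],                    # short ID, as the CLI prints it
            "index": ev.get("CreateIndex"),                # when (relative order)
            "triggered_by": ev.get("TriggeredBy"),         # why Nomad acted
            "status": ev.get("Status"),                    # did it work
            "description": ev.get("StatusDescription", ""),
            "failed_groups": sorted(failed),               # task groups that could not be placed
        })
    return rows


# Invented sample data, for illustration only.
sample_evals = [
    {"ID": "bbbbbbbb-0000-0000-0000-000000000002", "CreateIndex": 120,
     "TriggeredBy": "queued-allocs", "Status": "blocked",
     "FailedTGAllocs": {"shipping-rates": {"NodesEvaluated": 0}}},
    {"ID": "aaaaaaaa-0000-0000-0000-000000000001", "CreateIndex": 118,
     "TriggeredBy": "node-update", "Status": "complete", "FailedTGAllocs": None},
]

for row in summarize_evals(sample_evals):
    print(row["index"], row["id"], row["triggered_by"], row["status"], row["failed_groups"])
```

Read top to bottom, the first row suggests why and when something changed, and the first row with a non-empty `failed_groups` suggests where placement stopped working.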

At this time we have a very small number of jobs running under Nomad. I also have a complete dev environment for trying things out. I look forward to seeing your suggestions on how you troubleshoot your own weird issues.

With your Google research did you come across the following links?

Thank you, those are helpful.

I’m still struggling to find out what happened. So far I have determined that evaluations are probably what I’m looking for. If I run nomad job status -evals I get a list of eval IDs. When I run nomad eval-status I see some information but no dates, so I can’t tell whether any of these evals are recent. They all say “TriggeredBy node-update”, so that sounds like a start. I tried googling that term but was not able to find out what it means.
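
As a starting point for decoding the “TriggeredBy” field, here is a small lookup sketch. The trigger strings are ones I believe Nomad uses, but the descriptions are my own paraphrase rather than official wording, and the list is not complete. `explain_trigger` is a hypothetical helper name.

```python
# Trigger names as I understand them; descriptions are paraphrased, not official.
TRIGGERS = {
    "job-register": "the job was submitted or updated",
    "job-deregister": "the job was stopped",
    "node-update": "a client node's state changed (for example it went down or came back)",
    "node-drain": "a client node was set to drain",
    "periodic-job": "a periodic job launched a new instance",
    "alloc-failure": "a failed allocation is being rescheduled",
    "queued-allocs": "allocations that could not be placed earlier are being retried",
}


def explain_trigger(triggered_by):
    """Return a short explanation for an evaluation's TriggeredBy value."""
    return TRIGGERS.get(triggered_by, "unknown trigger; check the Nomad docs or source")


print(explain_trigger("node-update"))
```

If that reading of node-update is right, the next thing to check would be what happened to the client nodes around that time (restarts, network loss, missed heartbeats).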

I have had a second job fail today in exactly the same way. As the output of nomad job status below shows, there are no evals or allocations, and I don’t know how to get more information without them.

nomad job status -evals -verbose projects-overview
ID            = projects-overview
Name          = projects-overview
Submit Date   = 10/21/19 13:44:48 UTC
Type          = service
Priority      = 50
Datacenters   = mydc
Status        = dead
Periodic      = false
Parameterized = false

Summary
Task Group         Queued  Starting  Running  Failed  Complete  Lost
projects-overview  1       0         0        0       0         0

Evaluations
ID  Priority  Triggered By  Status  Placement Failures

Allocations
No allocations placed

I have had another incident today, and this one affected every job on the server. In the Nomad logs I found several lines like these:

worker: dequeued evaluation cb9fb2ae-870c-446e-a98d-b5d27fe421f4

[DEBUG] worker: nack for evaluation 6a8e77d4-f097-584c-06cb-760fdad3cf88
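
To see which evaluations the server keeps handing back, the log lines can be tallied per evaluation ID. This is a minimal sketch that matches only the two message shapes quoted above ("dequeued evaluation" and "nack for evaluation"); `eval_events` is a hypothetical helper, and my understanding that a nack means the worker returned the evaluation without completing it is an assumption worth confirming.

```python
import re
from collections import Counter, defaultdict

# Matches only the two worker messages quoted above, followed by a UUID.
EVAL_EVENT = re.compile(
    r"worker: (dequeued|nack for) evaluation "
    r"([0-9a-f]{8}(?:-[0-9a-f]{4}){3}-[0-9a-f]{12})"
)


def eval_events(log_lines):
    """Return {eval_id: {"dequeued": n, "nack": n}} for the matching log lines."""
    counts = defaultdict(Counter)
    for line in log_lines:
        match = EVAL_EVENT.search(line)
        if match:
            action = "nack" if match.group(1).startswith("nack") else "dequeued"
            counts[match.group(2)][action] += 1
    return {eval_id: dict(c) for eval_id, c in counts.items()}


log = [
    "worker: dequeued evaluation cb9fb2ae-870c-446e-a98d-b5d27fe421f4",
    "[DEBUG] worker: nack for evaluation 6a8e77d4-f097-584c-06cb-760fdad3cf88",
]

for eval_id, actions in eval_events(log).items():
    print(eval_id, actions)
```

An evaluation ID that collects several nacks would be the one to search for in the surrounding log lines, since the reason for the nack is usually logged nearby.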

When I run nomad eval-status with those IDs I get “No evaluation(s) with prefix or id”.

I really need to figure out what’s going on here.

@wstaples, did you ever figure out what was going on? I am also struggling to troubleshoot a job.

I would also be grateful if you could share how you knitted these together. In my case, the node-update error was caused by the node patching itself and rebooting at the OS level.