Nomad job troubleshooting

wstaples · March 9, 2020, 7:37pm

Hello,

Today I had an issue in which a long-running job became stuck at “queued” and “dead”. This issue clearly showed me that I don’t have a lot of experience troubleshooting issues in Nomad. I thought I would simply learn as I go, but failures in Nomad are so rare that I haven’t even looked at my jobs in a long time (a year?). The fix for today’s issue (which is how I always fix issues) is to use Terraform to taint the job and re-apply it. (I believe this is the equivalent of stopping and starting a job.)

Since failures are so few and far between, I think I need to put together a generic troubleshooting document. What exactly should I be checking for? In today’s example, when I ran “nomad status ”, it had output like:

ID            = shipping-rates
Name          = shipping-rates
Submit Date   = 11/11/19 21:00:08 UTC
Type          = service
Priority      = 50
Datacenters   = my-data-center
Status        = dead
Periodic      = false
Parameterized = false

Summary
Task Group      Queued  Starting  Running  Failed  Complete  Lost
shipping-rates  2       0         0        0       0         0

Allocations
No allocations placed

This is where I got stuck. I did not know what to check after that. I did some googling and found some suggestions with commands that take eval and alloc IDs, which I did not have. I think I need to find the following information:

Why did Nomad change anything? The service was running, Nomad decided some action needed to be taken, and now the service is not running. So question 1: why did Nomad decide to change anything?
When did the problem begin? I assume this is directly tied to the first question. When Nomad decided to make a change is probably when the issue occurred.
Why was the change unsuccessful? In today’s example I assume the jobs were stuck at queued until some timeout, which led to the dead status.
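As a starting point for the first and third questions, here is a minimal Python sketch that reads a job's evaluations from the Nomad HTTP API (`/v1/job/<job_id>/evaluations`) and reports what triggered each one and whether any task group failed placement. It assumes the API is reachable at `NOMAD_ADDR` without an ACL token, and that the server still holds evaluations for the job. The helper names `fetch_evaluations` and `describe_evaluation` are made up for this example.

```python
import json
import os
import urllib.request

# Assumption: the agent's HTTP API is reachable here. NOMAD_ADDR is the same
# variable the nomad CLI reads; 127.0.0.1:4646 is the usual local default.
NOMAD_ADDR = os.environ.get("NOMAD_ADDR", "http://127.0.0.1:4646")


def fetch_evaluations(job_id):
    """Return the evaluations Nomad still holds for a job (may be empty)."""
    url = f"{NOMAD_ADDR}/v1/job/{job_id}/evaluations"
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)


def describe_evaluation(ev):
    """Summarize one evaluation: what triggered it and whether placement failed."""
    # FailedTGAllocs maps task group name -> placement metrics; it is absent
    # or null when every allocation could be placed.
    failed = ev.get("FailedTGAllocs") or {}
    return {
        "id": ev.get("ID", "")[:8],
        "triggered_by": ev.get("TriggeredBy", "unknown"),
        "status": ev.get("Status", "unknown"),
        "description": ev.get("StatusDescription", ""),
        "failed_task_groups": sorted(failed),
    }
```

Calling `[describe_evaluation(ev) for ev in fetch_evaluations("shipping-rates")]` would give one row per evaluation. An empty list means the server no longer has any evaluations for the job, and the answer has to come from the server logs instead.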

At this time we have a very small number of jobs that run under Nomad. I also have a complete dev environment for trying things out. I look forward to seeing your suggestions on how you troubleshoot your own weird issues.

Wolfsrudel · March 10, 2020, 12:00am

With your Google research did you come across the following links?

wstaples · March 11, 2020, 8:30pm

Thank you, those are helpful.

wstaples · March 11, 2020, 8:36pm

I’m still struggling to find out what happens. So far I have determined that evaluations are probably what I’m looking for. If I run nomad job status -evals I get a list of eval IDs. When I run nomad eval-status I see some information but no dates, so I can’t tell if any of these evals are recent. They all say “TriggeredBy node-update”, so that sounds like a start. I tried googling that term but was not able to find out what it means.
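For the missing dates, one option is to read the evaluations as JSON from the HTTP API rather than from the CLI. The sketch below assumes each evaluation carries a `CreateTime` field in Unix nanoseconds, which may not be present on older Nomad versions, and falls back to ordering by `CreateIndex`. The function names are made up for this example.

```python
from datetime import datetime, timezone


def eval_time(ev):
    """Convert an evaluation's CreateTime (assumed Unix nanoseconds) to UTC."""
    ns = ev.get("CreateTime")
    if not ns:
        return None  # field missing or zero: no timestamp available
    # Whole seconds are enough to tell whether an evaluation is recent.
    return datetime.fromtimestamp(ns // 1_000_000_000, tz=timezone.utc)


def newest_first(evals):
    """Order evaluations by CreateIndex, highest (most recent) first."""
    return sorted(evals, key=lambda ev: ev.get("CreateIndex", 0), reverse=True)
```

Even without timestamps, the `CreateIndex` ordering at least shows which “node-update” evaluation came last.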

wstaples · March 19, 2020, 4:08pm

I have had a second job fail today in the exact same way. As you can see from the output of nomad job status, there are no evals or allocations. I don’t know how to get more information without these things.

nomad job status -evals -verbose projects-overview
ID            = projects-overview
Name          = projects-overview
Submit Date   = 10/21/19 13:44:48 UTC
Type          = service
Priority      = 50
Datacenters   = mydc
Status        = dead
Periodic      = false
Parameterized = false

Summary
Task Group         Queued  Starting  Running  Failed  Complete  Lost
projects-overview  1       0         0        0       0         0

Evaluations
ID  Priority  Triggered By  Status  Placement Failures

Allocations
No allocations placed

wstaples · March 24, 2020, 3:35pm

I have had another incident today that affected every job on the server. In the Nomad logs I found several of these:

worker: dequeued evaluation cb9fb2ae-870c-446e-a98d-b5d27fe421f4

[DEBUG] worker: nack for evaluation 6a8e77d4-f097-584c-06cb-760fdad3cf88

When I run nomad eval-status I get “No evaluation(s) with prefix or id”.

I really need to figure out what’s going on here.
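When eval-status no longer finds an evaluation, the server log lines may be the only record left. A nack means a scheduler worker handed the evaluation back to the broker without completing it, so counting nacks per evaluation ID shows which ones kept failing. Here is a minimal Python sketch that handles only the two message formats quoted above; other Nomad versions may log these differently, and the function name is made up for this example. The evaluation may have been garbage-collected by the time eval-status was run, but this thread does not confirm that.

```python
import re
from collections import Counter

# Matches only the two worker messages quoted in this post.
EVAL_LINE = re.compile(
    r"worker: (?P<event>dequeued evaluation|nack for evaluation) "
    r"(?P<id>[0-9a-f-]{36})"
)


def count_eval_events(lines):
    """Return {eval_id: Counter(dequeued=n, nack=m)} built from log lines."""
    events = {}
    for line in lines:
        match = EVAL_LINE.search(line)
        if not match:
            continue  # not one of the two worker messages
        kind = "nack" if match.group("event").startswith("nack") else "dequeued"
        events.setdefault(match.group("id"), Counter())[kind] += 1
    return events
```

Feeding it the lines of the server log file, for example `count_eval_events(open(path))`, would show evaluation IDs that were dequeued and nacked repeatedly.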

SunSparc · August 27, 2021, 4:54pm

@wstaples, did you ever figure out what was going on? I am struggling to troubleshoot a job also.

gary.bright · December 19, 2024, 2:30pm

I would also be grateful if you could share how you knitted these together. My node-update error was due to the node patching itself and rebooting at the OS level.
