Hello,
Today I had an issue in which a long-running job became stuck at “queued” and “dead”. It made clear to me that I don’t have much experience troubleshooting issues in Nomad. I thought I would simply learn as I go, but failures in Nomad are so rare that I haven’t even looked at my jobs in a long time (a year?). The fix for today’s issue (which is how I always fix issues) was to use Terraform to taint the job and re-apply it. (I believe this is the equivalent of stopping and starting the job.)
Since failures are so few and far between, I think I need to put together a generic troubleshooting document. What exactly should I be checking for? In today’s example, when I ran “nomad status ”, the output looked like this:
ID            = shipping-rates
Name          = shipping-rates
Submit Date   = 11/11/19 21:00:08 UTC
Type          = service
Priority      = 50
Datacenters   = my-data-center
Status        = dead
Periodic      = false
Parameterized = false

Summary
Task Group      Queued  Starting  Running  Failed  Complete  Lost
shipping-rates  2       0         0        0       0         0

Allocations
No allocations placed
This is where I got stuck. I did not know what to check after that. I did some googling and found suggestions involving commands that take eval and alloc IDs, which I did not have. I think I need to find the following information:
- Why did Nomad change anything? The service was running, Nomad decided some action needed to be taken, and now the service is not running. So question 1: why did Nomad decide to change anything?
- When did the problem begin? I assume this is directly tied to the first question: the moment Nomad decided to make a change is probably when the issue occurred.
- Why was the change unsuccessful? In today’s example, I assume the jobs were stuck at queued until some timeout, which led to the dead status.
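My current guess is that all three questions map onto the job's evaluation records, which can be listed per job with `nomad job status -evals` on the CLI or `/v1/job/<job>/evaluations` on the HTTP API, so no eval ID is needed up front. As far as I can tell, `TriggeredBy` covers why Nomad acted, the creation order (plus `CreateTime`, where the Nomad version reports it) covers when, and `FailedTGAllocs` covers why placement failed. Below is a minimal Python sketch of that idea. The address is the default local agent and an assumption about my setup (no ACL token, no TLS), the function names are made up, and I have only picked a few of the placement metrics.

```python
import json
import urllib.parse
import urllib.request
from datetime import datetime, timezone

NOMAD_ADDR = "http://127.0.0.1:4646"  # assumption: default local agent, no ACLs/TLS


def fetch_evaluations(job_id, addr=NOMAD_ADDR):
    """Return the evaluations Nomad has recorded for one job (hypothetical helper)."""
    url = f"{addr}/v1/job/{urllib.parse.quote(job_id, safe='')}/evaluations"
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)


def explain_evaluations(evals):
    """Summarize why / when / why-it-failed for each evaluation, oldest first."""
    report = []
    for ev in sorted(evals, key=lambda e: e.get("CreateIndex", 0)):
        # When: CreateTime is nanoseconds since the epoch. Older Nomad
        # versions may not report it, so fall back to the Raft index.
        created = ev.get("CreateTime")
        if created:
            when = datetime.fromtimestamp(created / 1e9, tz=timezone.utc).isoformat()
        else:
            when = f"index {ev.get('CreateIndex', '?')}"

        # Why unsuccessful: FailedTGAllocs maps task group -> placement metrics.
        failures = []
        for group, metric in (ev.get("FailedTGAllocs") or {}).items():
            failures.append({
                "group": group,
                "nodes_evaluated": metric.get("NodesEvaluated", 0),
                "constraint_filtered": metric.get("ConstraintFiltered") or {},
                "dimension_exhausted": metric.get("DimensionExhausted") or {},
            })

        report.append({
            "eval": ev.get("ID", "?"),
            "why": ev.get("TriggeredBy", "?"),  # e.g. job-register, node-update
            "when": when,
            "status": ev.get("Status", "?"),
            "failures": failures,
        })
    return report
```

In my dev environment I would call it as `explain_evaluations(fetch_evaluations("shipping-rates"))` and print each entry. Any eval ID it returns could then be passed to `nomad eval status` for the full detail. I am not sure how long Nomad keeps old evaluations before garbage-collecting them, so this probably has to be run soon after a failure.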
At this time we have a very small number of jobs running under Nomad. I also have a complete dev environment for trying things out. I look forward to seeing your suggestions on how you troubleshoot your own weird issues.