I have a weird job in my Nomad cluster that I am unable to get rid of permanently.
I run
nomad job stop --purge
and after a few seconds I follow it up with
nomad system gc
But soon afterwards I see the job back in the UI and the CLI again.
I have attempted to reduce the job_eval_threshold to 30s, thinking maybe that would work. No dice! I am short of ideas … any help is appreciated.
My Nomad cluster is installed with Consul for service discovery.
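For reference, the GC tuning I tried lives in the server block of the agent configuration. Roughly like this (a sketch using the standard job/eval GC threshold settings, not my exact file):

server {
  enabled = true

  # minimum time a stopped job / terminal eval must sit before it is eligible for GC
  job_gc_threshold  = "30s"
  eval_gc_threshold = "30s"
}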
Additional information: this is a system job. When I attempt to edit the job in the UI (for example, by simply changing the Docker image version), it deploys and after a few minutes reverts to the old version. I am currently running Nomad 1.8.4 (clients and servers) and Consul 1.19.2.
Did the servers reboot unexpectedly or something?
Can you submit an altogether different job (very low CPU/RAM) with the same name? Then run system gc, then system reconcile summaries, then stop --purge this new job, then system gc, then system reconcile summaries again. The full sequence is sketched below.
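Something along these lines, where the jobspec file name and <job-name> are placeholders:

nomad job run throwaway.nomad.hcl      # a tiny, different job that reuses the same job name
nomad system gc
nomad system reconcile summaries
nomad job stop -purge <job-name>       # <job-name> is the shared name of the stuck/throwaway job
nomad system gc
nomad system reconcile summaries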
Can you afford to update the Nomad server binaries and cleanly reboot the servers?
I deployed a new system job with the same name but it failed and reverted to the old version.
I will try the binary update.
I want to circle back to this to say that I too faced this recently.
I tried all the command tricks I myself suggested, but none worked.
As it was a critical job, I renamed the job by appending a _1
to the job name and resubmitted. The job came up then.
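Concretely, the only change was the name at the top of the jobspec; a minimal sketch with placeholder names and a placeholder image (the real job carries more configuration):

job "myjob_1" {
  type = "system"

  group "app" {
    task "app" {
      driver = "docker"

      config {
        image = "registry.example.com/myapp:1.2.3"
      }

      resources {
        cpu    = 100
        memory = 128
      }
    }
  }
}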
Something for the HashiCorp team to keep an eye on.
Once I have deleted the job with a purge, I can’t think of what to monitor to report in the debug logs! 
No, there is no auto-submission or anything going on.
Where/how does Consul play a part in Nomad job submission? I am not sure Consul has anything to do with this at all.
BTW, I had noticed this behavior before, but ignored it previously as I would be debugging something else more serious.
Most of the time a delete-and-resubmit of the Nomad job works, but this time it didn't.
Sometimes updating the Nomad binary on the compute agent also solves this problem.
I couldn’t try the “wait for some time” approach, as this particular job was a critical one.
Hence I used the “rename” trick.