Cannot delete a job in my Nomad cluster

I have a weird job in my Nomad cluster that I am unable to get rid of permanently.

I run

nomad job stop --purge

and after a few seconds I follow it up with

nomad system gc

But soon afterwards I see the job back in both the UI and the CLI.
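
The sequence described above can be sketched as a small shell helper that also checks whether the job is still registered afterwards. This is only a sketch: `purge_and_check`, `NOMAD_BIN`, and `GC_DELAY` are hypothetical names that do not come from the post, and it assumes that `nomad job status <job>` exits non-zero once no job matches. Note that `job status` matches by prefix, so a similarly named job could make the check report a false positive.

```shell
#!/bin/sh
# Sketch only: purge a job, force a garbage collection, then check whether
# the job is still registered. NOMAD_BIN and GC_DELAY are hypothetical knobs
# added here so the sequence can be dry-run (for example NOMAD_BIN=echo).
NOMAD_BIN="${NOMAD_BIN:-nomad}"
GC_DELAY="${GC_DELAY:-5}"

purge_and_check() {
    job="$1"
    if [ -z "$job" ]; then
        echo "usage: purge_and_check <job-name>" >&2
        return 2
    fi

    # Stop the job and ask the servers to purge it immediately.
    "$NOMAD_BIN" job stop -purge "$job" || return 1

    # Wait a few seconds, as in the post, then force a garbage collection.
    sleep "$GC_DELAY"
    "$NOMAD_BIN" system gc || return 1

    # Assumption: `job status` fails when no job matches the name.
    if "$NOMAD_BIN" job status "$job" >/dev/null 2>&1; then
        echo "job '$job' is still registered"
        return 1
    fi
    echo "job '$job' is gone"
}
```

Calling `purge_and_check my-job` (with `my-job` standing in for the real job name) a few minutes after the first run would show whether the job has come back.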

I tried reducing job_eval_threshhold to 30s, thinking that might help, but no dice! I am out of ideas … any help is appreciated.

My Nomad cluster is installed with Consul for service discovery.

Additional information: this is a system job. When I edit the job in the UI (for example, by just changing the Docker image version), it deploys and after a few minutes reverts to the old version. I am currently running Nomad 1.8.4 (clients and servers) and Consul 1.19.2.

Did the servers reboot unexpectedly or something?

Can you submit an altogether different job (very low cpu/ram) with the same name?

Then run nomad system gc, then nomad system reconcile summaries.

Then nomad job stop --purge this new job.

Then run nomad system gc and nomad system reconcile summaries again.
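
The steps above could be strung together roughly as follows. This is a sketch under assumptions, not a verified fix: `overwrite_then_purge`, `gc_and_reconcile`, and `NOMAD_BIN` are hypothetical names, and the placeholder job file (passed as the second argument) is assumed to declare a tiny job under the same name as the stuck one.

```shell
#!/bin/sh
# Sketch of the suggested workaround: overwrite the stuck job with a tiny
# placeholder of the same name, clean up, purge it, and clean up again.
# NOMAD_BIN is a hypothetical override for dry runs (e.g. NOMAD_BIN=echo).
NOMAD_BIN="${NOMAD_BIN:-nomad}"

gc_and_reconcile() {
    # Force garbage collection, then rebuild the job summaries.
    "$NOMAD_BIN" system gc && "$NOMAD_BIN" system reconcile summaries
}

overwrite_then_purge() {
    job="$1"
    placeholder_spec="$2"
    if [ -z "$job" ] || [ -z "$placeholder_spec" ]; then
        echo "usage: overwrite_then_purge <job-name> <placeholder-job-file>" >&2
        return 2
    fi

    # Submit the low-resource placeholder under the same job name.
    "$NOMAD_BIN" job run "$placeholder_spec" || return 1
    gc_and_reconcile || return 1

    # Purge the placeholder, then clean up once more.
    "$NOMAD_BIN" job stop -purge "$job" || return 1
    gc_and_reconcile
}
```

Setting `NOMAD_BIN=echo` before calling `overwrite_then_purge` prints the commands instead of running them, which is a cheap way to review the order first.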

Can you afford to update the Nomad server binaries and cleanly reboot the servers?

I deployed a new system job with the same name but it failed and reverted to the old version.

I will try the binary update.

I want to circle back to this to say that I too faced this recently.

I tried all the command tricks I suggested myself; none of them worked.

As it was a critical job, I renamed the job by appending a _1 to the job name and resubmitted. The job came up then.
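
For anyone wanting to script the rename, a minimal sketch is below. `rename_job` is a hypothetical helper, and it assumes the job file declares the job on its own line as `job "<name>" {` and that the name contains nothing special to sed (a `/` would break the expression, and a `.` would match any character).

```shell
#!/bin/sh
# Sketch: read a job spec on stdin and print it with the job renamed.
# Only a line that starts (after optional indentation) with
#   job "<old>"
# is rewritten; everything else is passed through unchanged.
rename_job() {
    old="$1"
    new="$2"
    sed "s/^\([[:space:]]*\)job \"$old\"/\1job \"$new\"/"
}
```

Usage would look like `rename_job my-job my-job_1 < my-job.nomad.hcl > my-job_1.nomad.hcl`, followed by `nomad job run my-job_1.nomad.hcl`; the file names here are placeholders.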

Something for the HashiCorp team to keep an eye on.

Once I have deleted the job with a purge, I can’t think of what to monitor to report in the debug logs! :exploding_head:

No, there is no auto-submission or anything going on.

Where and how does Consul play a part in Nomad job submission? I am not sure Consul has anything to do with this at all.

BTW, I have noticed this behavior before, but I ignored it at the time because I was debugging something more serious.

Most of the time a delete-and-resubmit of the Nomad job works, but this time it didn’t.
Sometimes updating the Nomad binary on the compute agent also solves this problem.

I couldn’t try the “wait for some time” approach, as this particular job was a critical one.
Hence I used the “rename” trick.
