I have a weird job in my Nomad cluster that I am unable to get rid of permanently.
I run
nomad job stop --purge
and after a few seconds I follow it up with
nomad system gc
But soon afterwards I see the job back in the UI and the CLI again.
I have attempted to reduce the job_eval_threshold to 30s, thinking maybe that would work. No dice! I am short of ideas … any help is appreciated.
My Nomad cluster is installed with Consul for service discovery.
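For reference, the GC tuning I tried lives in the server block of the agent configuration. Roughly like this (a sketch using the standard job/eval GC threshold settings, not my exact file):

server {
  enabled = true

  # minimum time a stopped job / terminal eval must sit before it is eligible for GC
  job_gc_threshold  = "30s"
  eval_gc_threshold = "30s"
}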
Additional information: this is a system job. When I attempt to edit the job in the UI (for example, by simply changing the Docker image version), it deploys and after a few minutes reverts to the old version. I am currently running Nomad 1.8.4 (clients and servers) and Consul 1.19.2.
Did the servers reboot unexpectedly or something?
Can you submit an altogether different job (very low CPU/RAM) with the same name? Then run system gc, then system reconcile summaries, then stop --purge this new job, then system gc, then system reconcile summaries again. The full sequence is sketched below.
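Something along these lines, where the jobspec file name and <job-name> are placeholders:

nomad job run throwaway.nomad.hcl      # a tiny, different job that reuses the same job name
nomad system gc
nomad system reconcile summaries
nomad job stop -purge <job-name>       # <job-name> is the shared name of the stuck/throwaway job
nomad system gc
nomad system reconcile summaries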
Can you afford to update the Nomad server binaries and cleanly reboot the servers?
I deployed a new system job with the same name but it failed and reverted to the old version.
I will try the binary update.
I want to circle back to this to say that I too faced this recently.
I tried all the command tricks I myself suggested, but none worked.
As it was a critical job, I renamed the job by appending a _1
to the job name and resubmitted. The job came up then.
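Concretely, the only change was the name at the top of the jobspec; a minimal sketch with placeholder names and a placeholder image (the real job carries more configuration):

job "myjob_1" {
  type = "system"

  group "app" {
    task "app" {
      driver = "docker"

      config {
        image = "registry.example.com/myapp:1.2.3"
      }

      resources {
        cpu    = 100
        memory = 128
      }
    }
  }
}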
Something for the HashiCorp team to keep an eye on.
Once I have deleted the job with a purge, I can’t think of what to monitor to report in the debug logs! 
No, there is no auto-submission or anything going on.
Where/how does Consul play a part in Nomad job submission? I am not sure Consul has anything to do with this at all.
BTW, I had noticed this behavior before, but ignored it previously as I would be debugging something else more serious.
Most of the time a delete-and-resubmit of the Nomad job works, but this time it didn't.
Sometimes updating the Nomad binary on the compute agent also solves this problem.
I couldn’t try the “wait for some time” approach, as this particular job was a critical one.
Hence I used the “rename” trick.