Nomad system jobs lose all allocations for no apparent reason, and don't restart them

Hi folks, I’ve been seeing some mighty odd behaviour from Nomad over the past few months. Unfortunately it’s been a pain in the ass to debug properly, so I can’t really show much in the way of logs or anything else (since it’s also our production cluster, I’d rather not enable debug logging etc. until absolutely necessary) — so I’m hoping this rings a bell with someone.

We have a few system jobs running that contain two task groups: one runs a Consul Connect mesh gateway task, and the other a Traefik instance — together they basically act as the “ingress” nodes for all traffic in the cluster(s). As of a few months ago, occasionally all allocations will just… stop.
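For context, the layout is roughly like this (a minimal sketch only — the job, group, and task names, the Traefik image, and the node class value are placeholders, not our actual spec):

```hcl
job "ingress" {
  type = "system"

  # Pin to the dedicated ingress nodes via node class
  constraint {
    attribute = "${node.class}"
    value     = "ingress"
  }

  group "mesh-gateway" {
    # Consul Connect mesh gateway task group
    service {
      name = "mesh-gateway"
      connect {
        gateway {
          mesh {}
        }
      }
    }
  }

  group "traefik" {
    task "traefik" {
      driver = "docker"
      config {
        image = "traefik:v2.10"
      }
    }
  }
}
```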

The job itself shows as “running” — but when you view the job, there are no running allocations, and there is also no history (allocation summaries, I believe). That would suggest garbage collection got to it before I did — except I’ve looked at it 30 minutes after this happened and still saw no history of allocations or anything else, which is odd because GC doesn’t run that fast. Unless the whole thing is *caused* by GC, which would seem very weird.

There are also no running deployments, and no running/pending evaluations.

Running `nomad job eval <ingress-job>` will create new allocations.
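In case anyone hits the same thing, the workaround in full (job name is a placeholder for our actual ingress job):

```shell
# Confirm the odd state: job "running", but no allocations listed
nomad job status ingress

# Force a new evaluation; this reliably places fresh allocations
nomad job eval ingress
```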

The weird thing is that this only seems to affect system jobs. The nodes they run on are “dedicated” to those particular jobs, so preemption is not (or should not be) an issue. There is a constraint in place that restricts the system jobs to those particular nodes (by way of a node class). In Consul, the registered service obviously switches to unhealthy, but it is not deregistered (probably because Nomad still considers the job to be “running”).

I’m guessing the “running” status is because other nodes matching the constraint could appear later, so it pays to keep the job alive. Perhaps that’s also an idea for a feature: a flag in the job spec that says “fail this entire job if we can’t place any allocations, for whatever reason”. Most of our monitoring looks at Consul and at Nomad’s job list, and at the moment it can’t quite make anything of this situation because Consul says one thing and Nomad says another. Granted, that’s easy to fix on our end, but I’d rather this strange behaviour either stops, or someone can tell me what causes it so I can do something about it :smiley:

Thanks for your time y’all!