Nomad system jobs lose all allocations for no apparent reason and don't restart them

Hi folks, I’ve been seeing some mighty odd behaviour out of Nomad over the past few months. Unfortunately it’s been a pain in the ass to debug properly, so I can’t really show much in the way of logs or anything else (since it’s also our production cluster, I’d rather not enable debug logging etc. until absolutely necessary). So I’m hoping this rings a bell with someone.

We have a few system jobs running that contain two task groups: one has a Consul Connect mesh gateway task, and the other has a Traefik instance. These are basically our “ingress” nodes for all the traffic in the cluster(s). As of a few months ago, occasionally all allocations will just… stop.

The job itself still shows as “running”, but when you view it there are no running allocations, and there is also no history (summaries, I believe). That led me to suspect garbage collection got to it before I did; however, I’ve looked at it 30 minutes after this happened and still saw no history of allocations or anything else, which is odd because GC doesn’t run that fast. Unless GC itself is the cause, which would seem very weird.

There are also no running deployments, and no running/pending evaluations.

Running nomad job eval <ingress-job> will create new allocations.
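For anyone scripting that recovery step, the CLI command maps onto Nomad’s job-evaluate HTTP endpoint. A minimal sketch in Python (stdlib only; the address is the default local agent, the helper names are my own, and you should double-check the payload shape against the API docs for your Nomad version):

```python
import json
import urllib.request

NOMAD_ADDR = "http://127.0.0.1:4646"  # assumption: default local Nomad agent


def build_eval_request(job_id: str, force_reschedule: bool = False):
    """Build the URL and payload for Nomad's job-evaluate endpoint.

    Mirrors `nomad job eval [-force-reschedule] <job>`; the payload
    follows the documented PUT /v1/job/:job_id/evaluate API.
    """
    url = f"{NOMAD_ADDR}/v1/job/{job_id}/evaluate"
    payload = {"JobID": job_id,
               "EvalOptions": {"ForceReschedule": force_reschedule}}
    return url, payload


def trigger_eval(job_id: str, force_reschedule: bool = False):
    """Issue the evaluation; the response describes the created eval."""
    url, payload = build_eval_request(job_id, force_reschedule)
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode(),
        method="PUT",
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())
```

`trigger_eval("ingress-job")` is then equivalent to the manual CLI invocation above.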

The weird thing is that this only seems to affect system jobs. The nodes they run on are “dedicated” to those particular jobs, so preemption is not (or should not be) an issue. There is a constraint in place that pins the system jobs to those particular nodes (by way of a node class). In Consul, the registered service obviously switches to unhealthy, but it is not deregistered (probably because Nomad still considers the job to be “running”).

I’m guessing the “running” part is because other nodes matching the constraint could appear later, so it pays to keep the job alive. Perhaps that’s also an idea for a feature: a flag in the job spec that says “fail this entire job if we can’t place any allocations for whatever reason”. Most of our monitoring looks at Consul and at Nomad’s job list, and at the moment it can’t quite make anything of this situation because Consul says one thing and Nomad says another. Granted, that’s easy to fix on our end, but I’d rather this strange behaviour either stops, or someone can tell me what causes it so I can do something about it :smiley:

Thanks for your time y’all!

Hi Ben, I’m curious… did you manage to resolve this? I have the same problem, intermittently. This week it has happened multiple times and in one case caused a multi-hour outage of one service.

At this point I’m doing one last search for answers before I embark on converting all our Nomad system jobs to systemd services managed by Ansible.

I’ve not found any issue reports about this specifically, but #18267 is on my radar, as well as some general issues tracking other problems with system jobs, e.g. #12023.

Yes and no. We never really found a reason why it happens. At least we know why the job keeps “running”, but we never figured out why it won’t place new allocations. Personally I think it’s all about timing: a node being briefly ineligible, or a fingerprinter missing something at just the right moment, causing an allocation not to be placed.

What we did do is tweak our monitoring a little: if a Nomad job shows as unhealthy in Consul, it checks whether the job has any allocations. If there’s a discrepancy, it issues a job evaluation through Nomad’s API and gives it 5 minutes to pick itself up. If the discrepancy persists, it will (depending on the job) either scream bloody murder on our chat app, or run a new evaluation with the force-reschedule flag set. If that latter one also doesn’t work, it will again either scream bloody murder, or flat out purge the job and re-submit it to Nomad so it basically gets fully restarted.
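For what it’s worth, that escalation ladder can be sketched as a small decision function. The names, the per-job auto-remediate split, and the 5-minute threshold are illustrative, but the ordering matches what I described above:

```python
from enum import Enum


class Action(Enum):
    WAIT = "wait"                        # no discrepancy, or eval still settling
    ALERT = "alert"                      # scream bloody murder on chat
    FORCE_RESCHEDULE = "force-reschedule"
    PURGE_AND_RESUBMIT = "purge-and-resubmit"


def next_action(running_allocs: int, minutes_since_eval: float,
                auto_remediate: bool, already_forced: bool) -> Action:
    """Pick the next step for a job that Consul marks unhealthy while
    Nomad still reports it as running. Thresholds are illustrative."""
    if running_allocs > 0:
        return Action.WAIT               # discrepancy resolved itself
    if minutes_since_eval < 5:
        return Action.WAIT               # give the plain re-eval 5 minutes
    if not auto_remediate:
        return Action.ALERT              # some jobs we only alert on
    if not already_forced:
        return Action.FORCE_RESCHEDULE   # retry with the force-reschedule flag
    return Action.PURGE_AND_RESUBMIT     # last resort: purge and re-submit
```

The nice thing about keeping the decision pure like this is that the alerting/purging side effects stay in one thin wrapper around it.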

It’s not ideal, but at the same time, since upgrading Nomad to 1.5 we haven’t seen this failure mode nearly as often; it’s now down to maybe once in a blue moon. It did, however, show us that our monitoring and alerting were lacking in a lot of areas, so I guess it had that as a benefit.