Every day or two I get a notice from one of my monitoring tools that a system job I host on my 3-node cluster isn’t up. When I check Nomad, it says the job is “running”, but there are no groups or allocations. The previous allocations aren’t shown either, so I can’t see whether they “failed” or “completed”, or tell what may have caused them to go away.
I suspect I know why they’re failing in the first place: I’m having an issue with Nomad’s service registry forgetting that something is running until I restart the alloc (there’s a GitHub issue for this). But I’m unable to confirm, because the allocations are just gone.
Is there some setting that causes these to be reaped at some interval?
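For reference, my understanding is that garbage collection of dead jobs and evaluations is tuned by settings in the agent’s `server` block, something like the below (names from the Nomad agent config docs; the defaults in the comments are my best recollection and may be off):

```hcl
server {
  enabled = true

  # How often the server runs its GC sweep (default 5m, I believe)
  job_gc_interval = "5m"

  # How long a job must be in a terminal state before it is
  # eligible for collection (default 4h, I think)
  job_gc_threshold = "4h"

  # Evaluations older than this are eligible for collection
  # (default 1h, I think)
  eval_gc_threshold = "1h"
}
```

I also know `nomad system gc` can force a collection, but I haven’t run that manually, so I don’t think that explains it.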
Running on Nomad 1.6.1.