Can't stop system job docker containers, job is stuck

I have two system jobs ("Type": "system"), promtail and fabio, and they are running on almost all of the Nomad clients.

I can’t stop the allocations. After clicking “Stop allocation” or “Stop job” nothing happens, and the Docker containers keep running on the nodes. If I log in to a node via SSH and run docker stop promtail-...., the container is stopped; however, the promtail job does not schedule new allocations in Nomad. I ran nomad job stop -purge promtail and resubmitted the job with nomad job run promtail.nomad. However, the job is now “stuck” and doesn’t allocate anything:

(screenshot of the job in the Nomad UI)

If I do:

$ nomad node status | while read -r node _ name _; do echo "$name $(nomad node status "$node" | grep promtail)"; done

I still see a lot of promtail allocations running. I can go to a machine and run docker stop promtail-..... Stopping the allocation in the Nomad UI doesn’t do anything: I can click “Stop Alloc” and refresh, and the allocation is still running.

I see nothing in the logs on the Nomad client nodes, as if the client never received a message to stop the allocation.

If I stop the docker container, then:

sty 21 04:58:39 taskset[17570]:     2023-01-21T04:58:39.813-0500 [INFO]  client.alloc_runner.task_runner: not restarting task: alloc_id=4eeea601-735f-3bf4-0714-3e66b9fa2173 task=promtail reason="Policy allows no restarts"
sty 21 04:58:39 taskset[17570]:     2023-01-21T04:58:39.872-0500 [INFO]  agent: (runner) stopping
sty 21 04:58:39 taskset[17570]:     2023-01-21T04:58:39.873-0500 [INFO]  agent: (runner) received finish
sty 21 04:58:39 taskset[17570]:     2023-01-21T04:58:39.873-0500 [INFO]  client.gc: marking allocation for GC: alloc_id=4eeea601-735f-3bf4-0714-3e66b9fa2173

And the allocation becomes “FAILED” in Nomad UI.

The same happens with fabio, which is also a system job. However, fabio has restart { attempts = 100 }, so when I kill its container the client restarts it immediately:

$ docker stop fabio-80a297f5-c64c-dcce-ecb2-b9867acb02fb
sty 21 05:00:17 taskset[17570]:     2023-01-21T05:00:17.125-0500 [INFO]  client.alloc_runner.task_runner: restarting task: alloc_id=80a297f5-c64c-dcce-ecb2-b9867acb02fb task=fabio reason="Restart within policy" delay=15.504839464s
sty 21 05:00:32 taskset[17570]:     2023-01-21T05:00:32.658-0500 [INFO]  client.alloc_runner.task_runner.task_hook.logmon.nomad: opening fifo: alloc_id=80a297f5-c64c-dcce-ecb2-b9867acb02fb task=fabio @module=logmon path=/home/STRIKETECH/sysavtbuild/nomad/data/alloc/80a297f5-c64c-dcce-ecb2-b9867acb02fb/alloc/logs/.fabio.stdout.fifo timestamp=2023-01-21T05:00:32.658-0500
sty 21 05:00:32 taskset[17570]:     2023-01-21T05:00:32.658-0500 [INFO]  client.alloc_runner.task_runner.task_hook.logmon.nomad: opening fifo: alloc_id=80a297f5-c64c-dcce-ecb2-b9867acb02fb task=fabio @module=logmon path=/home/STRIKETECH/sysavtbuild/nomad/data/alloc/80a297f5-c64c-dcce-ecb2-b9867acb02fb/alloc/logs/.fabio.stderr.fifo timestamp=2023-01-21T05:00:32.658-0500
sty 21 05:00:32 taskset[17570]:     2023-01-21T05:00:32.980-0500 [INFO]  client.driver_mgr.docker: created container: driver=docker container_id=09c810d4280d7416e07ad2162b5243c6f9835b9a694d4943da4a626131792595
sty 21 05:00:33 taskset[17570]:     2023-01-21T05:00:33.432-0500 [INFO]  client.driver_mgr.docker: started container: driver=docker container_id=09c810d4280d7416e07ad2162b5243c6f9835b9a694d4943da4a626131792595

The Nomad clients have host volumes configured whose host directories are NFS mounts. Yesterday, due to maintenance, we had to restart NFS. This may be related, which is why I’m mentioning it.

In the job listing, all the evaluations are “pending”:

$ nomad status -verbose promtail | grep pending
39cbb197-40f3-2fda-980c-a09b9cdfb333  50        node-update     pending   false
..... 226 more lines ...
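To put a number on the backlog, the pending lines can be counted directly. A minimal sketch, assuming the column layout shown above (ID, Priority, Triggered By, Status, Placement Failures), so the status is the fourth field:

```shell
# Count pending evaluations for the job; the status is assumed to be
# field 4, matching the `nomad status -verbose` columns shown above.
nomad status -verbose promtail | awk '$4 == "pending" { n++ } END { print n + 0 }'
```

For the listing above this would print 227 (the line shown plus the 226 elided ones).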

Why are they pending? What are they waiting for? A single evaluation looks like the following:

$ nomad eval status -verbose 39cbb197-40f3-2fda-980c-a09b9cdfb333
ID                 = 39cbb197-40f3-2fda-980c-a09b9cdfb333
Create Time        = 2023-01-21T00:00:25-05:00
Modify Time        = 2023-01-21T00:00:25-05:00
Status             = pending
Status Description = pending
Type               = system
TriggeredBy        = node-update
Job ID             = promtail
Namespace          = services
Node ID            = 122f5664-f173-0560-a209-7d6984e25bb0
Priority           = 50
Placement Failures = false
Previous Eval      = <none>
Next Eval          = <none>
Blocked Eval       = <none>
$ nomad eval status -json 39cbb197-40f3-2fda-980c-a09b9cdfb333
{
    "AnnotatePlan": false,
    "BlockedEval": "",
    "ClassEligibility": null,
    "CreateIndex": 1787007,
    "CreateTime": 1674277225618264137,
    "DeploymentID": "",
    "EscapedComputedClass": false,
    "FailedTGAllocs": null,
    "ID": "39cbb197-40f3-2fda-980c-a09b9cdfb333",
    "JobID": "promtail",
    "JobModifyIndex": 0,
    "ModifyIndex": 1787007,
    "ModifyTime": 1674277225618264137,
    "Namespace": "services",
    "NextEval": "",
    "NodeID": "122f5664-f173-0560-a209-7d6984e25bb0",
    "NodeModifyIndex": 1787005,
    "PreviousEval": "",
    "Priority": 50,
    "QueuedAllocations": null,
    "QuotaLimitReached": "",
    "RelatedEvals": null,
    "SnapshotIndex": 0,
    "Status": "pending",
    "StatusDescription": "",
    "TriggeredBy": "node-update",
    "Type": "system",
    "Wait": 20000000000,
    "WaitUntil": null
}
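For reference, the timestamp and Wait fields in the JSON above are in nanoseconds. A quick conversion sketch (assuming GNU date) shows the eval was created at 05:00:25 UTC, matching the Create Time shown earlier, and that Wait is a 20-second delay:

```shell
# CreateTime/ModifyTime are nanoseconds since the epoch and Wait is a
# duration in nanoseconds; divide by 1e9 to get seconds.
echo $((20000000000 / 1000000000))                          # Wait: 20 seconds
date -u -d @$((1674277225618264137 / 1000000000)) +%FT%TZ   # 2023-01-21T05:00:25Z
```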

What can I do to stop all the stale containers and properly restart the system job so it schedules again? How can I debug this? What else could be relevant?

Solved by bouncing the scheduler; see GitHub.