- kernel: 3.10.0
- nomad: 1.2.6
I’ve stopped and purged the job, but the container is still running. It has been running overnight and never shut down properly.
$ nomad stop -purge -namespace ic-es node-2
==> 2022-05-07T10:38:57+08:00: Monitoring evaluation "f122a359"
2022-05-07T10:38:57+08:00: Evaluation triggered by job "node-2"
==> 2022-05-07T10:38:58+08:00: Monitoring evaluation "f122a359"
2022-05-07T10:38:58+08:00: Evaluation status changed: "pending" -> "complete"
==> 2022-05-07T10:38:58+08:00: Evaluation "f122a359" finished with status "complete"
I then tried to remove the container manually, but that failed as well:
$ docker update --restart=no 8af99ea02f55
$ docker rm -f 8af99ea02f55
Error response from daemon: Could not kill running container 8af99ea02f557baf1e0f5df3746cffea366ad4f24cf8a079fe1c40d65bb131b9, cannot remove - tried to kill container, but did not receive an exit event
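One way to narrow this down is to look at the state of the container's main process. A kill that never produces an exit event can mean the process is stuck in uninterruptible sleep (state `D`), for example on a hung storage mount, although that is only one possible cause. The sketch below is a small helper, with the hypothetical name `proc_state`, that reads the state letter from a `/proc/<pid>/status` file.

```shell
# Print the state letter (R, S, D, Z, ...) from a /proc/<pid>/status-style file.
# proc_state is a hypothetical helper name, not part of Docker or Nomad.
proc_state() {
    awk '/^State:/ { print $2; exit }' "$1"
}

# Possible usage against the stuck container from this thread:
#   pid=$(docker inspect --format '{{.State.Pid}}' 8af99ea02f55)
#   proc_state "/proc/$pid/status"
```

If this prints `D`, signals (including SIGKILL) are not delivered until the blocking kernel operation returns, which would be consistent with the error above.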
I suspect one of two causes:
- A restart that happened partway through
- The kernel version is too old
Is there a way to stop the container? Thank you.
The only workaround I know of is the following, but I would like to solve this without restarting Docker:
$ rm -rf /var/lib/docker/containers/8af99ea02f557baf1e0f5df3746cffea366ad4f24cf8a079fe1c40d65bb131b9/
$ systemctl restart docker
The kernel has been updated to this version, but the problem persists.
Hi @x602. Thanks for using Nomad!
I’m sorry you are running into this issue. I’ve got some basic troubleshooting questions to ask if that’s ok.
- If you run nomad status <job-name>, do you still see the job as running? Probably not, but worth asking as a starting point.
- It looks like you are using Ceph with this container (csi-cephrbd-node). Is that correct? Have you looked at this doc to make sure you have completed all the necessary steps? For instance, do you have your controller job deployed, and is it healthy?
- Can you post your job files?
- Are you running this job as a system job or a
- Do you have any server or client logs available, and if so, do you see any messages that seem relevant? You might grep for volume_manager as a start.
- If you run nomad status <job-name> before purging, make a note of the alloc ID, and then purge the job, do you still see a directory in the alloc dir with that ID as its name?
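The last check in the list can be scripted. This is a minimal sketch, assuming the alloc directories live under `<data_dir>/allocs/` as in the cleanup script in this reply; `alloc_dir_exists` is a hypothetical helper name.

```shell
# Return 0 if <data_dir>/allocs/<alloc-id> still exists, non-zero otherwise.
# $1: Nomad data_dir, $2: alloc ID noted before purging the job.
alloc_dir_exists() {
    [ -d "$1/allocs/$2" ]
}

# Possible usage (substitute your own data_dir and alloc ID):
#   nomad status -namespace ic-es node-2      # note the alloc ID
#   nomad stop -purge -namespace ic-es node-2
#   alloc_dir_exists <data_dir> <alloc-id> && echo "alloc dir was not cleaned up"
```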
If you do still see a directory for the container in the alloc dir, then something went wrong while cleaning up the job. It might be a bug in Nomad, or it might be a misconfiguration. When developing, if I create a situation where cleanup didn’t happen correctly, I run the following script to clean up orphaned alloc directories. I realize it’s only a temporary fix and not a solution, but it may help you avoid having to restart Docker. That’s a hope more than a guarantee.
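# Stop the Nomad agent before touching its data directory
systemctl stop nomad
# Unmount everything still mounted under the alloc dir, deepest paths first
grep <data_dir>/allocs/<alloc-id> /proc/mounts | cut -f2 -d" " | sort -r | sudo xargs umount -n
# Remove the orphaned alloc directory
rm -rf <data_dir>/allocs/<alloc-id>
# Start the agent again
systemctl start nomad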
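Before unmounting anything, it can help to do a dry run that only lists what would be unmounted. The sketch below is a variation on the grep pipeline above, not the exact script: it matches on the mount point field only, so an alloc ID that is a prefix of another ID is not picked up by accident. `alloc_mounts` is a hypothetical helper name, and `/opt/nomad` in the usage comment is an assumed data_dir.

```shell
# List mount points at or below an alloc directory, deepest paths first.
# $1: alloc directory, $2: mounts file to read (defaults to /proc/mounts).
alloc_mounts() {
    awk -v dir="$1" '$2 == dir || index($2, dir "/") == 1 { print $2 }' \
        "${2:-/proc/mounts}" | LC_ALL=C sort -r
}

# Possible usage (assumed data_dir; substitute your own alloc ID):
#   alloc_mounts /opt/nomad/allocs/<alloc-id>
```

Reverse sorting puts child mounts ahead of their parents, which is the order umount needs.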