Docker container created by Nomad stuck in created state

Running into a situation where one of the Nomad jobs I start always ends up in a pending state. While investigating this I found that the Docker container this job creates gets stuck in a “Created” state, and even running “docker inspect” on that container hangs.

  1. Any pointers on how to get the docker run command Nomad issued for my container, so that I can try to reproduce what happens manually?
  2. Any pointers on how to investigate this further?
  3. Is there some way I can run this setup under strace to pin down the underlying issue? (A rough sketch of what I have in mind is below.)
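
For reference, this is roughly what I was planning to try, assuming I can find the dockerd PID with pidof and that the allocation/container IDs from the log below are still around (treat the exact IDs as placeholders):

# Show how Nomad set up the failing task (events, resources, etc.)
nomad alloc status -verbose 8b683527-d053-7678-4b1f-5856b4353b05

# Check whether the daemon hangs only on this container, with a bounded wait
timeout 10 docker inspect 123138737ab823aae127f998706561ddfc10246ab69492e12c0eb451c1b3c69a

# Attach strace to the running daemon to see where the inspect call stalls
strace -f -p "$(pidof dockerd)"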

Nomad log shows

2022-09-29T16:58:13.561Z [ERROR] client.alloc_runner.task_runner: running driver failed: alloc_id=8b683527-d053-7678-4b1f-5856b4353b05 task=nginx-compute error="failed to create container: Failed to inspect container 123138737ab823aae127f998706561ddfc10246ab69492e12c0eb451c1b3c69a: Get \"http://unix.sock/containers/123138737ab823aae127f998706561ddfc10246ab69492e12c0eb451c1b3c69a/json?\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"
2022-09-29T16:58:13.561Z [INFO]  client.alloc_runner.task_runner: restarting task: alloc_id=8b683527-d053-7678-4b1f-5856b4353b05 task=nginx-compute reason="Restart within policy" delay=17.995890109s

Docker info shows

root@compute-00001:/var/lib/docker/volumes/sbin_20220917T015206/_data>docker info
Client:
 Context:    default
 Debug Mode: false
 Plugins:
  app: Docker App (Docker Inc., v0.9.1-beta3)
  buildx: Docker Buildx (Docker Inc., v0.9.1-docker)
  scan: Docker Scan (Docker Inc., v0.17.0)

Server:
 Containers: 11
  Running: 8
  Paused: 0
  Stopped: 3
 Images: 9
 Server Version: 20.10.18
 Storage Driver: overlay2
  Backing Filesystem: xfs
  Supports d_type: true
  Native Overlay Diff: true
  userxattr: false
 Logging Driver: json-file
 Cgroup Driver: cgroupfs
 Cgroup Version: 1
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay weavemesh
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: inactive
 Runtimes: io.containerd.runc.v2 io.containerd.runtime.v1.linux runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: 9cd3357b7fd7218e4aec3eae239db1f68a5a6ec6
 runc version: v1.1.4-0-g5fd4c4d
 init version: de40ad0
 Security Options:
  seccomp
   Profile: default
 Kernel Version: 4.18.0-372.19.1.el8_6.x86_64
 Operating System: Rocky Linux 8.6 (Green Obsidian)
 OSType: linux
 Architecture: x86_64
 CPUs: 8
 Total Memory: 31.11GiB
 Name: compute-00001.dt-gcp-sandbox.lan
 ID: DDNL:XIPS:UGTL:YVKS:WHZZ:P2ND:XFEO:XSHD:3VGU:H75X:5LXX:QEQI
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 Registry: https://index.docker.io/v1/
 Labels:
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false


Hi @deepankarsharm, I remember running into this problem with some regularity on older Linux kernels (nothing to do with Nomad; it is caused by Docker/Linux interactions). TBH I wouldn’t put much effort into tracking down the problem until after upgrading to something like 5.15 or later.

You should be able to clean up the stuck container with

docker container prune --force
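
Note that prune removes every container that isn’t running, not just the stuck one. If you’d rather target only that container, removing it by ID may also work (though if the daemon itself is wedged, this can hang too):

docker rm -f 123138737ab823aae127f998706561ddfc10246ab69492e12c0eb451c1b3c69a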

Ah - this is good to know. I will try to get my app running on a newer kernel to see if the issue goes away.