I'm hoping someone can help me out: I'm running into issues getting GPU, Docker, and Nomad to work together.
My Docker container runs fine directly on the agent (worker) node:
sudo docker run --gpus 1 -it tait:local nvidia-smi
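Note that --gpus goes through the NVIDIA container toolkit hook directly. As far as I understand, the closer equivalent to what Nomad's Docker driver asks the daemon for (see the driver failure further down) would be something like this, assuming the default runtime name "nvidia":

sudo docker run --runtime=nvidia -it tait:local nvidia-smi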
Nomad and Docker also work together: when I submit a job that doesn't require a GPU, it runs fine. This HCL job, however, does not work:
job "gpu-test" {
datacenters = ["dc1"]
type = "batch"
group "smi" {
task "smi" {
driver = "docker"
config {
image = "nvidia/cuda:9.0-base"
command = "nvidia-smi"
}
resources {
device "nvidia/gpu" {
count = 1
# Add an affinity for a particular model
affinity {
attribute = "${device.model}"
value = "GeForce RTX 2070"
weight = 50
}
}
}
}
}
}
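For context, my understanding is that when a task claims an nvidia/gpu device, Nomad's Docker driver doesn't use docker's --gpus flag; it starts the container with a named Docker runtime, controlled by the client-side driver config, roughly like this sketch (nvidia_runtime defaults to "nvidia"):

plugin "docker" {
  config {
    # Name of the Docker runtime Nomad requests for GPU tasks;
    # defaults to "nvidia", which matches the error below.
    nvidia_runtime = "nvidia"
  }
}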
My node can see the GPU, so I believe the device plugin is working fine:
ID              = 328c60ca-999a-bfcc-4360-6fd773e359dd
Name            = QQQQ
Class           = <none>
DC              = dc1
Drain           = false
Eligibility     = eligible
Status          = ready
CSI Controllers = <none>
CSI Drivers     = <none>
Uptime          = 968h37m49s
Host Volumes    = <none>
CSI Volumes     = <none>
Driver Status   = docker,exec,java,raw_exec
Node Events
Time                       Subsystem       Message
2021-12-13T21:21:32-05:00  Driver: docker  Healthy
2021-12-13T21:19:32-05:00  Driver: docker  Failed to connect to docker daemon
2021-12-13T18:55:02-05:00  Driver: docker  Healthy
2021-12-13T18:53:32-05:00  Driver: docker  Failed to connect to docker daemon
2021-11-28T20:26:54-05:00  Driver: docker  Healthy
2021-11-28T20:24:54-05:00  Driver: docker  Failed to connect to docker daemon
2021-11-04T16:00:03-04:00  Cluster         Node re-registered
2021-11-04T15:59:01-04:00  Cluster         Node heartbeat missed
2021-09-08T01:09:00-04:00  Cluster         Node reregistered by heartbeat
2021-09-08T01:07:09-04:00  Cluster         Node heartbeat missed
Allocated Resources
CPU          Memory      Disk
0/37600 MHz  0 B/16 GiB  0 B/582 GiB

Allocation Resource Utilization
CPU          Memory
0/37600 MHz  0 B/16 GiB

Host Resource Utilization
CPU            Memory          Disk
140/37600 MHz  908 MiB/16 GiB  372 GiB/937 GiB

Device Resource Utilization
nvidia/gpu/GeForce RTX 2070 SUPER[GPU-0670bb26-184b-5651-7b8d-15dc8060fb8c]  17 / 7979 MiB
Allocations
ID        Node ID   Task Group       Version  Desired  Status  Created    Modified
4e9f1b24  328c60ca  smi              1        run      failed  1h39m ago  1h39m ago
c3ae71e1  328c60ca  smi              1        stop     failed  1h39m ago  1h39m ago
fb89ee9e  328c60ca  smi              0        run      failed  3h4m ago   3h4m ago
827bde9b  328c60ca  smi              0        stop     failed  3h4m ago   3h4m ago
103dbb95  328c60ca  smi              0        run      failed  3h11m ago  3h11m ago
dc7969c9  328c60ca  smi              0        stop     failed  3h11m ago  3h11m ago
fbd84a06  328c60ca  prepros_group_1  0        run      failed  4h22m ago  4h21m ago
a0000ff6  328c60ca  prepros_group_1  0        stop     failed  4h22m ago  4h22m ago
My Nomad version is:
Nomad v1.0.5 (0b870631cfa0c8e52cf698d7a7cc7989fbaec576)
And finally, here are the logs from the failed allocation:
Recent Events:
Time                       Type            Description
2021-12-14T21:57:53-05:00  Killing         Sent interrupt. Waiting 5s before force killing
2021-12-14T21:57:53-05:00  Not Restarting  Error was unrecoverable
2021-12-14T21:57:53-05:00  Driver Failure  Failed to create container configuration for image "nvidia/cuda:9.0-base" ("sha256:0bedd0dfd4cb07826b29ec11be4b8346cd695a517b5f23cd60dc6cf364efdc6a"): requested docker runtime "nvidia" was not found
2021-12-14T21:57:52-05:00  Task Setup      Building Task Directory
2021-12-14T21:57:52-05:00  Received        Task received by client
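The driver failure says the Docker daemon has no runtime registered under the name "nvidia". For reference, my understanding is that this runtime is normally registered in /etc/docker/daemon.json, roughly like the following sketch (the path may differ depending on how the toolkit was installed), followed by a daemon restart so dockerd picks it up:

{
  "runtimes": {
    "nvidia": {
      "path": "nvidia-container-runtime",
      "runtimeArgs": []
    }
  }
}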
I also have the NVIDIA container runtime packages:
apt-cache search nvidia-container
libnvidia-container-dev - NVIDIA container runtime library (development files)
libnvidia-container-tools - NVIDIA container runtime library (command-line tools)
libnvidia-container1-dbg - NVIDIA container runtime library (debugging symbols)
libnvidia-container1 - NVIDIA container runtime library
nvidia-container-runtime - NVIDIA container runtime
nvidia-container-toolkit - NVIDIA container runtime hook
nvidia-container-runtime-hook - NVIDIA container runtime hook
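Since apt-cache search only lists what is available in the configured repos, not what is actually installed, I would also check something like:

# is the runtime package actually installed?
dpkg -l nvidia-container-runtime

# is the binary on PATH where dockerd can find it?
which nvidia-container-runtime

# which runtimes has the daemon actually registered?
sudo docker info | grep -i runtimes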