I'm hoping someone can help me out: I'm running into issues getting GPU, Docker, and Nomad to work together.
My Docker container runs fine directly on the agent (worker) node:
sudo docker run --gpus 1 -it tait:local nvidia-smi
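Note that --gpus goes through the NVIDIA container toolkit hook directly. As far as I understand, the closer equivalent to what Nomad's Docker driver asks the daemon for (see the driver failure further down) would be something like this, assuming the default runtime name "nvidia":

sudo docker run --runtime=nvidia -it tait:local nvidia-smi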
Nomad and Docker also work together: when I submit a job that doesn't require a GPU, it runs fine. This HCL job, however, does not work:
job "gpu-test" {
datacenters = ["dc1"]
type = "batch"
group "smi" {
task "smi" {
driver = "docker"
config {
image = "nvidia/cuda:9.0-base"
command = "nvidia-smi"
}
resources {
device "nvidia/gpu" {
count = 1
# Add an affinity for a particular model
affinity {
attribute = "${device.model}"
value = "GeForce RTX 2070"
weight = 50
}
}
}
}
}
}
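For context, my understanding is that when a task claims an nvidia/gpu device, Nomad's Docker driver doesn't use docker's --gpus flag; it starts the container with a named Docker runtime, controlled by the client-side driver config, roughly like this sketch (nvidia_runtime defaults to "nvidia"):

plugin "docker" {
  config {
    # Name of the Docker runtime Nomad requests for GPU tasks;
    # defaults to "nvidia", which matches the error below.
    nvidia_runtime = "nvidia"
  }
}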
My node can see the GPU, so I believe the device plugin is working fine:
ID              = 328c60ca-999a-bfcc-4360-6fd773e359dd
Name            = QQQQ
Class           = <none>
DC              = dc1
Drain           = false
Eligibility     = eligible
Status          = ready
CSI Controllers = <none>
CSI Drivers     = <none>
Uptime          = 968h37m49s
Host Volumes    = <none>
CSI Volumes     = <none>
Driver Status   = docker,exec,java,raw_exec
Node Events
Time                       Subsystem       Message
2021-12-13T21:21:32-05:00  Driver: docker  Healthy
2021-12-13T21:19:32-05:00  Driver: docker  Failed to connect to docker daemon
2021-12-13T18:55:02-05:00  Driver: docker  Healthy
2021-12-13T18:53:32-05:00  Driver: docker  Failed to connect to docker daemon
2021-11-28T20:26:54-05:00  Driver: docker  Healthy
2021-11-28T20:24:54-05:00  Driver: docker  Failed to connect to docker daemon
2021-11-04T16:00:03-04:00  Cluster         Node re-registered
2021-11-04T15:59:01-04:00  Cluster         Node heartbeat missed
2021-09-08T01:09:00-04:00  Cluster         Node reregistered by heartbeat
2021-09-08T01:07:09-04:00  Cluster         Node heartbeat missed
Allocated Resources
CPU          Memory      Disk
0/37600 MHz  0 B/16 GiB  0 B/582 GiB

Allocation Resource Utilization
CPU          Memory
0/37600 MHz  0 B/16 GiB

Host Resource Utilization
CPU            Memory          Disk
140/37600 MHz  908 MiB/16 GiB  372 GiB/937 GiB

Device Resource Utilization
nvidia/gpu/GeForce RTX 2070 SUPER[GPU-0670bb26-184b-5651-7b8d-15dc8060fb8c]  17 / 7979 MiB
Allocations
ID        Node ID   Task Group       Version  Desired  Status  Created    Modified
4e9f1b24  328c60ca  smi              1        run      failed  1h39m ago  1h39m ago
c3ae71e1  328c60ca  smi              1        stop     failed  1h39m ago  1h39m ago
fb89ee9e  328c60ca  smi              0        run      failed  3h4m ago   3h4m ago
827bde9b  328c60ca  smi              0        stop     failed  3h4m ago   3h4m ago
103dbb95  328c60ca  smi              0        run      failed  3h11m ago  3h11m ago
dc7969c9  328c60ca  smi              0        stop     failed  3h11m ago  3h11m ago
fbd84a06  328c60ca  prepros_group_1  0        run      failed  4h22m ago  4h21m ago
a0000ff6  328c60ca  prepros_group_1  0        stop     failed  4h22m ago  4h22m ago
My Nomad version is:
Nomad v1.0.5 (0b870631cfa0c8e52cf698d7a7cc7989fbaec576)
And finally, here are the logs from the failed allocation:
Recent Events:
Time                       Type            Description
2021-12-14T21:57:53-05:00  Killing         Sent interrupt. Waiting 5s before force killing
2021-12-14T21:57:53-05:00  Not Restarting  Error was unrecoverable
2021-12-14T21:57:53-05:00  Driver Failure  Failed to create container configuration for image "nvidia/cuda:9.0-base" ("sha256:0bedd0dfd4cb07826b29ec11be4b8346cd695a517b5f23cd60dc6cf364efdc6a"): requested docker runtime "nvidia" was not found
2021-12-14T21:57:52-05:00  Task Setup      Building Task Directory
2021-12-14T21:57:52-05:00  Received        Task received by client
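The driver failure says the Docker daemon has no runtime registered under the name "nvidia". For reference, my understanding is that this runtime is normally registered in /etc/docker/daemon.json, roughly like the following sketch (the path may differ depending on how the toolkit was installed), followed by a daemon restart so dockerd picks it up:

{
  "runtimes": {
    "nvidia": {
      "path": "nvidia-container-runtime",
      "runtimeArgs": []
    }
  }
}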
I also have the NVIDIA container runtime packages:
apt-cache search nvidia-container
libnvidia-container-dev - NVIDIA container runtime library (development files)
libnvidia-container-tools - NVIDIA container runtime library (command-line tools)
libnvidia-container1-dbg - NVIDIA container runtime library (debugging symbols)
libnvidia-container1 - NVIDIA container runtime library
nvidia-container-runtime - NVIDIA container runtime
nvidia-container-toolkit - NVIDIA container runtime hook
nvidia-container-runtime-hook - NVIDIA container runtime hook
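Since apt-cache search only lists what is available in the configured repos, not what is actually installed, I would also check something like:

# is the runtime package actually installed?
dpkg -l nvidia-container-runtime

# is the binary on PATH where dockerd can find it?
which nvidia-container-runtime

# which runtimes has the daemon actually registered?
sudo docker info | grep -i runtimes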