Nomad job placement failures: Constraint missing devices filtered 9 nodes

Hello, I am trying to run a GPU job on my Nomad cluster. I have one host with the NVIDIA driver and the NVIDIA Container Toolkit installed, running as a Nomad client. This is the configuration of my Nomad job:

job "gpu-test" {
  type        = "batch"
  datacenters = ${datacenters}
  region      = "${nomad_region}"
  namespace   = "system"

  group "smi" {
    task "smi" {
      driver = "docker"

      config {
        image   = "nvidia/cuda:11.0-base"
        command = "nvidia-smi"
      }

      resources {
        cpu    = 500
        memory = 256

        device "nvidia/gpu" {
          count = 1

          constraint {
            attribute = "${var.device_vendor}"
            value     = "nvidia"
          }

          constraint {
            attribute = "${var.device_type}"
            value     = "gpu"
          }
        }
      }
    }
  }
}

When deployed, the job fails with the error:

Placement Failures

smi

1 unplaced

  • Constraint missing devices filtered 9 nodes

Could you please let me know what could be going wrong?

It sounds like your GPU job is failing placement because no client node advertises the device it requests; "Constraint missing devices" means the scheduler filtered out every node for lacking a matching fingerprinted device. Here's a quick rundown of what to check:

1. NVIDIA & Docker setup: verify that the NVIDIA driver and the NVIDIA Container Toolkit are installed and that Docker itself can run GPU containers (for example, by running the same nvidia/cuda image with nvidia-smi directly through Docker).
2. Nomad client configuration: make sure the Nomad client is configured to fingerprint GPUs. Depending on your Nomad version this is the built-in "nvidia-gpu" device plugin or the external nomad-device-nvidia plugin; see the sketch after this list.
3. Job file constraints: ensure the device constraints in your job file are set correctly. For troubleshooting, replace the interpolated ${var.*} values with literal device attributes such as "${device.vendor}"; see the example after this list.
4. GPU visibility in Nomad: use nomad node status -verbose on the client to confirm Nomad has actually fingerprinted the GPU; see the command after this list.
5. Version compatibility: ensure the versions of Nomad, the NVIDIA driver, and Docker are compatible with each other.
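
For item 2, here is a minimal sketch of the client-side configuration, assuming a recent Nomad version where NVIDIA support ships as the external nomad-device-nvidia plugin installed into the client's plugin_dir; on older Nomad versions (before 1.2) the device plugin was built in and the block is named "nvidia-gpu" instead:

# client.hcl (sketch) -- enable the NVIDIA device plugin so the client
# fingerprints GPUs and advertises them to the scheduler
plugin "nomad-device-nvidia" {
  config {
    enabled = true
  }
}

For item 3, a simplified version of your resources block using literal device attributes instead of ${var.*} interpolation. Requesting "nvidia/gpu" already scopes the request to NVIDIA GPUs (device names follow <vendor>/<type>), so the constraint below is redundant and shown only as an illustration:

resources {
  cpu    = 500
  memory = 256

  # "nvidia/gpu" already selects NVIDIA GPUs; the constraint is optional
  device "nvidia/gpu" {
    count = 1

    constraint {
      attribute = "${device.vendor}"
      value     = "nvidia"
    }
  }
}

For item 4, you can check whether the client has fingerprinted the GPU with:

nomad node status -verbose <client-node-id>

If that output does not list an nvidia/gpu device for the client, the scheduler has nothing to place the job on, which is exactly what the "Constraint missing devices" message reports.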

If these steps don’t resolve the issue, checking the Nomad client logs for more specific errors would be the next step. Hope this helps! Let me know if you need further assistance.