Nomad job placement failures: Constraint missing devices filtered 9 nodes

Hello, I am trying to run a GPU job on my Nomad cluster. I have one host with the NVIDIA driver and the NVIDIA Container Toolkit installed, running as a Nomad client. This is the configuration of my Nomad job:

job "gpu-test" {
  type        = "batch"
  datacenters = ${datacenters}
  region      = "${nomad_region}"
  namespace   = "system"

  group "smi" {
    task "smi" {
      driver = "docker"

      config {
        image   = "nvidia/cuda:11.0-base"
        command = "nvidia-smi"
      }

      resources {
        cpu    = 500
        memory = 256

        device "nvidia/gpu" {
          count = 1

          constraint {
            attribute = "${var.device_vendor}"
            value     = "nvidia"
          }

          constraint {
            attribute = "${var.device_type}"
            value     = "gpu"
          }
        }
      }
    }
  }
}

When deployed, the job fails with the error:

Placement Failures

smi

1 unplaced

  • Constraint missing devices filtered 9 nodes

Could you please let me know what could be going wrong?

It sounds like your GPU job is failing placement because no client node advertises the device it requests; "Constraint missing devices" means the scheduler filtered out every node for lacking a matching fingerprinted device. Here's a quick rundown of what to check:

1. NVIDIA & Docker setup: verify that the NVIDIA driver and the NVIDIA Container Toolkit are installed and that Docker itself can run GPU containers (for example, by running the same nvidia/cuda image with nvidia-smi directly through Docker).
2. Nomad client configuration: make sure the Nomad client is configured to fingerprint GPUs. Depending on your Nomad version this is the built-in "nvidia-gpu" device plugin or the external nomad-device-nvidia plugin; see the sketch after this list.
3. Job file constraints: ensure the device constraints in your job file are set correctly. For troubleshooting, replace the interpolated ${var.*} values with literal device attributes such as "${device.vendor}"; see the example after this list.
4. GPU visibility in Nomad: use nomad node status -verbose on the client to confirm Nomad has actually fingerprinted the GPU; see the command after this list.
5. Version compatibility: ensure the versions of Nomad, the NVIDIA driver, and Docker are compatible with each other.
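
For item 2, here is a minimal sketch of the client-side configuration, assuming a recent Nomad version where NVIDIA support ships as the external nomad-device-nvidia plugin installed into the client's plugin_dir; on older Nomad versions (before 1.2) the device plugin was built in and the block is named "nvidia-gpu" instead:

# client.hcl (sketch) -- enable the NVIDIA device plugin so the client
# fingerprints GPUs and advertises them to the scheduler
plugin "nomad-device-nvidia" {
  config {
    enabled = true
  }
}

For item 3, a simplified version of your resources block using literal device attributes instead of ${var.*} interpolation. Requesting "nvidia/gpu" already scopes the request to NVIDIA GPUs (device names follow <vendor>/<type>), so the constraint below is redundant and shown only as an illustration:

resources {
  cpu    = 500
  memory = 256

  # "nvidia/gpu" already selects NVIDIA GPUs; the constraint is optional
  device "nvidia/gpu" {
    count = 1

    constraint {
      attribute = "${device.vendor}"
      value     = "nvidia"
    }
  }
}

For item 4, you can check whether the client has fingerprinted the GPU with:

nomad node status -verbose <client-node-id>

If that output does not list an nvidia/gpu device for the client, the scheduler has nothing to place the job on, which is exactly what the "Constraint missing devices" message reports.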

If these steps don’t resolve the issue, checking the Nomad client logs for more specific errors would be the next step. Hope this helps! Let me know if you need further assistance.