CSI volume unable to mount on system jobs

I’m pretty new to Nomad, having just started setting it up on some Raspberry Pis in my homelab, and I’m running into an issue getting system type jobs (specifically Traefik) to attach NFS shares via the CSI plugin (to store certificates from Let’s Encrypt).

I can start the controller.nomad and node.nomad jobs and create the letsencrypt volume (all files are included below) with no issues, and they all show up in the Nomad CLI and GUI. When I try to start the Traefik job with type = "system", however, I get the following output:

==> 2023-01-08T06:07:36Z: Monitoring evaluation "7aba7d02"
2023-01-08T06:07:36Z: Evaluation triggered by job "traefik"
2023-01-08T06:07:37Z: Evaluation status changed: "pending" -> "complete"
==> 2023-01-08T06:07:37Z: Evaluation "7aba7d02" finished with status "complete" but failed to place all allocations:
2023-01-08T06:07:37Z: Task Group "traefik" (failed to place 1 allocation):
* Constraint "missing CSI Volume letsencrypt": 3 nodes excluded by filter

If I change the type to service, the job starts, attaches to the volume with no issues, and can write to the share. Switching the type back to system brings the placement failure back.

What am I missing that would allow the job to attach to the volume when it runs as a system job? Is there a better way of attaching an NFS volume to the Traefik containers?

  • I’m running Nomad v1.4.3
  • I’m using my workstation to run an NFS server and all the NFS shares are mountable/accessible by the Raspberry Pis when I manually mount them
  • I’m running Traefik for routing and obtaining Let’s Encrypt certificates for services on the network
  • I’ve tried with both the k8s NFS CSI plugin and with this NFS CSI plugin with the same behavior from both.

controller.nomad

job "plugin-nfs-controller" {
  datacenters = ["homelab"]
  type = "system"
  group "controller" {
    task "plugin" {
      driver = "docker"

      config {
        image = "registry.k8s.io/sig-storage/nfsplugin:v4.1.0"
        args = [
          "--v=5",
          "--nodeid=${attr.unique.hostname}",
          "--endpoint=unix:///csi/csi.sock",
          "--drivername=nfs.csi.k8s.io"
        ]

        privileged = true
      }

      csi_plugin {
        id = "nfsofficial"
        type = "controller"
        mount_dir = "/csi"
      }

      resources {
        memory = 32
        cpu = 100
      }
    }
  }
}

node.nomad

job "plugin-nfs-nodes" {
  datacenters = ["homelab"]
  type = "system"

  group "nodes" {
    task "plugin" {
      driver = "docker"

      config {
        image = "registry.k8s.io/sig-storage/nfsplugin:v4.1.0"
        args = [
          "--v=5",
          "--nodeid=${attr.unique.hostname}",
          "--endpoint=unix:///csi/csi.sock",
          "--drivername=nfs.csi.k8s.io",
        ]
        privileged = true
      }

      csi_plugin {
        id = "nfsofficial"
        type = "node"
        mount_dir = "/csi"
      }

      resources {
        memory = 10
      }
    }
  }
}

letsencrypt.volume

type = "csi"
id = "letsencrypt"
name = "letsencrypt"
plugin_id = "nfsofficial"
external_id = "letsencrypt"

capability {
  access_mode = "multi-node-multi-writer"
  attachment_mode = "file-system"
}

parameters {
  server = "192.168.1.190"
  share = "/mnt/homelab/letsencrypt"
  mountPermissions = "0"
}

mount_options {
  fs_type = "nfs"
  mount_flags = [ "timeo=30", "intr", "vers=3", "_netdev", "nolock" ]
}

traefik.nomad

job "traefik" {
  region = "global"
  datacenters = ["homelab"]
  type = "system"

  group "traefik" {
    count = 1

    network {
       ...
    }

    service {
      name = "traefik-http"
      provider = "nomad"
      port = "http"

      tags = [
        "traefik.enable=true",
      ]
    }

    volume "letsencrypt" {
      type = "csi"
      source = "letsencrypt"
      attachment_mode = "file-system"
      read_only = false
      access_mode = "multi-node-multi-writer"
    }

    task "server" {
      driver = "docker"

      volume_mount {
        volume = "letsencrypt"
        destination = "/opt/acme"
        read_only = false
      }

      config {
        image = "traefik:latest"
        ...

Hi @SenseiRat, I recently hit a similar problem (albeit with a different CSI plugin). It seems that this is a bug in CSI volume handling for System type jobs. It’s already fixed and will be released in 1.4.4.

See this bug: hashicorp/nomad issue #15094 on GitHub, “sysbatch/system type jobs fail to be scheduled when using a multi-node-multi-writer CSI volume”.

@mmeier86 Thanks for letting me know. I guess I’ll be eagerly awaiting v1.4.4 to do this correctly.

In the meantime, for anyone else (and I don’t know why I didn’t try this earlier): I was able to get it running in mostly the same way by changing the Traefik job to type = "service", setting count = 3 (the number of nodes I have), and adding the constraint below. I haven’t seen any issues with it so far, but I also only started playing with it again this morning.

constraint {
  operator = "distinct_hosts"
  value = "true"
}
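
For reference, here is a rough sketch of how that workaround might fit into traefik.nomad. It is a fragment, not a complete job: the network, service, and task blocks are assumed to stay exactly as in the original file, and placing the constraint at group level is my assumption, since the post above does not say where it went. Note that the operator is spelled distinct_hosts, with an underscore.

```hcl
job "traefik" {
  region      = "global"
  datacenters = ["homelab"]

  # Workaround: run as a service job instead of a system job until a
  # release containing the fix for issue #15094 is installed.
  type = "service"

  group "traefik" {
    # One allocation per client node. 3 is the node count in this
    # cluster; it has to be updated by hand if nodes are added or removed.
    count = 3

    # Keep allocations of this group on separate nodes, which
    # approximates the one-per-node placement of a system job.
    constraint {
      operator = "distinct_hosts"
      value    = "true"
    }

    volume "letsencrypt" {
      type            = "csi"
      source          = "letsencrypt"
      attachment_mode = "file-system"
      access_mode     = "multi-node-multi-writer"
      read_only       = false
    }

    # network, service, and task "server" blocks unchanged from
    # traefik.nomad above.
  }
}
```

Unlike a real system job, this does not follow the cluster size automatically: with count higher than the number of eligible nodes, the extra allocations would simply fail to place.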