SeaweedFS CSI Plugin no longer sees allocations

Hello, good people :wave:

I had previously gotten the seaweedfs-csi-driver working on Nomad. After returning from holiday and upgrading both my cluster and the CSI driver, it no longer works.

The peculiar thing is that the problem persists even if I roll back to the Nomad version I was previously running (1.0.7) and to the specific CSI driver revision it was built from, and even if I bootstrap the cluster from a clean state on those previously working versions.

Right now, I’m running Nomad v1.1.2 and the latest build of the CSI driver (it isn’t versioned, sadly - I’ve opened a PR to implement that).

Here’s the manifest I’m using to create the driver:

job "plugin-seaweedfs" {
  datacenters = ["pantheon"]

  type = "system"

  constraint {
    operator = "distinct_hosts"
    value    = true
  }

  group "nodes" {
    task "plugin" {
      driver = "docker"

      config {
        image = "chrislusf/seaweedfs-csi-driver:latest"

        args = [
          "--endpoint=unix://csi/csi.sock",
          "--filer=seaweedfs-filer.service.consul:8888",
          "--nodeid=${node.unique.name}",
          "--cacheCapacityMB=1000",
          "--cacheDir=/tmp",
        ]

        privileged = true
      }

      csi_plugin {
        id        = "seaweedfs"
        type      = "node"
        mount_dir = "/csi"
      }
    }
  }
}

After applying that configuration with nomad run path/to/config.hcl, the plugin is registered, but it sees no allocations and reports no healthy nodes:

$ nomad plugin status -verbose seaweedfs
ID                   = seaweedfs
Provider             = <none>
Version              = <none>
Controllers Healthy  = 0
Controllers Expected = 0
Nodes Healthy        = 0
Nodes Expected       = 3
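
In case more detail is useful, the same plugin object can be pulled from the HTTP API. This is just a sketch of how I’d query it, assuming NOMAD_ADDR points at a reachable server (defaulting to localhost) and that jq is installed:

$ curl -s "${NOMAD_ADDR:-http://127.0.0.1:4646}/v1/plugin/csi/seaweedfs" \
    | jq '{ControllersHealthy, ControllersExpected, NodesHealthy, NodesExpected, Allocations}'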

However, the allocations are all up and running successfully:

$ nomad status plugin-seaweedfs         
ID            = plugin-seaweedfs
Name          = plugin-seaweedfs
Submit Date   = 2021-07-10T15:14:12+01:00
Type          = system
Priority      = 50
Datacenters   = pantheon
Namespace     = default
Status        = running
Periodic      = false
Parameterized = false

Summary
Task Group  Queued  Starting  Running  Failed  Complete  Lost
nodes       0       0         3        0       0         0

Allocations
ID        Node ID   Task Group  Version  Desired  Status   Created    Modified
c19385aa  8b57ca98  nodes       0        run      running  7m56s ago  7m49s ago
d3647d73  e8216113  nodes       0        run      running  7m56s ago  7m54s ago
e688508d  8f9c4ea7  nodes       0        run      running  7m56s ago  7m54s ago

Taking a look at the allocation logs, things look good (this is the same log output I was seeing when I had a working configuration):

$ nomad alloc logs -stderr -verbose -f c19385aa        
I0710 14:14:19     1 main.go:36] connect to filer seaweedfs-filer.service.consul:8888
I0710 14:14:19     1 driver.go:47] Driver: seaweedfs-csi-driver version: 1.0.0
I0710 14:14:19     1 driver.go:95] Enabling volume access mode: SINGLE_NODE_WRITER
I0710 14:14:19     1 driver.go:95] Enabling volume access mode: MULTI_NODE_MULTI_WRITER
I0710 14:14:19     1 driver.go:106] Enabling controller service capability: CREATE_DELETE_VOLUME
I0710 14:14:19     1 driver.go:106] Enabling controller service capability: PUBLISH_UNPUBLISH_VOLUME
I0710 14:14:19     1 server.go:92] Listening for connections on address: &net.UnixAddr{Name:"/csi/csi.sock", Net:"unix"}
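
If it helps with debugging, this is how I’d check whether each client has actually fingerprinted the node plugin from that socket. It’s only a sketch, assuming the node API object exposes a CSINodePlugins field and that <node-id> is replaced with the full ID of one of the clients listed above:

$ curl -s "${NOMAD_ADDR:-http://127.0.0.1:4646}/v1/node/<node-id>" | jq '.CSINodePlugins'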

I really can’t work out what’s going wrong here. I’ve tried re-deploying everything from scratch (all jobs purged, Nomad stopped, everything removed from /var/lib/nomad/*, then the cluster re-bootstrapped), both on the latest versions of the driver and Nomad and on the versions of both that were previously working.
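
If anyone can tell me what to look for in the Nomad client logs, that would help too. For reference, something like the following should pull any CSI-related lines (assuming Nomad runs under systemd on the clients):

$ journalctl -u nomad --since "1 hour ago" | grep -i csi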

I’d really appreciate any debugging help that anyone is able to lend :bowing_man: