Failure with NFS CSI volume: operation not permitted

jessebl · March 8, 2022, 2:21am

Noob here. I’m trying to use an NFS CSI volume in a job. However, when I run the job, the allocations always fail with the same error: failed to setup alloc: pre-run hook "csi_hook" failed: node plugin returned an internal error, check the plugin allocation logs for more information: rpc error: code = Internal desc = chmod /csi/per-alloc/a41f0226-d8f9-23c7-0c77-482cdbe290bc/test/rw-file-system-single-node-writer: operation not permitted

The volume and plugin allocations appear to be healthy as far as I can tell. But I’m not sure where to go from here. Any pointers? Details below.

Here are my specs.
test.job.nomad:

job "test" {
  datacenters = ["home"]

  group "alloc" {
    restart {
      attempts = 10
      interval = "5m"
      delay    = "25s"
      mode     = "delay"
    }

    volume "test" {
      type      = "csi"
      read_only = false
      source    = "test"
      attachment_mode = "file-system"
      access_mode     = "single-node-writer"
    }

    task "docker" {
      driver = "docker"

      volume_mount {
        volume      = "test"
        destination = "/srv"
        read_only   = false
      }

      config {
        image = "alpine"
        command = "sh"
        args = ["-c","touch /srv/test; while true; do sleep 10; ls /srv -la; done"]
      }
    }
  }
}

nfs-controller.job.nomad

# nfs-controller.job
variable "datacenters" {
  type        = list(string)
  description = "List of datacenters to deploy to."
  default     = ["home"]
}

job "plugin-nfs-controller" {
  datacenters = var.datacenters

  group "controller" {
    task "plugin" {
      driver = "docker"

      config {
        image = "mcr.microsoft.com/k8s/csi/nfs-csi:latest"

        args = [
          "--endpoint=unix://csi/csi.sock",
          "--nodeid=${attr.unique.hostname}",
          "--logtostderr",
          "-v=5",
        ]
      }

      csi_plugin {
        id        = "nfs"
        type      = "controller"
        mount_dir = "/csi"
      }

      resources {
        cpu    = 250
        memory = 128
      }
    }
  }
}

nfs-nodes.job.nomad

#nfs-nodes.job
variable "datacenters" {
  type        = list(string)
  description = "List of datacenters to deploy to."
  default     = ["home"]
}

job "plugin-nfs-nodes" {
  datacenters = var.datacenters

  type = "system"

  group "nodes" {
    task "plugin" {
      driver = "docker"

      config {
        image = "mcr.microsoft.com/k8s/csi/nfs-csi:latest"

        args = [
          "--endpoint=unix://csi/csi.sock",
          "--nodeid=${attr.unique.hostname}",
          "--logtostderr",
          "--v=5",
        ]

        privileged = true
      }

      csi_plugin {
        id        = "nfs"
        type      = "node"
        mount_dir = "/csi"
      }

      resources {
        cpu    = 250
        memory = 128
      }
    }
  }
}

test-volume.nomad

# nomad volume register prometheus-volume.nomad
type = "csi"
id = "test"
name = "test"
plugin_id = "nfs"

capability {
  access_mode = "single-node-reader-only"
  attachment_mode = "file-system"
}

capability {
  access_mode = "single-node-writer"
  attachment_mode = "file-system"
}

context {
  server = "192.168.1.27"
  share = "/tank/nomad/prometheus"
}

mount_options {
  fs_type = "nfs"
}

The only fishy thing that I (a noob) see is that the volume doesn’t report any values for access and attachment modes. But maybe that’s expected behavior since no active allocations are actually using it.

$ nomad volume status test
ID                   = test
Name                 = test
External ID          = <none>
Plugin ID            = nfs
Provider             = nfs.csi.k8s.io
Version              = v3.2.0
Schedulable          = true
Controllers Healthy  = 1
Controllers Expected = 1
Nodes Healthy        = 3
Nodes Expected       = 3
Access Mode          = <none>
Attachment Mode      = <none>
Mount Options        = <none>
Namespace            = default

Allocations
No allocations placed

One final thing–here is the full error message from the plugin-nfs-nodes allocation:



I0307 07:55:33.391293       1 nfs.go:63] Driver: nfs.csi.k8s.io version: v3.2.0
I0307 07:55:33.391579       1 nfs.go:112] 
DRIVER INFORMATION:
-------------------
Build Date: "2022-02-27T12:32:07Z"
Compiler: gc
Driver Name: nfs.csi.k8s.io
Driver Version: v3.2.0
Git Commit: 13a7de6b1998eba7b8891db7830002f27e02d920
Go Version: go1.17
Platform: linux/amd64

Streaming logs below:
I0307 07:55:33.391689       1 mount_linux.go:208] Detected OS without systemd
I0307 07:55:33.392145       1 server.go:117] Listening for connections on address: &net.UnixAddr{Name:"/csi/csi.sock", Net:"unix"}
I0307 07:55:33.435791       1 utils.go:86] GRPC call: /csi.v1.Identity/GetPluginInfo
I0307 07:55:33.435813       1 utils.go:87] GRPC request: {}
I0307 07:55:33.437189       1 utils.go:93] GRPC response: {"name":"nfs.csi.k8s.io","vendor_version":"v3.2.0"}
I0307 07:55:33.438886       1 utils.go:86] GRPC call: /csi.v1.Identity/GetPluginCapabilities
I0307 07:55:33.438900       1 utils.go:87] GRPC request: {}
I0307 07:55:33.438933       1 utils.go:93] GRPC response: {"capabilities":[{"Type":{"Service":{"type":1}}}]}
I0307 07:55:33.440159       1 utils.go:86] GRPC call: /csi.v1.Node/NodeGetInfo
I0307 07:55:33.440171       1 utils.go:87] GRPC request: {}
I0307 07:55:33.440200       1 utils.go:93] GRPC response: {"node_id":"nomad1"}
I0307 07:56:09.674948       1 utils.go:86] GRPC call: /csi.v1.Node/NodePublishVolume
I0307 07:56:09.674979       1 utils.go:87] GRPC request: {"target_path":"/csi/per-alloc/32222507-61a3-c2bc-2846-4d2f65aae762/prometheus/rw-file-system-single-node-writer","volume_capability":{"AccessType":{"Mount":{"fs_type":"nfs"}},"access_mode":{"mode":1}},"volume_context":{"server":"192.168.1.27","share":"/tank/nomad/prometheus"},"volume_id":"prometheus"}
I0307 07:56:09.675539       1 nodeserver.go:95] NodePublishVolume: volumeID(prometheus) source(192.168.1.27:/tank/nomad/prometheus) targetPath(/csi/per-alloc/32222507-61a3-c2bc-2846-4d2f65aae762/prometheus/rw-file-system-single-node-writer) mountflags([])
I0307 07:56:09.675573       1 mount_linux.go:183] Mounting cmd (mount) with arguments (-t nfs 192.168.1.27:/tank/nomad/prometheus /csi/per-alloc/32222507-61a3-c2bc-2846-4d2f65aae762/prometheus/rw-file-system-single-node-writer)
I0307 07:56:09.907363       1 nodeserver.go:107] volumeID(prometheus): mount targetPath(/csi/per-alloc/32222507-61a3-c2bc-2846-4d2f65aae762/prometheus/rw-file-system-single-node-writer) with permissions(0777)
E0307 07:56:09.908104       1 utils.go:91] GRPC error: rpc error: code = Internal desc = chmod /csi/per-alloc/32222507-61a3-c2bc-2846-4d2f65aae762/prometheus/rw-file-system-single-node-writer: operation not permitted
I0307 07:56:10.918048       1 utils.go:86] GRPC call: /csi.v1.Node/NodeUnpublishVolume
I0307 07:56:10.918072       1 utils.go:87] GRPC request: {"target_path":"/csi/per-alloc/32222507-61a3-c2bc-2846-4d2f65aae762/prometheus/rw-file-system-single-node-writer","volume_id":"prometheus"}
I0307 07:56:10.918160       1 nodeserver.go:136] NodeUnpublishVolume: CleanupMountPoint /csi/per-alloc/32222507-61a3-c2bc-2846-4d2f65aae762/prometheus/rw-file-system-single-node-writer on volumeID(prometheus)
I0307 07:56:10.918189       1 mount_helper_common.go:99] "/csi/per-alloc/32222507-61a3-c2bc-2846-4d2f65aae762/prometheus/rw-file-system-single-node-writer" is a mountpoint, unmounting
I0307 07:56:10.918999       1 mount_linux.go:294] Unmounting /csi/per-alloc/32222507-61a3-c2bc-2846-4d2f65aae762/prometheus/rw-file-system-single-node-writer
W0307 07:56:10.959934       1 mount_helper_common.go:133] Warning: "/csi/per-alloc/32222507-61a3-c2bc-2846-4d2f65aae762/prometheus/rw-file-system-single-node-writer" is not a mountpoint, deleting
I0307 07:56:10.960021       1 utils.go:93] GRPC response: {}
I0307 07:56:10.961993       1 utils.go:86] GRPC call: /csi.v1.Node/NodeUnpublishVolume
I0307 07:56:10.962008       1 utils.go:87] GRPC request: {"target_path":"/csi/per-alloc/32222507-61a3-c2bc-2846-4d2f65aae762/prometheus/rw--","volume_id":"prometheus"}
E0307 07:56:10.962078       1 utils.go:91] GRPC error: rpc error: code = NotFound desc = Targetpath not found

tgross · March 8, 2022, 2:04pm

Hi @jessebl!

The only fishy thing that I (a noob) see is that the volume doesn’t report any values for access and attachment modes. But maybe that’s expected behavior since no active allocations are actually using it.

Right, those modes are only set once the volume is claimed by an allocation.

As you’ve noted, the error you’re getting is bubbling up from the CSI plugin itself:

I0307 07:56:09.675573 1 mount_linux.go:183] Mounting cmd (mount) with arguments (-t nfs 192.168.1.27:/tank/nomad/prometheus /csi/per-alloc/32222507-61a3-c2bc-2846-4d2f65aae762/prometheus/rw-file-system-single-node-writer)
I0307 07:56:09.907363 1 nodeserver.go:107] volumeID(prometheus): mount targetPath(/csi/per-alloc/32222507-61a3-c2bc-2846-4d2f65aae762/prometheus/rw-file-system-single-node-writer) with permissions(0777)
E0307 07:56:09.908104 1 utils.go:91] GRPC error: rpc error: code = Internal desc = chmod /csi/per-alloc/32222507-61a3-c2bc-2846-4d2f65aae762/prometheus/rw-file-system-single-node-writer: operation not permitted

So that looks like the plugin is trying to chmod the path and can’t. Which is weird because the plugin is running with privileged = true, just as you’re supposed to do.

Some things to check:

Is the Nomad client itself running as root?
Does the parent directory have the correct permissions?
Does your docker daemon have user namespacing configured so that root in the plugin container isn’t root in the host?
Do you have selinux or similar running?

(Also, because we’re actively working on a bunch of CSI improvements and bug fixes, it might help to know which version of Nomad you are running.)

jessebl · March 9, 2022, 6:35pm

Hi @tgross, thanks so much for the quick response! Does the below help?

Also, I understand if you want to redirect me to the NFS CSI plugin folks if this is an issue on their end.

Is the Nomad client itself running as root?

Yes, it is–I’m running an Ubuntu Cloud (Focal 20.04.04) image straight out of the box with packages from HashiCorp’s repo.

ps aux | grep nomad | grep -v grep
root         628  1.4  4.6 1371304 94024 ?       Ssl  17:49   0:09 /usr/bin/nomad agent -config /etc/nomad.d
root        1075  0.1  1.0 1359048 21540 ?       Sl   17:49   0:01 /usr/bin/nomad logmon
root        1152  0.0  1.1 1285316 24156 ?       Sl   17:49   0:00 /usr/bin/nomad docker_logger
root        1725  0.1  1.0 1432780 21580 ?       Sl   17:59   0:00 /usr/bin/nomad logmon
root        1773  0.1  1.1 732028 23636 ?        Ssl  17:59   0:00 /nfsplugin --endpoint=unix://csi/csi.sock --nodeid=nomad1 --logtostderr --v=5
root        1817  0.0  1.0 1285316 21744 ?       Sl   17:59   0:00 /usr/bin/nomad docker_logger

Does the parent directory have the correct permissions?

This should be in an NFS node allocation, right? Permissions seem to be fine for root (which is of course root anyway).

# This is the parent directory of the path from the error message of the most recent failed allocation.
# ls -la /csi/per-alloc/d4287884-55fb-85a9-92d4-1b537be34bbc/test
total 8
drwx------ 2 root root 4096 Mar  9 18:11 .
drwx------ 3 root root 4096 Mar  9 18:11 ..

Does your docker daemon have user namespacing configured so that root in the plugin container isn’t root in the host?

I don’t believe so, but may be mistaken. I’ve certainly not manually configured anything like that, and I don’t see hints on my system that point to any of the options from the Docker docs on configuring userns.

Do you have selinux or similar running?

I’ve got AppArmor running fresh out of the box. Nothing stands out in the logs to my untrained eye. Stopping the AppArmor service and then resubmitting/reallocating a job to claim the volume still fails with the same error message.

jessebl · March 10, 2022, 12:43am

Forgot to mention that I’m running this version of Nomad: Nomad v1.2.6 (a6c6b475db5073e33885377b4a5c733e1161020c)

uduncanu · June 9, 2022, 3:27pm

I’ve been having the same issue, although I seem to have had success setting the mountPermissions parameter for the plugin to “0” based on the information in https://github.com/kubernetes-csi/csi-driver-nfs/blob/master/docs/driver-parameters.md

mounted folder permissions. The default is 0777, if set as 0, driver will not perform chmod after mount

In the example at the top of the thread, in test-volume.nomad in the context block, I added mountPermissions = "0"

That does seem to allow it to read and write to the volume from inside a container, so I’m not sure if the permissions there are actually doing anything.
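
For reference, here is a sketch of how test-volume.nomad from the top of the thread might look with that change applied. Everything except the mountPermissions line is copied from the original spec. Whether a given plugin version reads mountPermissions from the volume context is an assumption worth checking against the driver-parameters doc for the version you run.

```hcl
# test-volume.nomad (sketch): the spec from the top of the thread,
# with mountPermissions added to the context block.
type      = "csi"
id        = "test"
name      = "test"
plugin_id = "nfs"

capability {
  access_mode     = "single-node-reader-only"
  attachment_mode = "file-system"
}

capability {
  access_mode     = "single-node-writer"
  attachment_mode = "file-system"
}

context {
  server = "192.168.1.27"
  share  = "/tank/nomad/prometheus"

  # Per the csi-driver-nfs parameter description quoted above, "0" asks the
  # driver to skip the chmod after mount (the step that fails in the logs).
  # Assumption: the deployed plugin version honors this key in the context.
  mountPermissions = "0"
}

mount_options {
  fs_type = "nfs"
}
```

An already registered volume would presumably need to be deregistered and registered again (nomad volume deregister test, then nomad volume register test-volume.nomad) for the new context to take effect.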

thatsk · July 25, 2022, 7:44am

Use this example; it will solve your issue.

michimau · September 21, 2022, 8:56am

https://codeberg.org/in0rdr/nomad-csi-driver-nfs-example is 404!

thatsk · September 21, 2022, 12:11pm

Looks like it has been removed.

I created a new repo.

thatsk · September 21, 2022, 12:12pm

The only thing is that you can only register an existing NFS share as a volume; you cannot control NFS through this plugin.

matthias · March 23, 2023, 11:49pm

First of all, thank you for the example.

When I try it out on my cluster, controller and plugin are starting as expected.
Unfortunately, when I create the volume I get the following error message:

Error creating volume: Unexpected response code: 500 (rpc error: 1 error occurred:

controller create volume: CSI.ControllerCreateVolume: volume "test" snapshot source &{"" ""} is not compatible with these parameters: rpc error: code = InvalidArgument desc = server is a required parameter)

Any idea what is causing this?

thatsk · March 24, 2023, 1:00am

You don’t need to create the volume. Since we know the NFS export directory already exists, you only need to register the NFS CSI volume using nomad volume register volume.nomad.
If you look at my example, I did not create a volume; I registered one.

thatsk · March 24, 2023, 1:02am

Let me know if that doesn’t work.

matthias · March 24, 2023, 9:49am

Works like a charm, thank you so much!
