GCE Persistent Disk CSI: "csi_hook" failed: claim volumes: rpc error: Permission denied

Hi,
I ran through the example config here and encountered this error when deploying the mysql job:

Time                       Type           Description
2020-05-05T17:02:48-04:00  Setup Failure  failed to setup alloc: pre-run hook "csi_hook" failed: claim volumes: rpc error: Permission denied
2020-05-05T17:02:48-04:00  Received       Task received by client

I am wondering if you can help me identify where the missing permission is coming from?

client logs:

May  5 21:04:19 nmd-rpzq nomad[19194]:     2020-05-05T21:04:19.041Z [ERROR] client.rpc: error performing RPC to server: error="rpc error: Permission denied" rpc=CSIVolume.Claim server=<nomad_server_ip>:4647
May  5 21:04:19 nmd-rpzq nomad[19194]:     2020-05-05T21:04:19.041Z [ERROR] client.alloc_runner: prerun failed: alloc_id=d1ece247-eb45-1a71-f4b0-424db8701926 error="pre-run hook "csi_hook" failed: claim volumes: rpc error: Permission denied"

Service account role permissions:

title: "Google Compute Engine Persistent Disk CSI Driver Custom Roles"
description: Custom roles required for functions of the gcp-compute-persistent-disk-csi-driver
stage: ALPHA
includedPermissions:
- compute.instances.get
- compute.instances.attachDisk
- compute.instances.detachDisk
- compute.disks.get
- compute.disks.use
- iam.serviceAccounts.actAs

Hi @vincenthuynh!

It looks like this error is coming from Nomad itself and not GCP (although it’s possible the wording matches). Are you using ACLs in your Nomad cluster? If so, do you have the csi-mount-volume permission in your policy?

If you’re not using ACLs, I’d check the allocation logs for the CSI plugins to see if there’s more information available there.

Hey @tgross,
Thanks for the reply!

We do have ACLs enabled but the Anonymous policy is pretty wide-open in the environment we’re testing this in. We’re using the write policy which contains the csi-mount-volume permission.

namespace "*" {
  policy       = "write"
  capabilities = ["alloc-node-exec", "csi-register-plugin", "csi-list-volume", "csi-read-volume"]
}

agent {
  policy = "write"
}

operator {
  policy = "write"
}

quota {
  policy = "write"
}

node {
  policy = "write"
}

host_volume "*" {
  policy = "write"
}

Edit: I disabled ACLs and was able to get around this error. Please let me know if there’s something I’m missing in my policy or if I should log an issue.

Hm, that should be working for you. Yes, if you could open an issue that would be really helpful. Thanks!

Hey @tgross, I was able to get this working as well, by allowing csi-mount-volume into my Nomad anonymous policy. However, this doesn’t seem like something we should have to allow anonymously! Is there a way to properly lock this permission down without granting it to the anonymous policy?

Hi @holtwilkins! You need to set the permissions for the policy that you want to allow access. So you can set csi-mount-volume for anonymous or for whatever other more specific policy you’d like. The learn guide on ACLs shows how you might do this: https://learn.hashicorp.com/nomad/acls/create_policy

(The original issue here was that the plugin read policy wasn’t set. See https://github.com/hashicorp/nomad/issues/7927)
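To make that concrete, here is a minimal sketch of a dedicated (non-anonymous) policy that carries the two grants named in this thread: csi-mount-volume on the namespace and read on plugin. The "default" namespace is an assumption for illustration; the capability names are the ones already used in the policies posted here. Whether a token holding this policy is enough on its own for the volume claim is not confirmed in this thread.

```hcl
# Sketch only: the "default" namespace is an assumption, not taken from the thread.
namespace "default" {
  # Coarse-grained policy, as in the policies posted in this thread.
  policy = "write"

  # Listed explicitly so the CSI mount grant is visible in the policy file.
  capabilities = ["csi-mount-volume"]
}

# The grant reported as missing in the original report (see issue 7927 above).
plugin {
  policy = "read"
}
```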

Thanks @tgross. We’ve been using ACLs for years with no issues. I’m now trying to roll out CSI support, but I couldn’t find an example that shows how to configure the node and controller jobs when you’re using Nomad ACLs. I know how to create a custom Nomad ACL policy that will do this, but who do I grant this policy to so that my job doesn’t get this RPC permission denied error when it tries to run?

We definitely could use some better docs here, but in the meantime I can probably point you in the right direction with a bit more information. Where are you getting the permissions error? When you run the plugin job? When you register the volume? Or when you run a job that claims the volume?

Ah, hey @tgross, I didn’t get a notification from this, so didn’t realize you’d responded!

My main issue was that a job trying to claim the volume was failing to do so. IIRC there were errors in the plugin jobs. Once I added csi-mount-volume to my anonymous policy, the plugin jobs started working. If the plugin jobs need a specific Nomad ACL, what is that, and what’s the recommended way to set them (i.e. ephemeral Nomad tokens based on a role defined in Vault, or should this only be a static token so the plugin jobs are never bounced)?

Do the downstream jobs that need to claim volumes need any special Nomad ACL permissions, or is all they need handled by the permissions you set on the plugin jobs?

Additionally, as per https://github.com/hashicorp/nomad/issues/8057, I’m seeing permission denied in the logs of Nomad servers after all allocations using a volume have completed, preventing deregister of no-longer-in-use volumes. Seems like this is probably related to why Nomad forever thinks the volume is “in use” long after all its allocations are gone.

Note that I’ve upgraded this cluster to 0.11.3, and reproduced the issue with a brand-new registration.

2020-06-09T07:48:28.592Z [DEBUG] worker: dequeued evaluation: eval_id=6ca1671b-1782-905f-5a5e-3738777cdeac
2020-06-09T07:48:28.592Z [DEBUG] core.sched: forced job GC
2020-06-09T07:48:28.592Z [DEBUG] core.sched: forced eval GC
2020-06-09T07:48:28.592Z [DEBUG] core.sched: eval GC found eligibile objects: evals=6 allocs=0
2020-06-09T07:48:28.594Z [DEBUG] core.sched: forced deployment GC
2020-06-09T07:48:28.594Z [DEBUG] core.sched: forced plugin GC
2020-06-09T07:48:28.594Z [DEBUG] core.sched: CSI plugin GC scanning before cutoff index: index=18446744073709551615 csi_plugin_gc_threshold=1h0m0s
2020-06-09T07:48:28.594Z [ERROR] core.sched: failed to GC plugin: plugin_id=aws-ebs0 error="Permission denied"
2020-06-09T07:48:28.594Z [ERROR] worker: error invoking scheduler: error="failed to process evaluation: Permission denied"
2020-06-09T07:48:28.594Z [DEBUG] worker: nack evaluation: eval_id=6ca1671b-1782-905f-5a5e-3738777cdeac

That feels like the smoking gun here, so what permissions might be missing where?

@holtwilkins how can I edit the anonymous policy?

I have the same issue with my ACL setup. I’ve also created a new policy:

namespace "*" {
  policy = "write"
  capabilities = [
    "alloc-node-exec",
    "csi-register-plugin",
    "csi-list-volume",
    "csi-read-volume",
    "csi-write-volume",
    "csi-mount-volume",
    "read-logs",
    "read-fs"
  ]
}

agent {
  policy = "write"
}

operator {
  policy = "write"
}

quota {
  policy = "write"
}

node {
  policy = "write"
}

host_volume "*" {
  policy = "write"
}

plugin {
  policy = "read"
}

and when I create a job with this token I still get

failed to setup alloc: pre-run hook "csi_hook" failed: claim volumes: rpc error: rpc error: Permission denied

any thoughts?

anonymous is a reserved name for the anonymous policy, so you can update it by applying a policy under that name:

nomad acl policy apply -token=xxxxxx -description "Anonymous policy" anonymous anon.policy
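The contents of anon.policy are not shown in this thread. As a hedged sketch, a narrower anonymous policy than the wide-open one posted earlier might contain only the CSI-related grants discussed here. This is an assumption for illustration, not the file that was actually applied, and the thread does not confirm that this reduced set is sufficient.

```hcl
# anon.policy (hypothetical contents, not the file used in this thread)
namespace "*" {
  # Only the CSI volume capabilities; no blanket "write" policy.
  capabilities = [
    "csi-list-volume",
    "csi-read-volume",
    "csi-mount-volume",
  ]
}

# Read access to plugins, the grant identified as missing in issue 7927.
plugin {
  policy = "read"
}
```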

thanks @vincenthuynh - it’s so obvious

it works now :+1: