CSI controller fails with gRPC error

Hi all,

I’m playing around with Nomad in a home environment and running into issues with CSI.

I have a Synology DS220+ NAS in my network and would like to run all essential services there. Since it’s a “natural” single point of failure, I’m fine with my cluster going down if the NAS is down. All data resides there anyway.

When I try to deploy a CSI controller job to the Nomad client/server running on the NAS, the job gets killed after about half a minute. I tried a few NFS and SMB CSI plugins, but they all show basically the same behavior.
I found the following messages in the log:

Mar 26 13:52:49 storage nomad[17771]: 2023-03-26T13:52:49.687+0200 [WARN]  client.alloc_runner.task_runner.task_hook.api: error creating task api socket: alloc_id=4969158d-6045-297a-a770-89b47d94e21f task=synology-csi-plugin path=/volume1/homelab/nomad/var/lib/nomad/alloc/4969158d-6045-297a-a770-89b47d94e21f/synology-csi-plugin/secrets/api.sock error="listen unix /volume1/homelab/nomad/var/lib/nomad/alloc/4969158d-6045-297a-a770-89b47d94e21f/synology-csi-plugin/secrets/api.sock: bind: invalid argument"
Mar 26 13:53:41 storage nomad[17771]: 2023-03-26T13:53:41.634+0200 [ERROR] client.alloc_runner.task_runner.task_hook: killing task because plugin failed: alloc_id=4969158d-6045-297a-a770-89b47d94e21f task=synology-csi-plugin error="CSI plugin failed probe: timeout while connecting to gRPC socket: failed to stat socket: stat /volume1/homelab/nomad/var/lib/nomad/client/csi/plugins/4969158d-6045-297a-a770-89b47d94e21f/csi.sock: no such file or directory"
Mar 26 13:53:41 storage nomad[17771]: 2023-03-26T13:53:41.634+0200 [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=4969158d-6045-297a-a770-89b47d94e21f task=synology-csi-plugin type="Plugin became unhealthy" msg="Error: CSI plugin failed probe: timeout while connecting to gRPC socket: failed to stat socket: stat /volume1/homelab/nomad/var/lib/nomad/client/csi/plugins/4969158d-6045-297a-a770-89b47d94e21f/csi.sock: no such file or directory" failed=false
Mar 26 13:53:41 storage nomad[17771]: 2023-03-26T13:53:41.886+0200 [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=4969158d-6045-297a-a770-89b47d94e21f task=synology-csi-plugin type=Killing msg="CSI plugin did not become healthy before configured 30s health timeout" failed=true
Mar 26 13:53:47 storage nomad[17771]: 2023-03-26T13:53:47.890+0200 [ERROR] client.alloc_runner.task_runner.task_hook: failed to kill task: alloc_id=4969158d-6045-297a-a770-89b47d94e21f task=synology-csi-plugin kill_reason="CSI plugin failed probe: timeout while connecting to gRPC socket: failed to stat socket: stat /volume1/homelab/nomad/var/lib/nomad/client/csi/plugins/4969158d-6045-297a-a770-89b47d94e21f/csi.sock: no such file or directory" error="context canceled"

I think the important part is in the first message: “bind: invalid argument”. Looks to me like the CSI gRPC socket could not be created.
Any idea what might cause that error?
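
One thing that may be worth ruling out: on Linux, binding a Unix socket fails if the path does not fit into `sockaddr_un.sun_path`, which is 108 bytes including the trailing NUL, and some runtimes report this as "invalid argument". The `api.sock` path in the warning above is fairly long. A minimal sketch to check that, assuming the path length is the culprit (not confirmed, and the warning may be unrelated to the later probe timeout); `socket_path_fits` is a hypothetical helper, not part of Nomad:

```python
# Linux limits sockaddr_un.sun_path to 108 bytes, including the trailing NUL,
# so the longest usable filesystem socket path is 107 bytes.
SUN_PATH_MAX = 108


def socket_path_fits(path: str) -> bool:
    """Return True if path is short enough to bind as a Unix socket on Linux."""
    return len(path.encode("utf-8")) < SUN_PATH_MAX


# Path copied from the first log message above.
api_sock = (
    "/volume1/homelab/nomad/var/lib/nomad/alloc/"
    "4969158d-6045-297a-a770-89b47d94e21f/"
    "synology-csi-plugin/secrets/api.sock"
)

if __name__ == "__main__":
    # Print the byte length and whether it fits within the limit.
    print(len(api_sock.encode("utf-8")), socket_path_fits(api_sock))
```

If the check reports that the path does not fit, a shorter Nomad `data_dir` would be the obvious thing to try.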

The Syno is running a rather old version of Linux:
“Linux storage 4.4.180+ #42962 SMP Tue Jan 31 23:18:09 CST 2023 x86_64 GNU/Linux synology_geminilake_220+”
Nomad is already running as root, so it shouldn’t be a permission issue.

Any pointers greatly appreciated.

Hi,

I had a similar issue and solved it by increasing the health timeout in the csi_plugin configuration:

      csi_plugin {
        id             = "trident-csi"
        type           = "monolith"
        mount_dir      = "/csi"
        health_timeout = "150s"