CSI plugin health after reboot

wquist · May 25, 2021, 2:10am

Hi, I’ve been trying out CSI support on Nomad 1.1.0, but I’m running into an issue where an instance of the plugin becomes unhealthy after a reboot. I was initially using csi-driver-nfs, but I’ve most recently tried the hostpath demo from the Nomad repo to keep things as simple as possible. In both cases, I have not created or registered any volumes yet.

I’m running a small 3-node test cluster, where each node is both server and client. If I drain a node and then reboot it, the plugin starts again and becomes 3/3 healthy once the node is re-marked as eligible.

However, if I reboot without draining first, the plugin starts again (and shows as running), but it gets stuck at 2/3 healthy. Inside the problem container, I don’t see any errors, but there is less RPC activity than in a healthy one. Outside the container, I get logs like this:

[ERROR] client.hostpath-plugin0: failed to setup instance manager client: error="failed to open grpc connection to addr: /opt/nomad/data/client/csi/monolith/hostpath-plugin0/csi.sock, err: context deadline exceeded"

This seems to be the only error in the system logs, so I’m guessing that when the undrained node restarts, the client is unable to communicate with the plugin over the CSI socket, and the plugin instance is marked unhealthy. Once I’m in this state, I’m unsure how to get the node healthy again; do I have to restart the entire plugin job?
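
In case it helps with narrowing this down, here is a minimal sketch of how the socket itself could be probed from the host. It only checks whether something accepts a plain connection on the Unix socket path from the log above; it does not speak gRPC or CSI, so a "listening" result does not prove the plugin is healthy. The function name `probe_unix_socket` and the status labels are made up for illustration, and the idea that a leftover socket file from before the reboot might be involved is only an assumption on my part.

```python
import socket


def probe_unix_socket(path, timeout=2.0):
    """Try a plain connect() to a Unix socket and return a short status label.

    The labels are illustrative only (not Nomad or CSI terminology):
      "listening" - something accepted the connection
      "missing"   - no file exists at that path
      "stale"     - the socket file exists but nothing is listening on it
      "timeout"   - the connection attempt did not complete in time
    """
    sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    sock.settimeout(timeout)
    try:
        sock.connect(path)
        return "listening"
    except FileNotFoundError:
        return "missing"
    except ConnectionRefusedError:
        return "stale"
    except socket.timeout:
        return "timeout"
    except OSError as exc:
        # Anything else (permissions, path too long, ...) is reported as-is.
        return f"error: {exc}"
    finally:
        sock.close()


if __name__ == "__main__":
    # Path copied from the error message above.
    path = "/opt/nomad/data/client/csi/monolith/hostpath-plugin0/csi.sock"
    print(path, "->", probe_unix_socket(path))
```

Running this on a healthy node and on the unhealthy one should at least show whether the two sockets behave differently at the connection level.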

Do I just have to make sure the node is always drained, or is there some way to avoid this error? Thanks.
