Hi, I’ve been trying out CSI on Nomad 1.1.0, but I’m running into an issue where an instance of the plugin becomes unhealthy after its node reboots. I was initially using csi-driver-nfs, but most recently I’ve tried the hostpath demo from the nomad repo to keep things as simple as possible. In both cases, I have not created or registered any volumes yet.
I’m running a small 3-node test cluster, where each node is both server and client. If I drain a node and then reboot it, the plugin starts again and becomes 3/3 healthy once the node is re-marked as eligible.
However, if I reboot without draining first, the plugin starts again (and shows as running), but it stays stuck at 2/3 healthy. Inside the problem container, I don’t see any errors, but there is less RPC activity than in a healthy one. Outside the container, I get logs like this:
[ERROR] client.hostpath-plugin0: failed to setup instance manager client: error="failed to open grpc connection to addr: /opt/nomad/data/client/csi/monolith/hostpath-plugin0/csi.sock, err: context deadline exceeded"
This seems to be the only error in the system logs, so I’m guessing that when the undrained node restarts, it’s unable to communicate over the CSI socket and the plugin instance becomes unhealthy. Once I’m in this state, I’m unsure how to get the node healthy again; do I have to restart the entire plugin job?
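In case it helps anyone reproduce this, here is a minimal sketch of how the socket guess could be checked on the affected node. It is plain Python, not a Nomad tool: `probe_unix_socket` is a hypothetical helper name, the path is the one from the error log above, and it only tests whether something is listening on the socket file, not whether the plugin answers gRPC calls correctly.

```python
import os
import socket
import stat


def probe_unix_socket(path, timeout=2.0):
    """Return a short status string describing whether `path` accepts connections.

    Hypothetical helper for illustration; it does not speak gRPC or CSI.
    """
    if not os.path.exists(path):
        return "missing"
    if not stat.S_ISSOCK(os.stat(path).st_mode):
        return "not-a-socket"
    sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    sock.settimeout(timeout)
    try:
        sock.connect(path)
        return "accepting"
    except OSError as exc:
        # Covers refused connections (e.g. a leftover socket file with no
        # listener behind it) as well as timeouts.
        return "failed: " + str(exc)
    finally:
        sock.close()


if __name__ == "__main__":
    # Path copied from the error log above; adjust for your data_dir.
    print(probe_unix_socket(
        "/opt/nomad/data/client/csi/monolith/hostpath-plugin0/csi.sock"
    ))
```

If this reports "failed" on the rebooted node but "accepting" on a healthy one, that would at least be consistent with a socket file that has no plugin process listening behind it.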
Do I just have to make sure the node is always drained before a reboot, or is there some way to avoid this error? Thanks.