Hi!
Playing with the Linstor CSI plugin (but I guess it’d be the same with other plugins), I wonder how you manage claims for lost jobs. I declare my volumes with a “single-node-writer” access_mode, and have some tasks using them. Now, I simulate a complete failure of one of my Nomad client nodes, which was running some of these tasks (I just kill the VM). Nomad tries to relocate the lost tasks, but it fails because the claims on the volumes are still held by the (lost) tasks.
I cannot detach the volume because the node is gone; I get a 500 error.
The only way I found was to deregister the volume with the -force flag, and register it again. Only then Nomad could run the relocated tasks.
But this requires manual intervention, once per volume. How do you manage this? Is there a way for Nomad to release those claims after a timeout?
In my case, Linstor itself did notice that the volumes were not in use anymore, so it’d have been safe to release them.
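For reference, the per-volume workaround above (force-deregister, then register again) could be scripted along these lines. This is only a minimal sketch: the volume IDs and spec file paths are hypothetical placeholders, and it assumes each volume still has its specification file at hand. It defaults to a dry run that only builds the commands, since force-deregistering is only safe once you know the old node is really dead.

```python
import subprocess


def reregister_commands(volume_id: str, spec_file: str) -> list[list[str]]:
    """Return the two nomad CLI calls used to drop stale claims on one volume."""
    return [
        # Drop the volume, including the claims held by the lost allocations.
        ["nomad", "volume", "deregister", "-force", volume_id],
        # Register it again from its specification file.
        ["nomad", "volume", "register", spec_file],
    ]


def reregister_all(volumes: dict[str, str], dry_run: bool = True) -> list[list[str]]:
    """Build (and, if dry_run is False, run) the commands for every volume.

    `volumes` maps a volume ID to the path of its spec file.
    Only run this for real once the failed node is confirmed dead.
    """
    commands = []
    for volume_id, spec_file in volumes.items():
        for cmd in reregister_commands(volume_id, spec_file):
            commands.append(cmd)
            if not dry_run:
                subprocess.run(cmd, check=True)
    return commands


if __name__ == "__main__":
    # Hypothetical volume IDs and spec paths, for illustration only.
    example = {
        "mysql-data": "volumes/mysql-data.hcl",
        "redis-data": "volumes/redis-data.hcl",
    }
    for cmd in reregister_all(example):  # dry run: just print the commands
        print(" ".join(cmd))
```

It still needs someone to decide when to run it, so it only reduces the typing, not the manual step.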
I was in the same boat just a couple of days ago, with an unplanned outage. I’m using the Ceph CSI plugin and saw exactly the same behavior: Nomad was unable to reschedule the jobs because the volumes were still attached to the broken node.
I went with a full cluster reboot though, and that also fixed the unclaimed volumes error.
But I was also wondering whether that was expected behavior. I initially thought it was due to the fact that the broken node hosted the CSI controller plugin instance. But now it looks like it might be a more general problem.
Manual intervention seems sensible. You can’t know for sure if the node is stuck or disconnected from management network, or really dead.
Having a timeout setting doesn’t change that… Do you want your data corrupted now or after 120s?
My preference would be to run clients in VMs, and let the hypervisor do HA for host and VM. Short of that, I’d force-reboot the stuck node and force-detach the CSI volume only if I knew the job was 100% dead.
I understand the risk. But in my case, Linstor already has the protection and marks the volume as unused. Only the claim in Nomad prevents a clean failover of services. I already run my clients in VMs (in Proxmox VE, in an HA stack), so if a VM node is stuck, it’s already force-restarted. But that doesn’t solve the problem of claims, which stay held. A manual action might be the most sensible thing Nomad can do, but then an easier way to remove them would be welcome (something like a -force flag for detach).
I see how that happens if a VM is restarted and the container doesn’t get scheduled (so the CSI volume would no longer be used in that case). But this works thanks to Proxmox which resets unresponsive VMs, right?
If only the job app freezes while the VM itself continues to respond, wouldn’t you still need a way to restart the VM in order to release volume claim? Or is that something Linstor would detect (unresponsive container)? I know this is unrelated to the question - I’m just curious what happens in that scenario…
That sounds like a bug to me, but I’d like to see what the HashiCorp folks say.
Maybe create a GitHub issue?
Unable to detach for $reasons is one problem, but unable to understand the response doesn’t sound like a good reason to me.
Edit: could you make Nomad retry longer or use longer periods between retries, so that the failed VM has time to come back online before last try?
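To get a feel for how much time different retry settings would give the failed VM, here is a rough sketch that models the wait between reschedule attempts. The parameter names mirror my understanding of the job `reschedule` block (`attempts`, `delay`, `max_delay`, `delay_function`); the function itself is purely illustrative and is not Nomad's actual scheduling code.

```python
def reschedule_delays(attempts: int, delay_s: int, max_delay_s: int,
                      delay_function: str = "exponential") -> list[int]:
    """Approximate the wait (in seconds) before each reschedule attempt.

    "constant" waits delay_s every time; "exponential" doubles the wait
    after each attempt and caps it at max_delay_s.
    """
    if delay_function not in ("constant", "exponential"):
        raise ValueError(f"unsupported delay_function: {delay_function}")
    delays = []
    for attempt in range(attempts):
        if delay_function == "constant":
            delays.append(delay_s)
        else:
            delays.append(min(delay_s * 2 ** attempt, max_delay_s))
    return delays


def total_wait(delays: list[int]) -> int:
    """Total time covered by all attempts, ignoring how long each attempt runs."""
    return sum(delays)


if __name__ == "__main__":
    # Example values chosen for illustration only.
    delays = reschedule_delays(attempts=5, delay_s=30, max_delay_s=300)
    print(delays, total_wait(delays))
```

With those example values the last attempt lands about twelve and a half minutes after the first failure, which may or may not be enough for a VM to be restarted. It would not help at all if the node never comes back.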