CSI volume releasing problem

We’ve been using Nomad on some Hetzner Cloud instances for a while now and recently started dabbling with CSI to get cloud volumes mounted. However, often when restarting a job (due to changes), we’ll see things like this:

CSI volume redacted-volume has exhausted its available writer claims and is claimed by a garbage collected allocation redacted-allocation-id; waiting for claim to be released

And then it just sits there spinning its little wheels. The question is: if the Nomad server knows that the allocation that used to claim the volume has been garbage collected, why can’t it just release the claim and move on, instead of waiting for what currently seems like forever?

There also does not seem to be a way to force-release a volume; nomad volume detach doesn’t work and complains about “unknown alloc id” (which makes sense, because the allocation no longer shows in the list of allocations, on account of it having been gc’d already).

The workaround I use now is to force-deregister the volume and re-register it; after that things are fine, but I can’t keep doing that, so any ideas from anyone on how to solve this issue? :smiley:
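For reference, the workaround above boils down to the following two commands (a sketch; the volume ID and the volume.hcl spec file are placeholders for your own values):

```shell
# Force-deregister the volume so Nomad drops the stale claim.
# -force is required because Nomad still believes the volume is claimed.
nomad volume deregister -force redacted-volume

# Re-register the volume from its original specification file,
# after which new allocations can claim it again.
nomad volume register volume.hcl
```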

Hi @benvanstaveren,

What version of Nomad are you currently running? I took a look through the Nomad changelog, PRs, and issues and found #14484 and #14675, which fix similar-sounding behaviour. Both these fixes are available within Nomad v1.4.0, v1.3.6, and v1.2.13.

jrasell and the Nomad team

I’m running 1.4.1 on the servers in the cluster and on the agents in question :slight_smile: This erroneous behaviour also happens when stopping a job: the job itself enters the ‘dead’ state (or disappears entirely when -purge is used), but the volume still shows 1 allocation, although by then it is in the completed state. Oddly enough, the volume still shows that allocation even after the job has been purged, a system gc doesn’t get rid of it either, and then it basically proceeds as described in my initial post. Also, when submitting an updated job, the “old” allocation is put in the completed state while the new allocation remains pending.

For completeness’ sake, the volumes are all set to single-node-writer access mode and are attached as filesystems.
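For anyone comparing setups, a volume registered this way looks roughly like the following (a sketch; the volume ID, plugin ID, and external ID are hypothetical placeholders):

```hcl
# volume.hcl — registered with: nomad volume register volume.hcl
id          = "redacted-volume"
name        = "redacted-volume"
type        = "csi"
plugin_id   = "hcloud-csi"   # hypothetical plugin ID
external_id = "12345678"     # hypothetical cloud volume ID

capability {
  access_mode     = "single-node-writer"
  attachment_mode = "file-system"
}
```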

I’m also seeing this issue running v1.4.3, using the DigitalOcean CSI plugin.

access_mode     = "single-node-writer"
attachment_mode = "file-system"
per_alloc       = false
read_only       = false
type            = "csi"

Constraint CSI volume xxxxxxx has exhausted its available writer claims and is claimed by a garbage collected allocation zzzzzzzz; waiting for claim to be released

As with others, it works if the volume is force de-registered and re-registered.