Update stateful jobs with (hetzner) CSI volumes

chebelom · February 25, 2021, 3:34pm

Dear Nomad team,
first of all, thank you for all your hard work; Nomad is an amazing product.

We are running Nomad v1.0.3, Consul v1.9.3, and docker-ce 20.10.1 on Ubuntu 18.04.

We are using CSI volumes via the Hetzner Cloud CSI driver (GitHub: hetznercloud/csi-driver, the Kubernetes Container Storage Interface driver for Hetzner Cloud Volumes), deployed as a monolith plugin with single-writer volumes and running as a system job, and it works great.

I am trying to understand how to manage the update of these jobs, but I am clearly missing something, because I cannot make it work properly.

Since these volumes only support a single writer, I set max_parallel to 0 in the update (and migrate) stanzas, so I expect the running job to be killed and the new one to be spawned right after.
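For readers following along, the setup described above might look roughly like the sketch below. This is not the original job file: the job, group, task, volume, and image names are hypothetical placeholders, and only the max_parallel values and the single-writer CSI volume come from the post.

```hcl
# Hypothetical sketch; all names below are placeholders, not the poster's job.
job "stateful-app" {
  datacenters = ["dc1"]

  group "app" {
    count = 1

    # Values as described in the post.
    update {
      max_parallel = 0
    }

    migrate {
      max_parallel = 0
    }

    # Claim on the CSI volume; "app-data" stands in for the registered volume ID.
    # Assumption: newer Nomad releases (1.1 and later) also expect access_mode
    # and attachment_mode in this block.
    volume "data" {
      type      = "csi"
      source    = "app-data"
      read_only = false
    }

    task "app" {
      driver = "docker"

      config {
        image = "example/app:latest"
      }

      # Mount the claimed volume read-write inside the container.
      volume_mount {
        volume      = "data"
        destination = "/data"
        read_only   = false
      }
    }
  }
}
```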

Nomad handles the lifecycle as expected, but then I run into one of the following two situations:

The first one, which resolves by itself:

The job fails once or twice (with the following error), but then it starts and works fine:

failed to setup alloc: pre-run hook "csi_hook" failed: claim volumes: rpc error: controller publish: attach volume: controller attach volume: rpc error: code = Unavailable desc = failed to publish volume: server is locked

The second one, which requires manual intervention:

The job keeps restarting because the volume is mounted read-only (RO) inside the container:

Terminated Exit Code: 2, Exit Message: "Docker container exited with non-zero exit code: 2" or equivalent.
To fix this I need to stop the job, wait a bit, and then run the job again.

I cannot provide any useful logs from the plugin, since nothing significant is printed (it is running with loglevel debug).

Is this something that you can help me understand?

Thank you

Andrea

resmo · May 10, 2021, 8:35am

Hi Andrea

Sorry, but I cannot help yet; I am about to use the same stack in the same environment (Hetzner Cloud). I would probably create an issue in the GitHub issue tracker of the Hetzner CSI plugin.

I’ll let you know if I run into the same issue.

René

benvanstaveren · May 11, 2021, 8:42am

I have a similar issue: it seems a claimed hcloud volume is never released, even if the job itself has been stopped/purged. You have to force-deregister the volume and then re-register it, and then it works again.

Could be related, I guess…
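As a rough illustration of that workaround, re-registering needs a volume specification file along the lines of the sketch below. Every value here is a hypothetical placeholder (volume ID, Hetzner external ID, plugin ID), and the flat access_mode/attachment_mode fields assume a Nomad 1.0.x-style spec; later releases move these into a capability block.

```hcl
# volume.hcl: hypothetical volume specification; all IDs are placeholders.
#
# Workaround sequence described above, using the Nomad CLI:
#   nomad volume deregister -force app-data
#   nomad volume register volume.hcl

id          = "app-data"
name        = "app-data"
type        = "csi"
external_id = "1234567"      # placeholder for the Hetzner Cloud volume ID
plugin_id   = "hcloud-csi"   # placeholder; must match the running plugin's ID

# Single-writer volume mounted as a filesystem (Nomad 1.0.x-style fields).
access_mode     = "single-node-writer"
attachment_mode = "file-system"
```

Note that a forced deregister only drops claims held by terminal allocations, so the job using the volume has to be stopped first.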

rymg19 · June 20, 2023, 12:56am

Hate to necrobump this, but I’ve been trying to use the hcloud CSI plugin, and I think I’m running into a bunch of different issues, one of them the same as this. Were you ever able to find a solution, or at least a more stable workaround than the force-deregister?

chebelom · June 21, 2023, 1:47pm

Hey @rymg19, I can’t help, I’m sorry.
I gave up after a while and used host volumes instead.

benvanstaveren · June 23, 2023, 8:03am

We upgraded to a newer Nomad version (I believe 1.5.1), and that solved the issue we had been having.
