Update stateful jobs with (hetzner) CSI volumes

Dear Nomad team,
first of all, thank you for all your hard work. Nomad is an amazing product.

We are working with Nomad v1.0.3, Consul v1.9.3, and docker-ce 20.10.1 on Ubuntu 18.04.

We are using CSI volumes via the Hetzner Cloud CSI driver (hetznercloud/csi-driver on GitHub), deployed as a monolith plugin with single-writer volumes and running as a system job, and it works great.
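For context, a single-writer volume of this kind would typically be registered with a volume specification along the lines of the sketch below. This is only an illustration of the setup described above, not the actual file in use: the `id`, `name`, `external_id`, and `plugin_id` values are placeholders (the plugin ID is whatever the plugin job's `csi_plugin` stanza declares).

```hcl
# Hypothetical volume spec for `nomad volume register`.
# All identifiers below are placeholders, not values from this thread.
id          = "app-data"
name        = "app-data"
type        = "csi"
external_id = "1234567"           # assumed: the Hetzner Cloud volume ID
plugin_id   = "csi.hetzner.cloud" # assumed: must match the csi_plugin id of the plugin job

# Single writer on a single node, mounted as a filesystem.
access_mode     = "single-node-writer"
attachment_mode = "file-system"
```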

I am trying to understand how to manage the update of these jobs, but I am clearly missing something, because I cannot make it work properly.

Since these volumes are single-writer, I set max_parallel to 0 in the update (and migrate) stanzas, so I expect the running job to be killed and the new one to be spawned right after.
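A minimal sketch of the job layout described above might look like the following. The job, group, task, volume, and image names are hypothetical, and the `max_parallel = 0` values simply mirror what is stated in the post; whether that setting produces the intended stop-then-start behaviour is exactly the open question here.

```hcl
# Hypothetical job file; only the update/migrate settings come from the post.
job "stateful-app" {
  datacenters = ["dc1"]

  update {
    max_parallel = 0 # as described in the post
  }

  group "app" {
    migrate {
      max_parallel = 0 # as described in the post
    }

    # Claim the registered CSI volume for this group.
    volume "data" {
      type      = "csi"
      source    = "app-data" # placeholder: ID of the registered volume
      read_only = false
    }

    task "app" {
      driver = "docker"

      config {
        image = "example/app:latest" # placeholder image
      }

      # Mount the claimed volume read-write inside the container.
      volume_mount {
        volume      = "data"
        destination = "/data"
        read_only   = false
      }
    }
  }
}
```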

Nomad handles the lifecycle as expected, but then I run into one of the following two situations:

  • the first one, which resolves by itself

The job fails one or two times with the following error, but then it starts and works fine:

failed to setup alloc: pre-run hook "csi_hook" failed: claim volumes: rpc error: controller publish: attach volume: controller attach volume: rpc error: code = Unavailable desc = failed to publish volume: server is locked

  • the second one, which requires manual intervention

The job keeps restarting because the volume is mounted in read-only (RO) mode inside the container:

Terminated Exit Code: 2, Exit Message: "Docker container exited with non-zero exit code: 2" or equivalent

To fix this I need to stop the job, wait a bit, and then run the job again.

I cannot provide any useful logs from the plugin, since nothing significant is printed (it is running with log level debug).

Is this something that you can help me understand?

Thank you

Andrea

Hi Andrea

Sorry, but I cannot help yet. I am about to use the same stack in the same environment (Hetzner Cloud). I would probably create an issue in the GitHub issue tracker of the Hetzner CSI plugin.

I’ll let you know if I run into the same issue.

René

I have a similar issue: a claimed hcloud volume never seems to be released, even after the job itself has been stopped/purged. You have to force-deregister the volume and then re-register it, after which it works again.

Could be related, I guess…