Dear Nomad team,
First of all, thank you for all your hard work; Nomad is an amazing product.
We are running Nomad v1.0.3, Consul v1.9.3, and Docker CE 20.10.1 on Ubuntu 18.04.
We are using CSI volumes through the hetznercloud/csi-driver plugin (the Container Storage Interface driver for Hetzner Cloud Volumes), deployed as a monolith in a system job, with single-writer volumes, and it works great.
I am trying to understand how to manage updates of the jobs that use these volumes, but I am clearly missing something, because I cannot make it work properly.
Since these volumes are single-writer, I set max_parallel = 0 in the update (and migrate) stanzas, so I expect the old allocation to be killed and the new one to be spawned right after.
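For reference, the relevant part of the job spec looks roughly like this (the job name and everything outside the two stanzas is illustrative):

```hcl
job "my-app" {
  datacenters = ["dc1"]

  # Prevent old and new allocations from overlapping,
  # since the CSI volume only allows a single writer.
  update {
    max_parallel = 0
  }

  migrate {
    max_parallel = 0
  }

  # ... group, task, and volume stanzas omitted ...
}
```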
Nomad handles the lifecycle as expected, but then I run into the following two situations:
- the first one, which resolves by itself:
The job fails once or twice with the following error, but then it starts and works fine:
failed to setup alloc: pre-run hook "csi_hook" failed: claim volumes: rpc error: controller publish: attach volume: controller attach volume: rpc error: code = Unavailable desc = failed to publish volume: server is locked
- the second one, which requires manual intervention:
The job keeps restarting, because the volume is mounted read-only inside the container:
Terminated Exit Code: 2, Exit Message: "Docker container exited with non-zero exit code: 2"
or an equivalent error.
To fix this, I have to stop the job, wait a bit, and then run it again.
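Concretely, the manual workaround looks like this (the job name is illustrative, and the wait time is just what has worked for me):

```
nomad job stop my-app
# wait a bit for the volume claim to be released
sleep 30
nomad job run my-app.nomad
```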
I cannot provide any useful logs from the plugin, since nothing significant is printed (it is running with log level debug).
Is this something you can help me understand?
Thank you
Andrea