Hi! I'm quite new to Nomad, so pardon any misnomers. Anyway, I'm trying to get csi-driver-nfs working on Nomad. I have these files in a gist, plugin-nfs-controller.nomad and plugin-nfs-nodes.nomad, and when I run the jobs I get these plugins:
Container Storage Interface
ID   Provider        Controllers Healthy/Expected  Nodes Healthy/Expected
id   nfs.csi.k8s.io  0/0                           0/1
nfs  nfs.csi.k8s.io  1/1                           1/1
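For reference, the node plugin job boils down to roughly the sketch below. It's only a simplified sketch: the image name is a placeholder for my custom-built container, and the args are just how the upstream csi-driver-nfs is usually invoked; the real files are in the gist.

```hcl
# plugin-nfs-nodes.nomad (simplified sketch; image name is a placeholder)
job "plugin-nfs-nodes" {
  datacenters = ["dc1"]

  # run the node plugin on every client
  type = "system"

  group "nodes" {
    task "plugin" {
      driver = "docker"

      config {
        # custom-built csi-driver-nfs image
        image = "registry.example/csi-driver-nfs:latest"

        args = [
          "--endpoint=unix://csi/csi.sock",
          "--nodeid=${attr.unique.hostname}",
          "-v=5",
        ]

        # node plugins need to perform mounts on the host
        privileged = true
      }

      # this is what registers the plugin with Nomad under the "nfs" ID
      csi_plugin {
        id        = "nfs"
        type      = "node"
        mount_dir = "/csi"
      }
    }
  }
}
```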
Now, even when I purge the jobs, that id plugin doesn't go away; I have no clue where it comes from. Also, if I then add a volume with gitea-volume.hcl and then run gitea.nomad, the job gets stuck in pending. The container used for csi-driver-nfs is custom-made, but it should be fully functional: I checked the logs, no errors are reported, and it does mount (back when I didn't have the mount executable, it failed). The source for the CSI container is linked, if anyone's interested.
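For completeness, the volume registration and the claim in the Gitea job look roughly like the sketches below. The server address, share path, image, and mount destination are placeholders, and the context keys just follow the upstream csi-driver-nfs parameters (server/share) as far as I can tell; the real gitea-volume.hcl and gitea.nomad are in the gist.

```hcl
# gitea-volume.hcl — registered with `nomad volume register gitea-volume.hcl`
id              = "gitea"
name            = "gitea"
type            = "csi"
external_id     = "gitea"
plugin_id       = "nfs"
access_mode     = "single-node-writer"
attachment_mode = "file-system"  # I later switched this to "block-device" (see the update below)

# csi-driver-nfs reads the NFS server and export from the volume context
context {
  server = "nfs.example.internal"
  share  = "/export/gitea"
}
```

```hcl
# gitea.nomad — the part that claims the volume
job "gitea" {
  datacenters = ["dc1"]

  group "gitea" {
    volume "gitea" {
      type      = "csi"
      source    = "gitea"  # must match the registered volume ID
      read_only = false
    }

    task "gitea" {
      driver = "docker"

      config {
        image = "gitea/gitea:latest"
      }

      volume_mount {
        volume      = "gitea"
        destination = "/data"
      }
    }
  }
}
```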
I managed to get some logs from Nomad, but the Gitea job won't start now because apparently the gitea volume is in use; I have no clue by what.
After completely annihilating Nomad's state, the id CSI plugin is gone and hasn't come back. As to why the job gets stuck in pending, I do have a hypothesis: since the images created from Nix are really plain and the folder I want to mount to doesn't exist, maybe it tries to mount to it and gets stuck for some obscure reason. I'll test that.
EDIT: that didn't work. The id plugin is still gone, but Gitea is still pending.
OK, progress update: after fixing Consul and changing attachment_mode to block-device, Nomad no longer freezes, which is good, but the containers never actually get created in Docker and I can't find any mention of this failure in Docker's log. Nomad only says Exit Code: 1, Exit Message: "Docker container exited with non-zero exit code: 1", which is not helpful.
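Concretely, the change was just this in the volume registration (an excerpt of the sketch from earlier):

```hcl
# gitea-volume.hcl (excerpt)
access_mode     = "single-node-writer"
attachment_mode = "block-device"  # previously "file-system", the only other option
```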
Another progress update:
I feel like I'm getting somewhere. /var/lib/nomad/client/csi/node/nfs/per-alloc/01e2e644-6431-2fb4-79fc-dc8090a7e391/gitea/rw-block-device-single-node-writer is a block device created by csi-driver-nfs, and I verified that it does in fact get created, which means that's not the issue. Then I captured what goes through the Docker socket (docker.log), and my working hypothesis is that, as we can see there, the mount type is bind even though it shouldn't be. I don't know where that comes from, but it definitely seems incorrect.
That’s what I’d expect given the unhealthy plugin: if the plugin isn’t healthy then any job that tries to claim a volume with that plugin will be stuck in pending. There’s not really any way around that because otherwise we’d be creating the task without its volumes.
If you want more help in debugging this, it would really help to provide the jobspec for the job that’s claiming the volume and the output of nomad alloc status :alloc_id for that job’s allocations.
The job spec is in the gist; here's a more up-to-date gist reflecting the current config.
ID = 847463cf-36ad-2dd3-5978-a01d223e581c
Eval ID = bd19306f
Name = gitea.gitea[0]
Node ID = 9d419a8b
Node Name = blowhole
Job ID = gitea
Job Version = 0
Client Status = failed
Client Description = Failed tasks
Desired Status = run
Desired Description = <none>
Created = 46m59s ago
Modified = 45m58s ago
Deployment ID = 40c89881
Deployment Health = unhealthy
Allocation Addresses
Label Dynamic Address
*http yes 10.64.1.201:8666 -> 3000
Task "gitea" is "dead"
Task Resources
CPU Memory Disk Addresses
250 MHz 1.0 GiB 300 MiB
CSI Volumes:
ID Read Only
gitea false
Task Events:
Started At = 2021-01-13T13:35:08Z
Finished At = 2021-01-13T13:35:10Z
Total Restarts = 2
Last Restart = 2021-01-13T14:34:47+01:00
Recent Events:
Time Type Description
2021-01-13T14:35:10+01:00 Alloc Unhealthy Unhealthy because of failed task
2021-01-13T14:35:10+01:00 Not Restarting Exceeded allowed attempts 2 in interval 30m0s and mode is "fail"
2021-01-13T14:35:09+01:00 Terminated Exit Code: 1, Exit Message: "Docker container exited with non-zero exit code: 1"
2021-01-13T14:35:08+01:00 Started Task started by client
2021-01-13T14:34:47+01:00 Restarting Task restarting in 15.337483772s
2021-01-13T14:34:46+01:00 Terminated Exit Code: 1, Exit Message: "Docker container exited with non-zero exit code: 1"
2021-01-13T14:34:45+01:00 Started Task started by client
2021-01-13T14:34:21+01:00 Restarting Task restarting in 18.168207303s
2021-01-13T14:34:20+01:00 Terminated Exit Code: 1, Exit Message: "Docker container exited with non-zero exit code: 1"
2021-01-13T14:34:19+01:00 Started Task started by client
And that's the most recent alloc that failed on me.
OK, so it seems like it's an issue in my container; my mistake. Hope I wasn't too much of a bother. Anyway, thank you so much for helping me with this little situation. And by the way, Nomad is really well made; since I dug around its insides a bit during this debugging session, I'm even more confident in that statement!
Is there something like a "close issue" option on this forum?