NFS on Nomad via CSI and csi-driver-nfs

Gist
Hi! I'm quite new to Nomad, so pardon any misnomers :slight_smile: Anyway, I'm trying to get csi-driver-nfs working on Nomad. I have these files, plugin-nfs-controller.nomad and plugin-nfs-nodes.nomad, and when I run the jobs I get these plugins:

Container Storage Interface
ID   Provider        Controllers Healthy/Expected  Nodes Healthy/Expected
id   nfs.csi.k8s.io  0/0                           0/1
nfs  nfs.csi.k8s.io  1/1                           1/1
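
For anyone who can't open the gist, the node plugin job is essentially a docker task with a csi_plugin stanza. Here's a rough sketch of that shape; the image, the plugin flags, and the resource numbers are generic placeholders rather than my actual Nix-built setup:

job "plugin-nfs-nodes" {
  datacenters = ["dc1"]
  type        = "system"

  group "nodes" {
    task "plugin" {
      driver = "docker"

      config {
        # placeholder image; the real job uses a custom Nix-built container
        image      = "nfsplugin:latest"
        privileged = true

        args = [
          "--endpoint=unix:///csi/csi.sock",
          "--nodeid=${node.unique.name}",
          "--v=5",
        ]
      }

      # registers this task with Nomad as the CSI node plugin "nfs"
      csi_plugin {
        id        = "nfs"
        type      = "node"
        mount_dir = "/csi"
      }

      resources {
        cpu    = 100
        memory = 64
      }
    }
  }
}

The controller job is the same idea, just with type = "controller" in the csi_plugin stanza.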

Now, even when I purge the jobs, the id plugin doesn't go away, and I have no clue where it comes from. Also, if I then add a volume with gitea-volume.hcl and run gitea.nomad, the job gets stuck in pending. The container used for csi-driver-nfs is custom made, but it should be fully functional: I checked the logs, no errors are reported, and it does mount (back when the image was missing the mount executable, it failed).
Source for the CSI container if anyone’s interested
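
For reference, gitea-volume.hcl is just an ordinary CSI volume registration. A trimmed-down sketch of the shape it takes; the plugin id matches the one above, while the NFS server address and export path are placeholders:

# registered with: nomad volume register gitea-volume.hcl
id          = "gitea"
name        = "gitea"
type        = "csi"
external_id = "gitea"
plugin_id   = "nfs"

access_mode     = "single-node-writer"
attachment_mode = "file-system"

mount_options {
  fs_type = "nfs"
}

# csi-driver-nfs finds the share through the volume context
context {
  server = "nfs.example.internal"  # placeholder
  share  = "/export/gitea"         # placeholder
}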

I managed to get some logs from Nomad, but the Gitea job won't start now because apparently the gitea volume is in use; I have no clue by what.

After completely annihilating Nomad's state, the id CSI plugin is gone and hasn't come back. As to why the job gets stuck in pending, I do have a hypothesis: maybe, since the images created from Nix are really plain and the folder I want to mount to doesn't exist, it tries to mount into it and gets stuck for some obscure reason. I'll test that.
EDIT: that didn't work; the id plugin is still gone, but Gitea is still pending.

OK, progress update: after fixing Consul and changing attachment_mode to block-device, Nomad no longer freezes, which is good, but the containers never actually get created in Docker and I can't find any mention of this failure in Docker's log. Nomad only says Exit Code: 1, Exit Message: "Docker container exited with non-zero exit code: 1", which is not helpful.
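
For context, the job side of the claim is just a volume block in the group plus a volume_mount in the task. Roughly like this; the image tag and mount destination are placeholders rather than what's actually in the gist:

job "gitea" {
  datacenters = ["dc1"]

  group "gitea" {
    # claim the registered CSI volume for this group
    volume "gitea" {
      type      = "csi"
      source    = "gitea"
      read_only = false
    }

    task "gitea" {
      driver = "docker"

      config {
        image = "gitea/gitea:latest"  # placeholder; mine is Nix-built
      }

      # mount the claimed volume into the container
      volume_mount {
        volume      = "gitea"
        destination = "/data"  # placeholder path
        read_only   = false
      }
    }
  }
}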

Another progress update:
I feel like I'm getting somewhere. /var/lib/nomad/client/csi/node/nfs/per-alloc/01e2e644-6431-2fb4-79fc-dc8090a7e391/gitea/rw-block-device-single-node-writer is a block device created by csi-driver-nfs, and I verified that it does in fact get created, so that's not the issue. Then I captured what goes through the Docker socket and got this docker.log. My working hypothesis is that, as the capture shows, the mount type is bind even though it shouldn't be; I don't know where that comes from, but it definitely seems incorrect.

That’s what I’d expect given the unhealthy plugin: if the plugin isn’t healthy then any job that tries to claim a volume with that plugin will be stuck in pending. There’s not really any way around that because otherwise we’d be creating the task without its volumes.

If you want more help in debugging this, it would really help to provide the jobspec for the job that’s claiming the volume and the output of nomad alloc status :alloc_id for that job’s allocations.

The job spec is in the gist; here's a more up-to-date gist reflecting the current config. Below is the nomad alloc status output:

ID                  = 847463cf-36ad-2dd3-5978-a01d223e581c
Eval ID             = bd19306f
Name                = gitea.gitea[0]
Node ID             = 9d419a8b
Node Name           = blowhole
Job ID              = gitea
Job Version         = 0
Client Status       = failed
Client Description  = Failed tasks
Desired Status      = run
Desired Description = <none>
Created             = 46m59s ago
Modified            = 45m58s ago
Deployment ID       = 40c89881
Deployment Health   = unhealthy

Allocation Addresses
Label  Dynamic  Address
*http  yes      10.64.1.201:8666 -> 3000

Task "gitea" is "dead"
Task Resources
CPU      Memory   Disk     Addresses
250 MHz  1.0 GiB  300 MiB  

CSI Volumes:
ID     Read Only
gitea  false

Task Events:
Started At     = 2021-01-13T13:35:08Z
Finished At    = 2021-01-13T13:35:10Z
Total Restarts = 2
Last Restart   = 2021-01-13T14:34:47+01:00

Recent Events:
Time                       Type             Description
2021-01-13T14:35:10+01:00  Alloc Unhealthy  Unhealthy because of failed task
2021-01-13T14:35:10+01:00  Not Restarting   Exceeded allowed attempts 2 in interval 30m0s and mode is "fail"
2021-01-13T14:35:09+01:00  Terminated       Exit Code: 1, Exit Message: "Docker container exited with non-zero exit code: 1"
2021-01-13T14:35:08+01:00  Started          Task started by client
2021-01-13T14:34:47+01:00  Restarting       Task restarting in 15.337483772s
2021-01-13T14:34:46+01:00  Terminated       Exit Code: 1, Exit Message: "Docker container exited with non-zero exit code: 1"
2021-01-13T14:34:45+01:00  Started          Task started by client
2021-01-13T14:34:21+01:00  Restarting       Task restarting in 18.168207303s
2021-01-13T14:34:20+01:00  Terminated       Exit Code: 1, Exit Message: "Docker container exited with non-zero exit code: 1"
2021-01-13T14:34:19+01:00  Started          Task started by client

and that’s the most recent alloc which failed on me

It looks like the task is failing and not the CSI attachment. Two things to check:

  • Verify that the CSI attachment has worked by looking at the CSI plugin’s allocation logs (don’t forget to check -stderr too!)
  • See what the allocation logs for the gitea job say the problem is (rough example commands below)
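
In command form, that's roughly the following; the allocation IDs are placeholders and the plugin task name depends on your jobspec:

# CSI node plugin logs (most CSI drivers log to stderr)
nomad alloc logs -stderr <plugin-alloc-id> <plugin-task-name>

# the failing gitea task's own output
nomad alloc logs <gitea-alloc-id> gitea
nomad alloc logs -stderr <gitea-alloc-id> gitea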

OK, so it seems like it's an issue in my container; my mistake. Hope I wasn't too much of a bother :grin:. Anyway, thank you so much for helping me with this little situation. And by the way, Nomad is really well made; since I dug around its insides a bit during this debugging session, I'm even more confident in that statement!
Is there something like a "close issue" option on this forum?

There isn’t, but glad to hear you got that worked out!