I’m failing to deploy a container that uses a volume backed by the ceph-csi plugin:
failed to setup alloc: pre-run hook "csi_hook" failed: node plugin returned an internal error, check the plugin allocation logs for more information: rpc error: code = Internal desc = failed to establish the connection: failed to get connection: connecting failed: rados: ret=-110, Connection timed out
I see this in the Nomad client logs:
Jan 28 00:22:01 ind-test-nomad-worker11 nomad[208094]: 2023-01-28T00:22:01.022-0500 [WARN] client.ceph-csi: finished client unary call: grpc.code=Internal duration=50m0.016328065s grpc.service=csi.v1.Node grpc.method=NodeStageVolume
Jan 28 00:22:01 ind-test-nomad-worker11 nomad[208094]: 2023-01-28T00:22:01.022-0500 [ERROR] client.alloc_runner: prerun failed: alloc_id=b6c05535-82b8-f5d7-a65f-0960daf0b087 error="pre-run hook \"csi_hook\" failed: node plugin returned an internal error, check the plugin allocation logs for more information: rpc error: code = Internal desc = failed to establish the connection: failed to get connection: connecting failed: rados: ret=-110, Connection timed out"
I see this in the ceph-csi node plugin container logs:
I0128 05:30:42.057159 7 utils.go:195] ID: 5090 Req-ID: 0001-0024-1e35f6bc-1257-45b6-aa9d-16f9ecd30652-0000000000000024-dc31d910-9ecc-11ed-b9a0-9238a2ead1d6 GRPC call: /csi.v1.Node/NodeStageVolume
I0128 05:30:42.057365 7 utils.go:206] ID: 5090 Req-ID: 0001-0024-1e35f6bc-1257-45b6-aa9d-16f9ecd30652-0000000000000024-dc31d910-9ecc-11ed-b9a0-9238a2ead1d6 GRPC request: {"secrets":"***stripped***","staging_target_path":"/local/csi/staging/prometheus-us-ind-test/rw-file-system-single-node-writer","volume_capability":{"AccessType":{"Mount":{}},"access_mode":{"mode":1}},"volume_context":{"clusterID":"*********************","imageFeatures":"layering","imageName":"csi-vol-dc31d910-9ecc-11ed-b9a0-9238a2ead1d6","journalPool":"ind-nonprod2","pool":"ind-nonprod2"},"volume_id":"0001-0024-1e35f6bc-1257-45b6-aa9d-16f9ecd30652-0000000000000024-dc31d910-9ecc-11ed-b9a0-9238a2ead1d6"}
I0128 05:30:42.057659 7 rbd_util.go:1279] ID: 5090 Req-ID: 0001-0024-1e35f6bc-1257-45b6-aa9d-16f9ecd30652-0000000000000024-dc31d910-9ecc-11ed-b9a0-9238a2ead1d6 setting disableInUseChecks: false image features: [layering] mounter: rbd
FWIW, I am able to manually map the image from the same host via:
rbd device map ind-nonprod2/test_image --id ind-nonprod2 --keyfile ceph_secret.txt
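For reference, a ceph-csi RBD volume registration for Nomad generally looks something like the sketch below. All names, IDs, and keys are placeholders, not my real values; the point is that the clusterID in context has to match an entry in the plugin's csi-config map, which is where the node plugin gets the list of monitors it tries to dial (a mismatch or an unreachable monitor list is one way to end up with rados: ret=-110).

```hcl
# Minimal sketch of a ceph-csi volume registration (placeholder values).
id        = "prometheus-us-ind-test"
name      = "prometheus-us-ind-test"
type      = "csi"
plugin_id = "ceph-csi"

# Must match the volume_id the plugin reports (encodes clusterID + image).
external_id = "<volume handle from ceph-csi>"

capability {
  access_mode     = "single-node-writer"
  attachment_mode = "file-system"
}

# Ceph user credentials the node plugin uses to stage the volume.
secrets {
  userID  = "ind-nonprod2"
  userKey = "<ceph user key>"
}

# Passed to the plugin as volume_context; clusterID selects the monitor
# list from the plugin's csi-config.
context {
  clusterID = "<ceph cluster fsid>"
  pool      = "ind-nonprod2"
}
```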
The thing that’s really leaving me scratching my head is that the allocation isn’t even starting: there is no task for the job. Presumably this is because the volume is not registered, even though the Nomad UI shows it as registered.
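For completeness, the task group claims the volume roughly like this (again a sketch with placeholder names; the volume_mount only takes effect once the pre-run csi_hook has staged the volume, which is the step failing above):

```hcl
group "prometheus" {
  # Claims the registered CSI volume by its Nomad volume ID.
  volume "data" {
    type            = "csi"
    source          = "prometheus-us-ind-test"
    access_mode     = "single-node-writer"
    attachment_mode = "file-system"
  }

  task "prometheus" {
    driver = "docker"

    # Mounts the staged volume into the task filesystem.
    volume_mount {
      volume      = "data"
      destination = "/data"
    }
  }
}
```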
I’m sure I’m missing something here, but I’m not sure where to start.