Transport endpoint is not connected

Hi,

In our environment we have a Gluster cluster, and we use a self-written CSI plugin to connect to a volume. In general this works well, but sometimes we hit a failure: Transport endpoint is not connected

When I run ls in the job container itself, I get:

root@ac0c0829a9af:/opt# ls -la
ls: cannot access 'shared': Transport endpoint is not connected
total 12
drwxr-xr-x 1 root root 4096 Sep 25 14:20 .
drwxr-xr-x 1 root root 4096 Sep 25 14:20 ..
d????????? ? ?    ?       ?            ? shared
root@ac0c0829a9af:/opt# 
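
The message "Transport endpoint is not connected" is the text for errno ENOTCONN (107 on Linux), so the broken state can also be detected from a script by calling stat on the path and checking the error code. A minimal sketch in Python; the function name mount_state and the path in the usage note are only illustrative, not part of any plugin:

```python
import errno
import os


def mount_state(path):
    """Classify a path as 'ok', 'stale' (ENOTCONN) or 'missing'."""
    try:
        os.stat(path)
    except FileNotFoundError:
        return "missing"
    except OSError as exc:
        # ENOTCONN is what ls reports as
        # "Transport endpoint is not connected".
        if exc.errno == errno.ENOTCONN:
            return "stale"
        raise
    return "ok"
```

Calling mount_state("/opt/shared") inside the job container should return "stale" while the directory is in the state shown above, assuming the error really comes from the FUSE mount behind it.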

On the client, I can see the data from the volume:

root@client03(nomadclient-internal-hetz):/var/lib/nomad/client/csi/monolith/csi.gluster/per-alloc/d1abed91-330a-a8fc-9906-ba12afe9be91/application/rw-file-system-multi-node-multi-writer$ ls -la
total 192
drwxrwxr-x 49 root root 4096 Sep 25 14:54 .
drwx------  3 root root 4096 Sep 25 14:19 ..
drwxr-xr-x  3 root root 4096 Sep 24 07:04 folder1
drwxr-xr-x  3 root root 4096 Sep 11 10:27 folder2

The same is true inside the CSI plugin:

root@ec3ce43542ba:/mnt# ls -la /csi/per-alloc/d1abed91-330a-a8fc-9906-ba12afe9be91/application/rw-file-system-multi-node-multi-writer
total 192
drwxrwxr-x 49 root root 4096 Sep 25 14:54 .
drwx------  3 root root 4096 Sep 25 14:19 ..
drwxr-xr-x  3 root root 4096 Sep 24 07:04 folder1
drwxr-xr-x  3 root root 4096 Sep 11 10:27 folder2

So mounting the volume via the CSI plugin works.

But sometimes I also get the failure in the CSI plugin itself. It's really strange.

Does anyone know this error?

Hi,

I think we have the same problem.
In our environment we have a Gluster cluster, and we use the Kadalu CSI plugin to connect to Nomad's CSI volumes.
Node: AlmaLinux VERSION=8.7
Gluster: glusterfs 8.6
Kadalu: version 1.0.0

Sometimes an application becomes unavailable because of a problem with its database (the data is persisted with Kadalu).
The database job is running and reported as healthy, but in its logs we have:
PANIC: could not open file "/var/lib/postgresql/data/global/pg_control": Transport endpoint is not connected

When we try to restart the job, it may fail with:
failed to setup alloc: pre-run hook "csi_hook" failed: rpc error: code = Unknown desc = Exception calling application: [Errno 107] Transport endpoint is not connected: '/mnt/PROD/subvol'

After the job is restarted, the kadalu-csi-nodeplugin on a node may log the following repeatedly:
DEBUG [nodeserver - 150:NodeUnpublishVolume] - Received the unmount request volume=keycloak-db
even though the database has been restarted (volume=keycloak-db is mounted).

Sometimes it is not possible to restart just the database; it seems that all the Kadalu jobs have to be restarted as well.
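
To see which mounts on a node are affected before restarting everything, one option is to go through the mount table and probe each Gluster mount point. A rough sketch in Python, assuming the volumes show up with filesystem type fuse.glusterfs in /proc/mounts; the function names are made up for illustration and are not part of Kadalu or Nomad:

```python
import errno
import os


def glusterfs_mountpoints(mounts_text):
    """Return mount points of type fuse.glusterfs from /proc/mounts-style text."""
    points = []
    for line in mounts_text.splitlines():
        fields = line.split()
        # Fields are: device, mount point, filesystem type, options, ...
        if len(fields) >= 3 and fields[2] == "fuse.glusterfs":
            # /proc/mounts escapes spaces in paths as \040.
            points.append(fields[1].replace("\\040", " "))
    return points


def stale_mountpoints(mounts_text, probe=os.stat):
    """Return the Gluster mount points whose probe fails with ENOTCONN (errno 107)."""
    stale = []
    for point in glusterfs_mountpoints(mounts_text):
        try:
            probe(point)
        except OSError as exc:
            if exc.errno == errno.ENOTCONN:
                stale.append(point)
    return stale
```

On a node this could be run as stale_mountpoints(open("/proc/mounts").read()). The probe argument is only there so that the check can be tried out without a real broken mount.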

Can you please tell me if you have a solution to this problem?