How to debug intermittent aws-ebs-csi-driver auth issues that prevent detaching a volume

I have set up a Nomad cluster with a controller and clients according to this tutorial:
https://developer.hashicorp.com/nomad/tutorials/stateful-workloads/stateful-workloads-csi-volumes.

I am able to create a job which uses the volume, so initially all seems well.

The problem occurs when the driver tries to detach the volume from an instance after the job is migrated to another node. The error is:

nomad[3014]: | caused by: EnvAccessKeyNotFound: failed to find credentials in the environment.
nomad[3014]: | SharedCredsLoad: failed to load profile, .
nomad[3014]: | EC2RoleRequestError: no EC2 instance role found
nomad[3014]: | caused by: EC2MetadataError: failed to make EC2Metadata request

After completing the steps in the tutorial, I have the aws-ebs-csi-driver plugin running on the controller and the clients:

# nomad plugin status
Container Storage Interface
ID        Provider         Controllers Healthy/Expected  Nodes Healthy/Expected
aws-ebs0  ebs.csi.aws.com  1/1                           3/3
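
For what it's worth, the controller plugin's own logs can be followed like this (the allocation ID placeholder below comes from the detailed plugin status; I am not pasting the full output here):

# nomad plugin status aws-ebs0
# nomad alloc logs -stderr -f <controller-alloc-id>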

I have created a volume, which I can attach to and detach from the controller instance using the aws cli:

# aws ec2 describe-volumes --region eu-west-2 --filter Name=tag:Name,Values=mysql-server-vol-1 --output text
VOLUMES	eu-west-2a	2023-08-01T14:31:32.713Z	False	False	10		in-use	vol-xxxxxxxxxxxxxx	standard
ATTACHMENTS	2023-08-02T17:53:30.000Z	False	/dev/xvdaa	i-xxxxxxxxxxxxx	attached	vol-xxxxxxxxxxxxxx
TAGS	Name	mysql-server-vol-1

The IAM role and instance profile have been added to both the controller and the client instances, and I can list, attach, and detach the volume from all of these instances using the profile.
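
The manual check is essentially the same attach/detach cycle the plugin needs to perform, something along these lines (volume and instance IDs redacted as above; the device name is taken from the attachment output earlier):

# aws ec2 detach-volume --region eu-west-2 --volume-id vol-xxxxxxxxxxxxxx
# aws ec2 attach-volume --region eu-west-2 --volume-id vol-xxxxxxxxxxxxxx --instance-id i-xxxxxxxxxxxxx --device /dev/xvdaa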

The volume has been registered with nomad:

# nomad volume status
Container Storage Interface
ID     Name   Namespace  Plugin ID  Schedulable  Access Mode
mysql  mysql  default    aws-ebs0   true         single-node-writer
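
The registration itself was done with nomad volume register against a volume spec based on the one in the tutorial (the file name here is just illustrative):

# nomad volume register mysql-volume.hcl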

After deploying the mysql-server job, the volume is attached to the correct client and the job runs as expected:

# nomad job status mysql-server
ID            = mysql-server
Name          = mysql-server
Submit Date   = 2023-08-01T20:02:46Z
Type          = service
Priority      = 50
Datacenters   = lab1
Namespace     = default
Node Pool     = default
Status        = running
Periodic      = false
Parameterized = false

Summary
Task Group    Queued  Starting  Running  Failed  Complete  Lost  Unknown
mysql-server  0       0         1        5       4         0     0

Allocations
ID        Node ID   Task Group    Version  Desired  Status    Created     Modified
0845e28b  84848fe2  mysql-server  0        run      running   30m33s ago  12m16s ago
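
That the volume is attached where expected can be cross-checked from the volume side as well, e.g.:

# nomad volume status mysql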

I am also able to exec into the docker container running the job and see that the data is mounted correctly from the volume:

root@d206f94e8950:/# mysql -h localhost -p -D itemcollection -e 'select * from items;'
Enter password:
+----+----------+
| id | name     |
+----+----------+
|  1 | bike     |
|  2 | baseball |
|  3 | chair    |
|  4 | glove    |
+----+----------+
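
The shell above is from docker exec on the client that holds the allocation, roughly:

# docker exec -it d206f94e8950 bash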

However, if I do something that causes the job to be migrated to another client, the plugin fails to detach the volume:

nomad[3014]: 2023-08-02T17:42:35.797Z [ERROR] nomad.volumes_watcher: error releasing volume claims: namespace=default volume_id=mysql
nomad[3014]: error=
nomad[3014]: | 1 error occurred:
nomad[3014]: | \t* could not detach from controller: controller detach volume: CSI.ControllerDetachVolume: controller plugin returned an internal error, check the plugin allocation logs for more information: rpc error: code = Internal desc = Could not detach volume "vol-xxxxxxxxxxxxxx" from node "i-xxxxxxxxxxxxx": error listing AWS instances: NoCredentialProviders: no valid providers in chain
nomad[3014]: | caused by: EnvAccessKeyNotFound: failed to find credentials in the environment.
nomad[3014]: | SharedCredsLoad: failed to load profile, .
nomad[3014]: | EC2RoleRequestError: no EC2 instance role found
nomad[3014]: | caused by: EC2MetadataError: failed to make EC2Metadata request
nomad[3014]: | <?xml version="1.0" encoding="iso-8859-1"?>
nomad[3014]: | <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
nomad[3014]: | \t\t "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
nomad[3014]: | <html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
nomad[3014]: |  <head>
nomad[3014]: |   <title>404 - Not Found</title>
nomad[3014]: |  </head>
nomad[3014]: |  <body>
nomad[3014]: |   <h1>404 - Not Found</h1>
nomad[3014]: |  </body>
nomad[3014]: | </html>
nomad[3014]: |
nomad[3014]: | \tstatus code: 404, request id:
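
One way to trigger the migration (and with it the failing detach) is to drain the client that holds the allocation, e.g. with the node ID from the allocation listing above:

# nomad node drain -enable -yes 84848fe2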

The IAM role attached to the instance via the profile appears to have the necessary permissions:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Action": [
                "ec2:ModifyVolume",
                "ec2:DetachVolume",
                "ec2:DescribeVolumesModifications",
                "ec2:DescribeVolumes",
                "ec2:DescribeTags",
                "ec2:DescribeInstances",
                "ec2:DescribeAvailabilityZones",
                "ec2:AttachVolume"
            ],
            "Effect": "Allow",
            "Resource": "*"
        }
    ]
}

I can validate this by attaching and detaching the volume from the aws cli using the profile. In fact, I can see that the instance profile is correctly set from the metadata:

# curl http://169.254.169.254/latest/meta-data/iam/info
{
  "Code" : "Success",
  "LastUpdated" : "2023-08-02T18:12:02Z",
  "InstanceProfileArn" : "arn:aws:iam::12345678:instance-profile/mysql-server-xxxx",
  "InstanceProfileId" : "AAAAAAAAAAAAAAAAAA"
}
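
The role credentials behind that profile also resolve on the host itself, which can be double-checked with, for example:

# curl http://169.254.169.254/latest/meta-data/iam/security-credentials/
# aws sts get-caller-identity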

I think that the driver is unable to query the instance metadata for the instance profile:

nomad[3014]: | \t* could not detach from controller: controller detach volume: CSI.ControllerDetachVolume: controller plugin returned an internal error, check the plugin allocation logs for more information: rpc error: code = Internal desc = Could not detach volume "vol-xxxxxxxxxxxx" from node "i-xxxxxxxxxxxxxx": error listing AWS instances: NoCredentialProviders: no valid providers in chain
...
nomad[3014]: | EC2RoleRequestError: no EC2 instance role found
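
Since the controller plugin runs inside a Docker container, the instance metadata settings themselves may also be relevant: with IMDSv2 enforced, a hop limit of 1 is enough to block requests coming from a bridge-networked container. The HttpTokens and HttpPutResponseHopLimit values can be inspected with something like:

# aws ec2 describe-instances --region eu-west-2 --instance-ids i-xxxxxxxxxxxxx \
    --query 'Reservations[].Instances[].MetadataOptions'

and, assuming the plugin image ships a shell and wget (which it may not), the same metadata path can be probed from inside the plugin allocation:

# nomad alloc exec <controller-alloc-id> sh -c "wget -qO- http://169.254.169.254/latest/meta-data/iam/security-credentials/"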

I have tried setting the verbosity and aws-sdk-debug-log settings on the driver:

        args = [
          "controller",
          "--endpoint=unix://csi/csi.sock",
          "--logtostderr",
          "--v=11",
          "--aws-sdk-debug-log"
        ]

but it doesn't seem to be logging anything useful about the credential lookup.

How can I debug why the plugin intermittently fails to find the instance role credentials when detaching the volume?