Vault-agent-init stops working when new EC2 instance nodes are created

I have an EKS cluster set up with 2 node groups, 5 nodes total. I was able to get vault running fine: the vault-agent-init sidecar was able to mount secrets, and my pods started up correctly with the vault secrets mounted. However, while doing an unrelated test that involved deleting the node EC2 instances one by one, everything came back fine except for vault. vault-agent-init now errors with permission denied to http://vault.vault.svc:8200/v1/auth/kubernetes/login and the vault server logs have:

login unauthorized due to: lookup failed: service account unauthorized; this could mean it has been deleted or recreated with a new token

I checked the service account token and it is 11 days old, so it has not been recreated. I can try to recreate the service account to see if that fixes it, but I would have expected this to recover automatically. Is there something additional that needs to be set up to enable automatic recovery?
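
The age of the service account object says little about the token that is actually presented at login, so it can help to decode the token and read its claims directly. Below is a minimal Python sketch, assuming it runs inside the application pod and that the token sits at the standard mount path `/var/run/secrets/kubernetes.io/serviceaccount/token`. The function names are illustrative, and the signature is not verified; only the payload is decoded.

```python
import base64
import json

# Standard in-pod mount path for the service account token (an assumption
# about this deployment; adjust if the token is projected elsewhere).
TOKEN_PATH = "/var/run/secrets/kubernetes.io/serviceaccount/token"


def decode_jwt_claims(token: str) -> dict:
    """Return the claims of a JWT without verifying its signature."""
    parts = token.strip().split(".")
    if len(parts) != 3:
        raise ValueError("expected a JWT with three dot-separated parts")
    payload = parts[1]
    # JWTs use unpadded base64url; restore the padding before decoding.
    payload += "=" * (-len(payload) % 4)
    return json.loads(base64.urlsafe_b64decode(payload))


def read_claims(path: str = TOKEN_PATH) -> dict:
    """Read the mounted token file and decode its claims."""
    with open(path) as f:
        return decode_jwt_claims(f.read())
```

Calling `read_claims()` inside the pod returns a dictionary; the `iss`, `sub`, `iat` and `exp` entries (where present) show who issued the token, which service account it names, and when it was issued and expires, as Unix timestamps.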


Some additional details: I have vault set up to auto-unseal with AWS KMS.
The vault server started successfully after the restarts:

==> Vault server started! Log data will stream in below:

2021-09-20T21:57:37.481Z [INFO]  core: stored unseal keys supported, attempting fetch
2021-09-20T21:57:37.527Z [INFO]  core.cluster-listener.tcp: starting listener: listener_address=[::]:8201
2021-09-20T21:57:37.527Z [INFO]  core.cluster-listener: serving cluster requests: cluster_listen_address=[::]:8201
2021-09-20T21:57:37.528Z [INFO]  core: post-unseal setup starting
2021-09-20T21:57:37.529Z [INFO]  core: loaded wrapping token key
2021-09-20T21:57:37.529Z [INFO]  core: successfully setup plugin catalog: plugin-directory=""
2021-09-20T21:57:37.531Z [INFO]  core: successfully mounted backend: type=system path=sys/
2021-09-20T21:57:37.531Z [INFO]  core: successfully mounted backend: type=identity path=identity/
2021-09-20T21:57:37.532Z [INFO]  core: successfully mounted backend: type=kv path=kv/
2021-09-20T21:57:37.532Z [INFO]  core: successfully mounted backend: type=cubbyhole path=cubbyhole/
2021-09-20T21:57:37.537Z [INFO]  core: successfully enabled credential backend: type=token path=token/
2021-09-20T21:57:37.537Z [INFO]  core: successfully enabled credential backend: type=kubernetes path=kubernetes/
2021-09-20T21:57:37.537Z [INFO]  rollback: starting rollback manager
2021-09-20T21:57:37.537Z [INFO]  core: restoring leases
2021-09-20T21:57:37.544Z [INFO]  identity: entities restored
2021-09-20T21:57:37.544Z [INFO]  identity: groups restored
2021-09-20T21:57:37.557Z [INFO]  expiration: lease restore complete
2021-09-20T21:57:37.590Z [INFO]  core: usage gauge collection is disabled
2021-09-20T21:57:37.590Z [INFO]  core: post-unseal setup complete
2021-09-20T21:57:37.590Z [INFO]  core: vault is unsealed
2021-09-20T21:57:37.590Z [INFO]  core: unsealed with stored key

I don’t know EKS, but it sounds like you’re not using a VIP/ingress IP and are connecting to a specific node (and IP address); when the node restarts and gets a new IP, the agent connection breaks.

I tried recreating the application and service account but it still showed the error. I had to redeploy vault in order to fix it.

This is never the answer, you’re doing something else wrong.
Redeploying vault from scratch means you damaged something, and nothing you have described is damaging. The only remaining option is that you’re changing something else that you’re not posting here.

@aram Do you have anything constructive to add?

As per the documentation for the vault-agent-init, here are the annotations I’m using on my deployment:
'true' kv/****** |
          {{- with secret "kv/******" -}}
            {{ range $k, $v := }}
          export {{ $k }}='{{ $v }}'
            {{ end }}
          {{- end }} **** 'true'

But the pod does not seem to be where the problem exists.

I restarted the instance that the vault pod was on. When it came back, it had the same error.
I’m using the hashicorp/vault helm chart to install vault.

Is there a vault setting, or permission setting, specific to Kubernetes, that enables it to restart on a new node?

@aram FYI, I am not using the vault client or connecting to a specific IP. I’m using the vault-agent-init, which takes care of getting the secrets and mounting them to make them available to a pod. I’m not getting a connection error or broken connection, but permission denied. Nothing I do to the pod or agent fixes the issue. Only when I uninstall and reinstall vault is it fixed. I agree that this shouldn’t be required, but it seems like there is something missing from vault recovery in Kubernetes that causes this to break when the Kubernetes node that the vault pod is running on is lost.
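
To make the difference between a connection failure and a rejected login concrete, the login call can be reproduced by hand from inside the pod. This is a rough Python sketch, assuming the address from the error message above, the default `kubernetes` auth mount, and the standard token mount path. The role name is a placeholder for whatever the (elided) role annotation holds, and both function names are illustrative.

```python
import json
import urllib.error
import urllib.request

VAULT_ADDR = "http://vault.vault.svc:8200"  # address taken from the error message
# Standard in-pod token path (an assumption about this deployment).
TOKEN_PATH = "/var/run/secrets/kubernetes.io/serviceaccount/token"


def build_login_request(role: str, jwt: str, mount: str = "kubernetes") -> urllib.request.Request:
    """Build the POST for the Kubernetes auth method's login endpoint."""
    body = json.dumps({"role": role, "jwt": jwt}).encode()
    return urllib.request.Request(
        f"{VAULT_ADDR}/v1/auth/{mount}/login",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def try_login(role: str) -> str:
    """Attempt a login and report whether it failed at the network or auth layer."""
    with open(TOKEN_PATH) as f:
        jwt = f.read().strip()
    try:
        with urllib.request.urlopen(build_login_request(role, jwt), timeout=5) as resp:
            return f"login ok (HTTP {resp.status})"
    except urllib.error.HTTPError as err:
        # Vault answered, so the connection works; the body carries its error list.
        return f"vault rejected login (HTTP {err.code}): {err.read().decode()}"
    except urllib.error.URLError as err:
        # No HTTP response at all: DNS, routing, or a refused connection.
        return f"could not reach vault: {err.reason}"
```

Running `try_login("my-role")` with the real role name separates the two cases: an HTTP error response means vault was reachable and refused the credentials, while a `URLError` points at networking instead.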

@aram There is nothing else I am changing. To reiterate, I’m successfully deploying vault to Kubernetes (EKS) using auto-unseal from AWS KMS. The vault deployment creates a PVC so that on restart it should attach to the same disk. I’m using the vault-agent-init to mount secrets on a filesystem for the pod. I’m not using the vault client directly. Everything works fine until the Kubernetes node that vault is running on stops, which causes the vault pod to move to an active node, where it attaches back to the same PVC. I’ve verified this three times and it breaks every time. Is there something specific you can suggest that I can check to see why those errors are occurring?