I have HashiCorp Vault running as a pod in EKS, configured with PV/PVC backed by EFS for storage. Auto-unseal is set up using AWS KMS, and the storage backend is the file method. This setup has been working fine for a long time. However, recently, I’ve noticed that Vault is no longer auto-unsealing reliably. At times, I have to manually exec into the pod to check the Vault status, which shows no output, and only after running the vault operator unseal command manually does it start functioning again. I’m looking for a permanent fix for this issue.
I would suggest posting this as a bug on the Vault repository or, if you use Vault Enterprise, opening a support case.
I do not see anything in the current issues list on GitHub or on the known issues/fix list on developer.hashicorp.com.
They would also need additional details such as the Vault version, EKS version, etc.
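If it helps when gathering those details, here is a minimal Python sketch that reads Vault's unauthenticated `/v1/sys/seal-status` endpoint and prints the fields most relevant to an unseal problem. The default address and the helper names `fetch_seal_status` and `summarize` are assumptions for illustration, and the exact set of fields returned may vary by Vault version.

```python
import json
import urllib.request


def fetch_seal_status(addr="http://127.0.0.1:8200", timeout=5.0):
    """Return the parsed JSON from GET <addr>/v1/sys/seal-status.

    The address is an assumption; point it at your own Vault listener.
    """
    url = f"{addr}/v1/sys/seal-status"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


def summarize(status):
    """Format the fields worth quoting in a bug report or support case."""
    fields = ("version", "type", "initialized", "sealed")
    # Missing fields are shown as None rather than raising an error.
    return ", ".join(f"{key}={status.get(key)}" for key in fields)
```

For example, with the Vault port forwarded to your machine, `print(summarize(fetch_seal_status()))` gives a one-line summary. A `sealed` value of true after a pod restart, with `type` reporting the KMS seal, would suggest the auto-unseal step itself is failing rather than the seal configuration being ignored.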
Thanks for the response.
Here are the additional details you requested:
- Vault Version: 1.8.0
- EKS Version: 1.30
- Storage Backend: File, with Persistent Volume backed by EFS
- Auto-Unseal: Configured using AWS KMS
- Recent Change: The EKS cluster was recently migrated to EKS Managed Node Groups (Auto EKS) from a self-managed setup.
This setup had been working without issues for a long time. The problem started occurring only after the move to Auto EKS.
Are you able to bring back self-managed nodes in your cluster as a test? It would be great if the issue could be replicated on a specific node type. If that proves to be the case, it would also help to know which instance family is involved and whether it changed from the previous setup.
Yes, I can revert to using self-managed nodes, specifically for the Vault workload if needed. Just to clarify, are you suggesting to replicate the issue by isolating Vault on a test node group? If so, I can configure a dedicated node group for Vault alone rather than the entire cluster.
Also, both the previous and current setups have been using the same instance type — t3a.2xlarge.
Yes, I am just suggesting that you migrate Vault back to self-managed nodes to see if the issue stops. If it does, and moving back to the auto nodes brings the issue back, that should help narrow down the scope of what is going on (for example, are the auto nodes being started in a different AZ/subnet that may not have its ACL/security group open to the KMS endpoints?).
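One rough way to test the network part of that theory is to run a plain TCP check against the regional KMS endpoint from a pod scheduled on each node type. The sketch below assumes the standard public endpoint name `kms.<region>.amazonaws.com` on port 443; the helper names are hypothetical, and the region is whatever your seal configuration uses. It only checks reachability, so a successful connection says nothing about IAM permissions on the KMS key.

```python
import socket


def kms_endpoint(region):
    """Build the standard regional KMS hostname (assumed public endpoint form)."""
    return f"kms.{region}.amazonaws.com"


def can_connect(host, port=443, timeout=3.0):
    """Return True if a TCP connection to host:port opens within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections, and timeouts.
        return False
```

Running `can_connect(kms_endpoint("us-east-1"))` (with your own region substituted) from a pod on a self-managed node and again from one on an auto node should show whether the two node types differ in their path to KMS.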