We are running into an issue in our Vault environment where the Vault leader consistently consumes all of the RAM available to it. After sitting pegged at its memory limit for days to weeks, something eventually pushes it over the edge and the process dies due to OOM. We run Vault in Kubernetes, so this causes the Vault leader pod to restart.
We see this behavior regardless of the limit we set: with a pod limit of 8GB, Vault ends up consistently consuming 7.99GB; with 12GB, 11.99GB; with 24GB, 23.99GB. This is usually fine for a couple of weeks, but eventually Vault goes over the memory limit and the pod restarts.
We identified a user process that ran every 5 minutes and would log in to Vault and pull a secret 100 times within that 5-minute window. The resulting tokens effectively had no expiration, and we now have 1 million stale leases in our environment from this process. We have tried to revoke the leases using the API endpoint intended for this, but after too many iterations in a row it causes Vault to crash (OOM).
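As a rough sanity check on the 1 million figure, here is the back-of-the-envelope arithmetic. It assumes each login creates exactly one lease and that none of them ever expire; both are simplifications, and the function names are purely illustrative.

```python
# Rough estimate of how fast leases pile up from the process described above.
# Assumptions (not measured): one lease per login, and no lease ever expires.

MINUTES_PER_DAY = 24 * 60


def leases_per_day(logins_per_window: int, window_minutes: int) -> float:
    """Leases created per day when `logins_per_window` logins happen every window."""
    windows_per_day = MINUTES_PER_DAY / window_minutes
    return logins_per_window * windows_per_day


def days_to_accumulate(total_leases: int, per_day: float) -> float:
    """Days of continuous running needed to reach `total_leases`."""
    return total_leases / per_day


per_day = leases_per_day(logins_per_window=100, window_minutes=5)
days = days_to_accumulate(total_leases=1_000_000, per_day=per_day)
print(f"{per_day:.0f} leases/day, about {days:.1f} days to reach 1,000,000")
```

Under those assumptions the process adds 28,800 leases per day, so it would take about five weeks of continuous running to reach 1 million.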
Our environment is:
- Vault 1.5.4 (Open Source)
- Kubernetes hosted: 5-pod cluster with auto-unseal (AWS KMS)
- Built-in Raft storage backend
- Has anyone run into something like this before, and do you have any ideas about what we should check to identify the root cause?
- It’s not clear to me what exactly Vault is doing with the memory, since our Vault DB file is only 4GB. Is there documentation that explains what Vault keeps in memory over the lifetime of the process?
- Could the roughly 1 million stale leases described above be a cause of the issue we are seeing? Does the number of leases consume enough memory over time that it could use up tens of GB?
- Is there a good way to revoke all of the million leases we have that doesn’t also cause Vault to run out of memory?
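To make the last question concrete, a throttled revocation loop might look like the minimal sketch below. It is an illustration, not the exact procedure we ran. It assumes the lease IDs have already been collected, that `VAULT_ADDR` and `VAULT_TOKEN` are set in the environment, and that leases are revoked one at a time via `PUT /v1/sys/leases/revoke`. The function names, batch size, and pause length are hypothetical, and whether pacing like this actually keeps Vault's memory in check is exactly what we are unsure about.

```python
import json
import os
import time
import urllib.request
from typing import Callable, Iterable, Iterator, List


def batched(items: Iterable[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive lists of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    batch: List[str] = []
    for item in items:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def revoke_one_via_http(lease_id: str) -> None:
    """Revoke a single lease with PUT /v1/sys/leases/revoke.

    Assumes VAULT_ADDR and VAULT_TOKEN are set; urlopen raises on HTTP errors.
    """
    request = urllib.request.Request(
        url=os.environ["VAULT_ADDR"].rstrip("/") + "/v1/sys/leases/revoke",
        data=json.dumps({"lease_id": lease_id}).encode("utf-8"),
        headers={
            "X-Vault-Token": os.environ["VAULT_TOKEN"],
            "Content-Type": "application/json",
        },
        method="PUT",
    )
    with urllib.request.urlopen(request, timeout=30):
        pass  # any non-error response is treated as success


def revoke_in_batches(
    lease_ids: Iterable[str],
    revoke_one: Callable[[str], None],
    batch_size: int = 100,      # hypothetical value, tune for your cluster
    pause_seconds: float = 5.0,  # hypothetical value, tune for your cluster
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Revoke leases batch by batch, pausing after each batch. Returns the count."""
    revoked = 0
    for batch in batched(lease_ids, batch_size):
        for lease_id in batch:
            revoke_one(lease_id)
            revoked += 1
        sleep(pause_seconds)  # give Vault time to settle before the next batch
    return revoked
```

Usage would be something like `revoke_in_batches(lease_ids, revoke_one_via_http)`, watching the leader's memory between batches and stopping if it starts to climb.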
I'm planning to dive into the Vault code a bit tomorrow as well to try to understand how this all works, but wanted to see if anyone had thoughts. Thanks!