Vault expire_num_leases very high on a specific timeframe

Dear HashiCorp Team,

We are experiencing a strange situation with our Vault cluster deployed in a Red Hat OpenShift cluster, where we see a high number of leases (to be expired) during a specific timeframe (see attachment).
This load appears every day in the same time window. I don’t know if this is some kind of internal process within the Vault cluster, but I didn’t find any clear explanation.

Vault Cluster properties :

  • License OSS
  • Version 1.10.0
  • Mode HA Cluster (3 nodes)
  • OpenShift Cluster 4.9

Any kind of hint would be helpful.

Thanks

You should look to identify which kind of leases these are.

But before you start, make sure you’ve understood exactly what that metric means - it’s not leases due for expiration, it’s the total number of leases being tracked by the expiration manager - i.e. all leases.

Given you have such an impressive step-change, perhaps the Vault server log or audit log has useful clues?

If not, some other interesting metrics to look at could be:

  • vault_token_creation - i.e. rate of lease creation by authentications, which is broken down by several useful labels
  • vault_secret_lease_creation - i.e. rate of lease creation by access to leased secrets - also with useful labels
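
As a rough sketch of how those labels can be used: the snippet below sums the samples of one metric, grouped by a chosen label, from Prometheus text-format output (Vault can serve this from its sys/metrics endpoint with format=prometheus). The sample text is made up for illustration, and the label names (auth_method, mount_point, token_type) are assumptions based on Vault's telemetry documentation, so check them against what your own cluster exports.

```python
import re
from collections import defaultdict

# Matches lines such as: metric_name{label="value",other="x"} 12
SAMPLE_RE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)'
)
LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def sum_by_label(exposition_text, metric, label):
    """Sum the samples of `metric`, grouped by the value of `label`."""
    totals = defaultdict(float)
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and HELP/TYPE comments
        m = SAMPLE_RE.match(line)
        if m is None or m.group("name") != metric:
            continue
        labels = dict(LABEL_RE.findall(m.group("labels")))
        totals[labels.get(label, "<none>")] += float(m.group("value"))
    return dict(totals)


# Made-up sample data, NOT real output from any cluster.
SAMPLE = '''
# TYPE vault_token_creation counter
vault_token_creation{auth_method="kubernetes",mount_point="auth/kubernetes/",token_type="service"} 9500
vault_token_creation{auth_method="approle",mount_point="auth/approle/",token_type="batch"} 40
vault_token_creation{auth_method="kubernetes",mount_point="auth/k8s-dev/",token_type="service"} 120
'''

if __name__ == "__main__":
    by_method = sum_by_label(SAMPLE, "vault_token_creation", "auth_method")
    for method, total in sorted(by_method.items()):
        print(method, total)
```

Running this on a scrape taken before the spike and another taken during it, then comparing the totals, should show which auth method or mount point accounts for the increase.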

It wouldn’t be an internal process; there is a process or team that’s doing something they shouldn’t be. I’d suggest turning on your audit device and tracking down the auth that is generating the high number of leases.

Hello Team,

We are having the same situation in our Vault Cluster.
Currently we are using the same stack, Vault OSS deployed on OpenShift Cluster, but a newer version of Vault server: v1.12.3.
The spike in the number of leases that are about to expire happens every day at a specific time:

We have been observing other metrics that could be related to this case, such as:

vault_token_creation - increased on the same timeframe
vault_expire_revoke - decreased on the same timeframe
vault_expire_revoke_by_token - decreased on the same timeframe
vault_expire_lease_expiration - increased on the same timeframe

I wasn’t able to find what is causing this.
Any idea would help.

Thanks!

This seems to imply you have lots of things, which are all having their leases expire around the same time - which is quite possible, if they all created them around the same time, and they had the same TTLs.

The increase in vault_token_creation implies that these things are probably freshly logging in to Vault to get replacement tokens.

IIRC the vault_token_creation metric has useful labels in Prometheus identifying the kinds of tokens being created, so you should have a look at what those labels are for the timeseries that show a substantial increase.

Hello @maxb

Thank you for your fast response.

From my analysis, I can see that the highest number of leases is generated from a K8s auth method:

But the K8s role attached to it is configured with a TTL of 1 minute.
As far as I understand, these leases should be deleted after they expire, right?
So I don’t know why this happens once a day, when it should depend on the expiration of each lease.

This would seem to suggest you have a huge amount of Vault logins coming from your Kubernetes pods at this time - for example, this could be the case if a lot of scheduled nightly cron jobs are all triggering and logging in to Vault.
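
To see why a 1-minute TTL can still produce a visible spike, here is a back-of-the-envelope sketch. It assumes each login creates exactly one lease that lives for its full TTL and is never renewed or revoked early; the login timestamps are invented for illustration and do not come from any real cluster.

```python
def live_leases(login_times, ttl_seconds, at):
    """Count leases alive at time `at` (in seconds), assuming one lease per
    login that expires exactly `ttl_seconds` after it was created."""
    return sum(1 for t in login_times if t <= at < t + ttl_seconds)


# Hypothetical workload: a steady 1 login per second, plus a burst of 5000
# logins spread over 100 seconds starting at t=1000 (for example, many
# scheduled jobs firing together).
steady = list(range(0, 2000))
burst = [1000 + (i % 100) for i in range(5000)]
logins = steady + burst

TTL = 60  # seconds, i.e. a 1-minute token TTL

before = live_leases(logins, TTL, at=900)    # steady state only
during = live_leases(logins, TTL, at=1080)   # in the middle of the burst
after = live_leases(logins, TTL, at=1300)    # well after the burst has expired

if __name__ == "__main__":
    print(before, during, after)
```

Under these assumptions the count of tracked leases is roughly the login rate multiplied by the TTL, and it falls back to the baseline about one TTL after the burst stops. A spike that shows up once a day would therefore point to a daily burst of logins rather than to leases that fail to expire.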

Vault’s audit logging support can write logs of all requests and responses - you might use this to get detailed information on the contents of this burst of requests.
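
As a rough sketch of that kind of analysis: the snippet below reads lines from a file audit device log (one JSON object per line) and counts request entries per minute and request path. The field names used here (type, time, request.path) are assumptions about the JSON audit entry format, so compare them with a line from your own log first; the sample lines are invented.

```python
import json
from collections import Counter


def requests_per_minute(lines):
    """Count audit entries of type 'request' by (minute, request path)."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip partial or non-JSON lines
        if not isinstance(entry, dict) or entry.get("type") != "request":
            continue  # count each call once, via its request entry
        # RFC 3339 timestamps start with YYYY-MM-DDTHH:MM (16 characters).
        minute = str(entry.get("time", ""))[:16]
        path = (entry.get("request") or {}).get("path", "<unknown>")
        counts[(minute, path)] += 1
    return counts


# Invented sample lines, NOT real audit output.
SAMPLE_LINES = [
    '{"type":"request","time":"2023-03-01T02:00:01.123Z","request":{"path":"auth/kubernetes/login"}}',
    '{"type":"response","time":"2023-03-01T02:00:01.200Z","request":{"path":"auth/kubernetes/login"}}',
    '{"type":"request","time":"2023-03-01T02:00:30.000Z","request":{"path":"auth/kubernetes/login"}}',
    '{"type":"request","time":"2023-03-01T02:01:05.000Z","request":{"path":"secret/data/app"}}',
]

if __name__ == "__main__":
    for (minute, path), n in requests_per_minute(SAMPLE_LINES).most_common(5):
        print(minute, path, n)
```

To run it against a real log, pass an open file object instead of SAMPLE_LINES. The busiest (minute, path) pairs around the time of the spike should show whether the burst is made up of login requests and which auth mount they hit.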