We have Vault deployed on a k8s cluster and integrated with several other k8s clusters using the CSI driver to provide secrets to workloads there. It all works OK, but:
the list of leases for every cluster is several thousand pages long, e.g.:
/vault/data $ ls -lh
total 8G
drwxrws--- 2 root vault 16.0K Nov 15 19:42 lost+found
-rw-rw---- 1 vault vault 36 Nov 15 19:42 node-id
drwxrwsr-x 3 vault vault 4.0K Nov 15 19:42 raft
-rw------- 1 vault vault 8.0G Mar 16 16:33 vault.db
and the pods' memory consumption:
(⎈ |gke-devops-prod:vault) ~ k top pods
NAME                 CPU(cores)   MEMORY(bytes)
in-cluster-vault-0   13m          79Mi
in-cluster-vault-1   28m          369Mi
in-cluster-vault-2   63m          9557Mi
Is this expected? We don't have a big number of workloads there (maybe 20-30 with Vault integration) - Vault is being hit with no more than 1.5-2 RPS (not 2k RPS - just 2 RPS).
I believe it also explains why Vault takes forever to restart / roll out a new version - around 90-120 min per single pod - and while this is happening the other pods' resource consumption climbs to 11 GB…
What could be a misconfiguration on our side?
Could you check the number of entities that are bound to the auth method? If it's more than expected, please have a look at the alias / service account UID being used.
I will also have a look at your lease settings, as this could well be the issue.
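A minimal sketch of the checks I mean, assuming the auth method is mounted at the default kubernetes/ path (adjust the mount path and the jq usage to your setup):

# count identity entities and entity aliases
vault list -format=json identity/entity/id | jq length
vault list -format=json identity/entity-alias/id | jq length

# inspect one entity to see which alias / service account UID it is bound to
vault read identity/entity/id/<entity_id>

# count the token leases created by Kubernetes auth logins
# (this listing can be slow if there are hundreds of thousands of leases)
vault list -format=json sys/leases/lookup/auth/kubernetes/login | jq length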
any ideas @RemcoBuddelmeijer ?
I've reconfigured the k8s auth method to have a max lease TTL of 60s and I'm waiting to see whether the ~500 000 old leases will expire… Either way there is something not quite right with this native k8s integration - why were these leases never revoked?
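For anyone following along, the tuning itself is roughly this (assuming the auth method is mounted at the default kubernetes/ path):

# lower the default and max lease TTL on the Kubernetes auth mount
vault auth tune -default-lease-ttl=60s -max-lease-ttl=60s kubernetes/

# verify the new settings
vault read sys/auth/kubernetes/tune

Note that this only affects newly issued tokens; existing leases keep the TTL they were created with.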
@lukpep I have had a look at all the components that are being used, this means:
Kubernetes Auth
CSI provider
So far nothing odd has shown up on my side. Whenever I use the CSI provider on your exact Vault version, it all goes smoothly and only a single lease is created.
However, this was not the case when a secret could not be read: rather than 1 lease it would create multiple, one for each retry, but this was a finite amount and the leases expired. Perhaps checking through audit logging and debug logging whether all secrets are read on the first try within the timeout would bring more to light?
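If audit logging isn't enabled yet, a minimal sketch (the file path is illustrative and must be writable by the Vault pods):

# enable a file audit device and confirm it
vault audit enable file file_path=/vault/audit/vault_audit.log
vault audit list

# then grep the audit log for your secret path to see how often it is read
# and whether any reads fail and get retried
grep '<your-secret-path>' /vault/audit/vault_audit.log | tail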
One thing that I did want to ask you is to check your secrets store CSI driver version. Could you share this perhaps?
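If it helps, something along these lines should show it (release name, namespace and labels depend on how the chart was installed):

# Helm release version of the Secrets Store CSI Driver
helm list -A | grep -i secrets-store

# image tags actually running in the driver DaemonSet
kubectl get daemonset -A -l app=secrets-store-csi-driver \
  -o jsonpath='{range .items[*]}{.metadata.name}{": "}{.spec.template.spec.containers[*].image}{"\n"}{end}'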
Other than that I don't know anything specific about your Vault setup, which makes it very hard to judge what is going wrong. 2 RPS could still hide some misconfiguration that isn't caught by that particular metric.
To go further I would have to know more, and since this is sensitive information I can understand if that's out of reach. It's up to you either to reach out to HashiCorp themselves or to share it here. (If you were to share this information, please make sure it's cleared by whoever is in charge and disclose it securely. In general I recommend against sharing it, as it's your own personal Vault setup!)
Sorry if this wasn't what you wanted to hear. The CSI driver seems to function as expected on the latest (Helm) version, with no out-of-the-ordinary test cases.
@RemcoBuddelmeijer thanks for your time.
Regarding the CSI driver - I'm using 1.0.0 from the Secrets Store CSI Driver Helm chart repository, and I can see that the newest one is 1.1.1.
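If upgrading turns out to matter here, it should be just something like this (release name and namespace are assumptions based on a default install):

helm repo add secrets-store-csi-driver https://kubernetes-sigs.github.io/secrets-store-csi-driver/charts
helm repo update
helm upgrade csi-secrets-store secrets-store-csi-driver/secrets-store-csi-driver \
  --namespace kube-system --version 1.1.1 --reuse-values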
Here is what I was able to see while looking at the nginx ingress controller logs (my Vault sits behind it):
No secret data leaked here - I’ve checked
Every 2 minutes we have a login POST and a GET of the secrets… and for some reason this is repeated 5-6 seconds later. Every login creates a new token and a new lease, I assume?
When it comes to this specific app's config, it's using a SecretProviderClass object
configured with 5 keys - all coming from a single secret path:
/v1/app-secrets/data/some-random-app/prod
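For context, the SecretProviderClass looks roughly like this - names, the address and the keys below are sanitized / illustrative, but the shape is the same (2 of the 5 keys shown):

cat <<'EOF' | kubectl apply -f -
apiVersion: secrets-store.csi.x-k8s.io/v1
kind: SecretProviderClass
metadata:
  name: some-random-app-prod
spec:
  provider: vault
  parameters:
    vaultAddress: "https://vault.example.internal"   # sits behind the nginx ingress
    roleName: "some-random-app"                      # Kubernetes auth role
    objects: |
      - objectName: "db-password"
        secretPath: "app-secrets/data/some-random-app/prod"
        secretKey: "db-password"
      - objectName: "api-key"
        secretPath: "app-secrets/data/some-random-app/prod"
        secretKey: "api-key"
EOF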
What troubles me is:
why is this pattern repeated twice every 2 minutes? It seems like a single login and get-secret should be enough, right?
while I understand why we need to check secrets every 2 minutes (secret rotation), it looks like without some kind of token caching this solution will scale poorly. We are talking about ~45k lease objects per month per application (per SecretProviderClass to be exact - I'm not sure how it behaves when there are multiple secret paths, and not only keys, under the same SecretProviderClass object). With 20 apps per cluster x 4 clusters (nothing extraordinary, I believe), we end up with close to 4 million lease objects per month, which in our case (extrapolating from the nearly 2 million we already have) translates to a Vault instance with > 20 GB of memory used and startup / restart times counted in hours.
So my question is: should we shorten the TTL on these leases from 1 month to, let's say, 1 minute, to somehow keep their number under control? Or will constant token revokes every minute kill the CPU?
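If shortening is the way to go, the TTL of these login tokens is controlled on the Kubernetes auth role (role name, service account, namespace and policy below are placeholders; writing a role replaces its whole config, so every existing parameter has to be re-specified):

vault write auth/kubernetes/role/some-random-app \
    bound_service_account_names=some-random-app \
    bound_service_account_namespaces=prod \
    token_policies=some-random-app-read \
    token_ttl=60s \
    token_max_ttl=60s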
OK - I know why this login / get pattern is repeated after 5s: it is done separately for every single pod in a ReplicaSet… and this particular service has 2 replicas. When I scaled it to 3, I got 3x login and get secret. Not so optimal, I must say.
Looks to me like you might be better off using the Vault Agent rather than the CSI driver for the time being. I will have a look at the Vault CSI Provider and see what can be done to improve upon this.
The issue here really seems to be in the authentication part rather than any type of secret caching. Caching will improve the provider a lot, but the leases are a huge deal as they are being tracked in memory. A lease shouldn’t have to be created every 5s, not even every 2m.
Would having a look at the Vault Agent be something you’d be interested in?
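For reference, a minimal sketch of what the agent route looks like, assuming the Vault Helm chart is installed with injector.enabled=true and a matching Kubernetes auth role exists (names below are placeholders):

kubectl patch deployment some-random-app --type merge -p '
spec:
  template:
    metadata:
      annotations:
        vault.hashicorp.com/agent-inject: "true"
        vault.hashicorp.com/role: "some-random-app"
        vault.hashicorp.com/agent-inject-secret-app: "app-secrets/data/some-random-app/prod"
'

The injected agent sidecar logs in once, keeps renewing its own token, and renders the secret to /vault/secrets/app inside the pod, so it avoids the login-per-poll pattern above.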
I was not a big fan of the agent (last time I checked it) since it requires an extra sidecar per secret-aware workload. I will validate it once again.
What is “broken” in the current CSI driver implementation in my opinion:
it does not make use of the lease TTL - instead rotation-poll-interval from here is what defines the number of leases created per hour / month etc. A token created via a single login should be cached and reused for the TTL it was created with (a stopgap for the poll interval is sketched after this list).
the CSI driver should not make requests (and logins) per pod in the ReplicaSet - it is counter-intuitive that a deployment of 100 pods triggers 100 logins and 100 GETs for the same secret every rotation-poll-interval (2 minutes by default). It also creates inconsistency in the secret itself, since the sync interval is bound to pod lifetime (the counter starts when the pod is created), so there can be a window where the secret has been refreshed in some pods but not yet in others - and in the worst case that window lasts a full rotation-poll-interval, which is definitely not desired.
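Until that is fixed upstream, the only knob I see is the driver's own rotation settings; a stopgap sketch against the Secrets Store CSI Driver Helm release (release name and namespace are assumptions), at the cost of slower secret rotation:

helm upgrade csi-secrets-store secrets-store-csi-driver/secrets-store-csi-driver \
  --namespace kube-system --reuse-values \
  --set enableSecretRotation=true \
  --set rotationPollInterval=30m

Each poll still means one login and one GET per pod; it just happens 15x less often than with the 2m default.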
If having a sidecar for every single one of your deployments isn't an option, then sadly that leaves things as they are for now.
I 100% agree, and I think this should be fixed in a way that survives updates of the Secrets Store CSI Driver. Right now I see a lack of real API usage - I'd rather see some of that than just API objects.
From looking at the GitHub issues it does seem like they are aware of this and have a plan to work towards it in the future. Either way this should be fixed, at least in the interim.
I think this is where it starts becoming a bit hard. v1.0.0 has just been released, and with it the first stable release of the CSI driver itself. A lot of things couldn't be put into place yet, as there either wasn't enough time or it wasn't clear what might be introduced and what not.
Time will fix these issues, as I am sure the Vault team is aware that making 100 requests for 100 pods isn't sustainable.
How about we start off by creating a (number of) issue(s) on the GitHub repository and linking this thread? I can do this after doing some more research into the Vault CSI Provider itself.