Cannot clean up leases

As was noted in "Sys/expire/id/auth/aws/login full of entries ...?" (#2 by pcolmer), we’ve got a lot of AWS leases. I’ve tried using the script from "Vault may not be removing expired tokens from Consul" (Issue #1815, hashicorp/vault on GitHub), but I’m not getting anywhere fast.

I’ve also tried using vault lease revoke -prefix auth/aws/login but I just get this error:

Error revoking leases with prefix auth/aws/login: context deadline exceeded
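
For anyone hitting the same error from a script rather than the CLI, here is a minimal Python sketch that calls the revoke-prefix endpoint with a longer client-side timeout. It assumes the deadline is being hit on the client side, which the error above does not confirm. The function names and the 600-second default are made up for illustration, and the token needs sudo capability on the endpoint.

```python
import urllib.request


def build_revoke_prefix_request(vault_addr, token, prefix):
    """Build (but do not send) a PUT to sys/leases/revoke-prefix/<prefix>."""
    url = f"{vault_addr.rstrip('/')}/v1/sys/leases/revoke-prefix/{prefix.strip('/')}"
    return urllib.request.Request(
        url,
        data=b"",
        method="PUT",
        headers={"X-Vault-Token": token},
    )


def revoke_prefix(vault_addr, token, prefix, timeout_seconds=600):
    """Send the request, waiting up to timeout_seconds (illustrative default)."""
    req = build_revoke_prefix_request(vault_addr, token, prefix)
    with urllib.request.urlopen(req, timeout=timeout_seconds) as resp:
        return resp.status
```

Usage would be something like revoke_prefix(os.environ["VAULT_ADDR"], os.environ["VAULT_TOKEN"], "auth/aws/login"). If you stay with the CLI, the VAULT_CLIENT_TIMEOUT environment variable is the equivalent knob to try.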

If I use the UI and navigate down to /ui/vault/access/leases/list/auth/aws/login/, it says it is getting a 500 error back from Vault.

It turned out the cleanup just took a long time to work. Leaving Vault alone overnight resulted in the desired cleanup of (most of) the AWS tokens and the UI was able to display the remainder again.

There is at least one open issue on GitHub asking for pagination in the API and I think that definitely needs to be added wherever Vault could be returning a lot of values.

Can you share the pagination GH issue? I think some folks would pile on for traction… there are issues with it, in terms of sorting, filtering, cursor-tracking of API results… but, as a start, it could help in situations like this.

Curious - how many leases and how long was their TTL?

Paginate approle/role/:role_name/secret-id-accessor/lookup · Issue #8598 · hashicorp/vault (github.com)

The TTL was one month. I’m not sure how many leases we had but the Consul snapshot size went down from 494,614,448 to 1,700,848. It has started creeping up again (now at 5,207,067) so I’ve clearly still got some scripts that aren’t revoking the IAM auth token just before the script finishes.

As a further size reference point, I’ve just invalidated all leases and the Consul snapshot is down to 334,942. I hadn’t invalidated all of the leases over the Christmas break - I was going to let some of them expire naturally but, with the numbers climbing up again, I need to get Vault back to zero leases to make it easier to spot which scripts are not revoking their leases properly.

I don’t think snapshot size is a good indicator. There’s a learn article that shows how to inspect Consul KV data to see Vault storage numbers…
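
As a rough illustration of inspecting the Consul KV data directly, the sketch below lists keys under Vault's storage prefix and counts lease entries per auth path. It assumes Vault's Consul storage uses the default "vault/" path and that lease entries sit under sys/expire/id/ (as the thread title linked above suggests); the function names are hypothetical. With ACLs enabled you would also need to send an X-Consul-Token header.

```python
import json
import urllib.request
from collections import Counter


def list_keys(consul_addr, prefix):
    """Return all key names under prefix using Consul's KV keys-only listing."""
    url = f"{consul_addr.rstrip('/')}/v1/kv/{prefix}?keys=true"
    with urllib.request.urlopen(url, timeout=120) as resp:
        return json.load(resp)


def count_by_mount(keys, lease_root="vault/sys/expire/id/"):
    """Count lease keys grouped by the first three path segments after
    lease_root, e.g. 'auth/aws/login'. Keys outside lease_root are ignored."""
    counts = Counter()
    for key in keys:
        if not key.startswith(lease_root):
            continue
        parts = key[len(lease_root):].split("/")
        counts["/".join(parts[:3])] += 1
    return counts
```

Calling count_by_mount(list_keys("http://127.0.0.1:8500", "vault/sys/expire/id/")) gives a per-path lease count, which is a more direct number than snapshot size. Note that Consul answers with a 404 when no keys match, which urlopen raises as an HTTPError.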

I think you want to figure out a sane TTL for these tokens, i.e., if your script takes 5 mins 99% of the time, set the TTL to 30 or 60 mins.

Thank you for that. I don’t think it is possible to do that with snapshots, though, so I can only inspect current Vault numbers.

Unless I’m misreading or misunderstanding "AWS - Auth Methods - HTTP API | Vault by HashiCorp", there isn’t an option to specify a TTL for the AWS auth login process. It is these tokens that were being abandoned by the scripts.

I’ve now got things more under control. I’m currently checking daily to make sure that none of the automation processes are still leaving AWS or approle tokens lying around. So far, they are being revoked when finished with, so that is keeping things much cleaner than previously.

I’m not terribly deep in this area, but isn’t the token created based on the AWS auth role setup?
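
To make the question concrete: as far as I know, the AWS auth role endpoint (not the login endpoint) accepts the common token_ttl and token_max_ttl parameters. The sketch below only builds such a request. It assumes the auth method is mounted at "aws"; the function name and the 30m/60m values are illustrative, taken from the TTL suggestion earlier in the thread. Check how your Vault version treats fields omitted from a role update before sending this against an existing role.

```python
import json
import urllib.request


def build_role_ttl_request(vault_addr, token, role,
                           token_ttl="30m", token_max_ttl="60m", mount="aws"):
    """Build (but do not send) a POST that sets token TTLs on an AWS auth role."""
    url = f"{vault_addr.rstrip('/')}/v1/auth/{mount}/role/{role}"
    body = json.dumps({
        "token_ttl": token_ttl,          # lifetime of tokens issued at login
        "token_max_ttl": token_max_ttl,  # upper bound even if renewed
    }).encode()
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={"X-Vault-Token": token, "Content-Type": "application/json"},
    )
```

The effective TTL may still be capped by the mount's or the system's maximum lease TTL.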

I’ve gone through the documentation again and I cannot find anything that allows me to set a lifetime on the initial login token that is generated. Maybe there is something at a higher level than “AWS Auth” (i.e. a more global configuration) but, ultimately, I think I’ve solved my particular use-case by explicitly revoking the login tokens when I’m done with them.
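
For reference, the "revoke when done" approach can be sketched as a Python context manager that calls the token revoke-self endpoint even if the script fails partway through. This is only an illustration, not the actual scripts discussed here; vault_session, send_request and build_revoke_self_request are hypothetical names, and the login step that produces the token is left out.

```python
import contextlib
import urllib.request


def build_revoke_self_request(vault_addr, token):
    """Build a POST to auth/token/revoke-self for the given token."""
    url = f"{vault_addr.rstrip('/')}/v1/auth/token/revoke-self"
    return urllib.request.Request(
        url, data=b"", method="POST", headers={"X-Vault-Token": token}
    )


def send_request(req):
    """Default sender: perform the HTTP call and return the status code."""
    with urllib.request.urlopen(req, timeout=30) as resp:
        return resp.status


@contextlib.contextmanager
def vault_session(vault_addr, token, send=send_request):
    """Yield the token, then revoke it on exit, including on exceptions."""
    try:
        yield token
    finally:
        send(build_revoke_self_request(vault_addr, token))
```

A script would wrap its work in "with vault_session(addr, login_token) as token:" so the login token and its lease are gone as soon as the block exits, rather than lingering until the TTL runs out.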