Caught a bit of a break. The two nodes mentioned above were no longer indicating they knew anything about my secrets or existing mounts, but the one whose SD card went berserk actually still had an intact copy of the Vault and Raft databases. I just copied the entire data dir and config and moved it over to my laptop to see if I could untangle the mess.
On my laptop (a 6-core Tiger Lake processor with 16GB of RAM), spinning up the node kicked off a massive revocation run, which twice caused the machine to run out of memory and reboot itself.
After spending a whole evening churning on revocations I am now just getting a long string of errors which state the following:
[ERROR] expiration: failed to revoke lease: lease_id=pki_int/issue/agent-cluster/1vpybmzDNroHzYWdVDvSVSOm error="failed to revoke entry: resp: (*logical.Response)(nil) err: error encountered during CRL building: error storing CRL: put failed due to value being too large; got 1413409 bytes, max: 1048576 bytes"
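For a sense of scale, here is a quick Python sketch that pulls the two byte counts out of that error and works out how far over the limit the CRL is. The string is just the tail of the message quoted above; nothing here talks to Vault.

```python
import re

# Tail of the error message quoted above.
error = "put failed due to value being too large; got 1413409 bytes, max: 1048576 bytes"

# Extract the attempted size and the storage limit from the message.
match = re.search(r"got (\d+) bytes, max: (\d+) bytes", error)
got, limit = (int(g) for g in match.groups())

overshoot = got - limit
print(f"CRL size: {got} bytes, limit: {limit} bytes")
print(f"over by {overshoot} bytes ({overshoot / limit:.1%} above the limit)")
```

The limit works out to exactly 1 MiB, and the CRL is 364833 bytes (roughly a third) over it.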
A bit of Googling reveals what I already know and am attempting to remedy: the CRL is too full, I should stop storing leases, etc. But I’m not sure what steps are actually available to me.
Running a tidy process with a 1s safety buffer against my pki_int mount while the agent is alive does not appear to have much effect. Additionally, there doesn’t appear to be a setting that I can find which would allow me to raise that maximum temporarily just to get me out of this jam.
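A tidy request of this kind might look roughly like the following against Vault's HTTP API. This is a minimal sketch rather than a verified fix: the `pki_int` mount and the 1s safety buffer come from the description above, while the address `http://127.0.0.1:8200`, the placeholder token, and the helper name `build_tidy_request` are assumptions for illustration. The request is only built and printed, not sent.

```python
import json
import urllib.request

# Assumptions for illustration: a local Vault address and a placeholder token.
VAULT_ADDR = "http://127.0.0.1:8200"
VAULT_TOKEN = "REPLACE_ME"


def build_tidy_request(mount="pki_int", safety_buffer="1s"):
    """Build (but do not send) a POST to the PKI tidy endpoint of a mount."""
    payload = {
        "tidy_cert_store": True,     # tidy the certificate store
        "tidy_revoked_certs": True,  # drop expired revoked certs, which also prunes the CRL
        "safety_buffer": safety_buffer,
    }
    return urllib.request.Request(
        url=f"{VAULT_ADDR}/v1/{mount}/tidy",
        data=json.dumps(payload).encode("utf-8"),
        headers={"X-Vault-Token": VAULT_TOKEN, "Content-Type": "application/json"},
        method="POST",
    )


req = build_tidy_request()
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
```

Note that tidy only removes certificates that are already past their expiry plus the safety buffer, so certs that are revoked but not yet expired would presumably stay on the CRL.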
At this point, every time I run the server agent, it boots up and then attempts to expire all these certs but can’t. After a stretch of my laptop being maxed out, I get the error message:
http2: received GOAWAY [FrameHeader GOAWAY len=8], starting graceful shutdown
This appears to indicate that the agent reached an idle timeout threshold, which seems curious given how hard the machine is working. I can tinker with the idle_timeout value to see if that buys me anything.
Short of that, my best guess is that I may have to a) spin up a 32GB beast of an instance on AWS and see if it can churn through the work better than my laptop, or b) keep my laptop responsive enough that it doesn’t kill the Vault agent process, so I can attempt to get a current list of certs which could then be revoked manually. But I’m not sure if manual revocation will clear things out of the CRL.
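Option b) could be sketched along these lines in Python, using the PKI engine's list-certs and revoke endpoints. Treat it as an untested outline: the address, the placeholder token, the helper names, and the example serial number are all made up for illustration, and the requests are only built, not sent. It also does not answer whether revoking by hand shrinks the CRL.

```python
import json
import urllib.request

# Assumptions for illustration: a local Vault address and a placeholder token.
VAULT_ADDR = "http://127.0.0.1:8200"
VAULT_TOKEN = "REPLACE_ME"


def build_list_certs_request(mount="pki_int"):
    """Build a request that lists certificate serial numbers for a PKI mount."""
    return urllib.request.Request(
        url=f"{VAULT_ADDR}/v1/{mount}/certs?list=true",
        headers={"X-Vault-Token": VAULT_TOKEN},
        method="GET",
    )


def build_revoke_request(serial_number, mount="pki_int"):
    """Build a request that revokes one certificate by serial number."""
    payload = {"serial_number": serial_number}
    return urllib.request.Request(
        url=f"{VAULT_ADDR}/v1/{mount}/revoke",
        data=json.dumps(payload).encode("utf-8"),
        headers={"X-Vault-Token": VAULT_TOKEN, "Content-Type": "application/json"},
        method="POST",
    )


list_req = build_list_certs_request()
# "aa:bb:cc" is a made-up serial; real ones would come from the list response.
revoke_req = build_revoke_request("aa:bb:cc")
print(list_req.get_method(), list_req.full_url)
print(revoke_req.get_method(), revoke_req.full_url, revoke_req.data.decode("utf-8"))
```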
As always, any insight or advice would be welcome.