Vault HA cluster w/Raft storage outage on ARMhf :: Out of memory

Caught a bit of a break. The two nodes mentioned above no longer showed any sign of knowing about my secrets or existing mounts, but the one whose SD card went berserk actually still had an intact copy of the Vault and Raft DBs. I copied the entire data dir and config over to my laptop to see if I could untangle the mess.
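For anyone retracing this, the standalone setup on the laptop was minimal; a sketch is below, with paths and addresses as placeholders for my machine (this is a throwaway recovery config, not a recommendation). I then started it with vault server -config pointed at that file.

```hcl
# Minimal single-node recovery config pointing at the data dir copied
# off the SD card. Paths and addresses are placeholders.
storage "raft" {
  path = "/home/me/vault-recovery/data"  # copied Raft data dir
  # node_id omitted so the ID already stored in the data dir is reused
}

listener "tcp" {
  address     = "127.0.0.1:8200"
  tls_disable = true                     # local, throwaway recovery only
}

api_addr     = "http://127.0.0.1:8200"
cluster_addr = "https://127.0.0.1:8201"
```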

On my laptop (a 6-core Tiger Lake processor with 16GB of RAM), spinning up the node kicked off a massive revocation job, which twice caused the machine to run out of memory and reboot itself.

After spending a whole evening churning on revocations, I am now just getting a long string of errors like the following:

[ERROR] expiration: failed to revoke lease: lease_id=pki_int/issue/agent-cluster/1vpybmzDNroHzYWdVDvSVSOm error="failed to revoke entry: resp: (*logical.Response)(nil) err: error encountered during CRL building: error storing CRL: put failed due to value being too large; got 1413409 bytes, max: 1048576 bytes"

A bit of Googling reveals what I already know and am attempting to remedy: the CRL is too full, I should stop storing leases, and so on. But I'm not sure what steps are actually available to me.

Running a tidy operation against my pki_int mount with a 1s safety buffer while the server is up does not appear to have much effect. Additionally, I can't find a setting that would let me raise that maximum temporarily just to get out of this jam.
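For reference, the tidy run I've been kicking off looks roughly like this (the mount name and the aggressive 1s safety buffer are just what I'm using):

```sh
# Tidy the pki_int mount: clean up the cert store and revoked-cert
# entries, with a very aggressive 1s safety buffer.
vault write pki_int/tidy \
    tidy_cert_store=true \
    tidy_revoked_certs=true \
    safety_buffer=1s
```

The tidy call returns immediately and does its work in the background, so any effect only shows up in the server logs a while later.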

At this point, every time I start the server, it boots up and then attempts to expire all these certs but can't. After a stretch of my laptop being maxed out, I get the error http2: received GOAWAY [FrameHeader GOAWAY len=8], starting graceful shutdown, which appears to indicate that the server hit an idle timeout threshold; curious, given how hard the machine is working. I can tinker with the idle timeout value to see whether that buys me anything.
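If I do go down that road, the knob I have in mind is the listener's http_idle_timeout; something like the stanza below in the recovery config, with the 30m value picked arbitrarily (I haven't confirmed this is actually what's tripping the GOAWAY):

```hcl
# Raise the HTTP idle timeout on the listener (default is 5m).
# The 30m value here is arbitrary; adjust as needed.
listener "tcp" {
  address           = "127.0.0.1:8200"
  tls_disable       = true
  http_idle_timeout = "30m"
}
```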

Short of that, my best guess is that I may have to a) spin up a 32GB beast of an instance on AWS and see if it can churn through the work better than my laptop, or b) keep my laptop responsive enough that the Vault process doesn't get killed, so I can pull a current list of certs and revoke them manually. But I'm not sure whether manual revocation will clear things out of the CRL.
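Option b) would look something like this; the serial is a placeholder, and I don't yet know whether this route actually shrinks the CRL any faster:

```sh
# List the serials of certs the pki_int mount is still tracking
vault list pki_int/certs

# Revoke a single cert by serial number (placeholder serial shown)
vault write pki_int/revoke \
    serial_number="39:dd:2e:90:b7:23:1f:8d:d3:7d:31:c5:3b:88:8c:8f:a4:91:06:55"
```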

As always, any insight or advice would be welcome.

Quick update. I gave in and set up an EC2 instance with 8 cores and 32GB of RAM to crank away at the data dir I recovered from the aforementioned SD card. It spun up nicely and expired a few certs after I hit the PKI tidy endpoint, but then just endlessly reported that it couldn't revoke certs because the CRL was too large… for the last two days.

After a bit of reading, it looks like hammering the CRL rotation endpoint will flush out the certs that don't get properly removed due to resource constraints. It does not seem to open up much space each run, or at least the CRL fills back up quickly. I'm using the number and frequency of sustained, successful revoked-lease messages the server outputs as my metric: rotating the CRL once gave me a few, but they soon subsided and were overtaken by the flood of "your CRL hates you" errors. Putting a curl request to rotate the CRL in my crontab, running every minute, finally seems to be generating a steady stream of revoked-lease messages.
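Concretely, the cron entry is just a curl against the CRL rotation endpoint once a minute; the address and token file are placeholders for the recovery instance:

```
# crontab entry: rotate the CRL on pki_int every minute
# (address and token file are placeholders for the recovery box)
* * * * * curl -s -H "X-Vault-Token: $(cat /root/.vault-token)" http://127.0.0.1:8200/v1/pki_int/crl/rotate
```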

There is probably a better way to go about this, but how to get unstuck from this specific corner case is not well documented. The few Google hits I've found explain how a cluster gets into this sort of disarray and how to avoid the trouble next time, but not much about how to dig out of the trouble I'm in. I think this would make a good article for the Learn guides, or wherever the appropriate place is.

Not out of the woods yet, but I at least have enough horsepower to throw at my Vault DB that I've been able to save a snapshot of the node (such as it is), export a copy of my secrets in case I have to throw up my hands and start again, and experiment with a few strategies for draining the CRL. With any luck, I'll be able to fully drain the DB, return the cleaner data dir to the Pi, reset the remaining cluster nodes, and have them rejoin the network with new, non-lease-generating certificate configurations.
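For the record, the snapshot step and the sort of role change I have in mind look roughly like this; the agent-cluster role name comes from the errors above, and I'm assuming generate_lease and no_store are the right knobs to stop Vault tracking every issued cert:

```sh
# Save a snapshot of the Raft data as it stands now
vault operator raft snapshot save recovery.snap

# Reconfigure the issuing role so new certs neither create leases nor get
# stored. NOTE: writing a role replaces it, so a real update should include
# the full role definition (allowed_domains, TTLs, etc.), omitted here.
vault write pki_int/roles/agent-cluster \
    generate_lease=false \
    no_store=true
```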

Hopefully this is helpful to someone else. I'd love to get more clarity on whether there's a better path I could have taken to remediate.
