After some critical failures and other fun (described in other posts that nobody seems to reply to), I figured: alright, I’ve got one node up and running using a filthy trick that I won’t repeat here, I’ve taken a snapshot, so I can just spin up an entirely new cluster from said snapshot.
The snapshot is 29 GB. When you try to restore it, you get an error: “request body could not be read”. Okay, fine. Google suggested increasing Vault’s client timeout. Done, same error. Okay, maybe it tries to dump the entire 29 GB to the Vault server in one sitting, so let’s remove the max request size limit while we’re at it.
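For anyone following along, these are the two knobs I mean: the CLI timeout can be bumped via the `VAULT_CLIENT_TIMEOUT` environment variable, and the request size cap lives in the listener stanza. A rough sketch of the latter (the address and TLS settings are just placeholders for my setup, not recommendations):

```hcl
listener "tcp" {
  address          = "127.0.0.1:8200"
  tls_disable      = true
  # The default cap is 32 MiB; a value <= 0 disables the limit entirely
  max_request_size = 0
}
```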
That solved the request body error, but now I just get “broken pipe” every time. I’m restoring on the machine the Vault server runs on, so I don’t see how networking can be an issue when you’re sending to 127.0.0.1.
This kind of pisses me off, because I’d assume that Vault’s recommended way of doing backups and restores would actually work, but it’s obviously broken somewhere if it can’t be used to restore a backup. Does anyone have any ideas? Any at all, don’t be shy…
What does resource consumption look like on the server running the restore?
I’m wondering if it runs out of RAM?
Also, are you using the Vault CLI to do the restore, or curl against the REST API?
Forgot I made this post. The problem was solved by increasing http_read_timeout in the Vault listener block, which is not entirely obvious given the error you see (it would probably be better if Vault responded with an error message indicating the request took longer than http_read_timeout, so you’d have a vague idea of where to start looking).
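For anyone who lands here with the same problem, the listener block ended up looking roughly like this (the address and TLS settings are placeholders, and 30m is just an example that’s comfortably longer than the upload takes):

```hcl
listener "tcp" {
  address           = "127.0.0.1:8200"
  tls_disable       = true
  # Default is 30s, nowhere near enough time to upload a 29 GB snapshot
  http_read_timeout = "30m"
  # Keep the request size cap disabled too, or the upload gets rejected outright
  max_request_size  = 0
}
```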
To answer your question: memory consumption goes through the roof. I resorted to using curl because it gives me a little more information (-v) than the Vault CLI does during the request. On a 64 GB machine, memory usage climbs all the way up until Vault itself gets shot in the head by the OOM killer. You can avoid that by adding at least 64 GB of swap space, but then the process takes, well, a very long time.
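For completeness, the restore call I was making looked roughly like this (token, address and file name are placeholders; I used the force variant since it was a brand-new cluster):

```shell
# -v shows request/response details the vault CLI hides.
# snapshot-force skips the cluster-ID match check, which you need when
# restoring into a freshly bootstrapped cluster.
curl -v \
  --header "X-Vault-Token: $VAULT_TOKEN" \
  --request POST \
  --data-binary @vault-backup.snap \
  "$VAULT_ADDR/v1/sys/storage/raft/snapshot-force"
```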
On a 128 GB machine you’ll use all of that, plus a few GB of swap, to restore the snapshot. Once all is said and done it does work, but Vault’s out-of-the-box configuration doesn’t allow snapshots over a few GB to be restored, and it seems you need at least four times the snapshot’s size in RAM in order to restore it. Vault apparently reads the entire thing into memory, then writes it to disk and lets the FSM do its thing; for large uploads, reading everything into memory is considered “bad juju”, and it should realistically start streaming to disk at some point so you don’t need godawful amounts of memory (or swap).
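If anyone needs the swap workaround, a minimal sketch (size and path are just examples):

```shell
# Create a 64 GiB swap file so the restore survives the memory spike
# instead of the OOM killer taking Vault down mid-restore.
fallocate -l 64G /swapfile   # use dd instead if your filesystem can't back swap with fallocate
chmod 600 /swapfile
mkswap /swapfile
swapon /swapfile
```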
Glad to hear you got it sorted.
Largest snapshot I work with is 3 GB, so you are in another league!
I blame edge devices. Also, our Vault is at least 5 years old, so it’s picked up some baggage along the way. From a real-world usage perspective, we have various approle mounts, a ton of PKI mounts (which don’t save leases, because the snapshot would be even bigger if they did), and a lot of database mounts with a lot of roles.
We have tuned things to use as short a lease time as we can get away with, but… it still ends up being pretty big. We also use approles to get tokens for the Vault agents we run on our servers, and we’ve got a fair number of those. Taking all of that into account, it tends to grow and accumulate a fair bit. We did have an issue where a few servers were absolutely hammering Vault for a week or two (during my vacation, when nobody bothered looking at the audit log and I hadn’t quite gotten my round tuits out to set up a rate alert), so that didn’t help either.
I opened issues on GitHub for both of my “grievances” (slightly more info in error messages, and perhaps streaming large uploads to disk instead of holding them in memory), so hopefully my (temporary) misery will be to someone’s benefit in the future.