Vault HA cluster w/Raft storage outage on ARMhf :: Out of memory

I’ve been setting up a Hashi-stack on a cluster of 9 Raspberry Pis. Each node has a Vault, Consul, or Nomad agent running in server mode, with the other two products in client mode. All have consul-template issuing Nomad and Consul TLS certs from Vault. Vault hasn’t been stable enough to switch to short-lived certs, so they are long-lived for the moment.

I have had my Vault cluster crash and burn previously, and I was just getting it back up and running. After having it run smoothly for about a week, this one crashed and burned as well.

The original failure I saw involved gRPC sync errors. This time the error I get is: Error initializing storage of type raft: cannot allocate memory

Looking at top to see system resource consumption, I never saw much RAM being consumed, but the 4 CPU cores were getting hit pretty hard.

As of this moment, I can’t get any of the Vault nodes running, which also means the Nomad and Consul nodes aren’t really able to do much either as they rely on Vault for fresh certs for the agents themselves and the needs of Nomad jobs.

One consistent thing I’ve noticed between the previous broken cluster and this one is that the vault.db file is ~1GB in size. I’m not well versed in all of what gets stored in the vault.db file, but I hadn’t migrated any secrets back into the new setup. The only thing this Vault service was set up to do in its short life was generate the aforementioned TLS certs. It’s possible that a lot of unrevoked, 24hr certs for the various agents and jobs are what is filling up the vault.db file, but other than that, I have no idea what would make it that large. And I only had 6 small services running, which have had no issue running independently of the HashiStack on these RPis.

My ulimit -l value is unlimited, so I don’t think the problem is a lack of available locked memory, setting aside the fact that the Raspberry Pi 4s only have 4GB of RAM, which is the stated minimum for running Vault.

What’s unclear is whether that minimum goes up when using local Raft storage rather than Consul, where some of that burden might otherwise be shifted to the Consul cluster.

Without getting too far ahead of myself in terms of what could be happening here, would using an external storage backend like DynamoDB alleviate some of the RAM requirements for the node?

Any thoughts on what I could try to get the cluster back up and running?

Thanks in advance,
Sam

Hi dehuszar,

Cool setup, I’m doing something similar at home, though I’m still in the process of incorporating Vault.

I haven’t played with raft integrated storage too much yet, but I understand that the boltdb files used for local storage can grow pretty large even without there being much data stored in Vault. You can see what’s actually there by enabling sys/raw and doing list/read queries against it.
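If it helps, enabling that is a single top-level setting in the server config (the file path here is just an example):

# in the Vault server config, e.g. /etc/vault.d/vault.hcl
raw_storage_endpoint = true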

The boltdb file is mmap’d, so conceivably that could be related to your “cannot allocate memory” error, though if it’s a 1GB file and you have 4GB that seems like a stretch.

Yes, an external storage engine would no doubt reduce your memory usage. What you’re doing isn’t so outlandish though, so if you’re willing I’d be happy to work with you to try to figure out what’s going wrong with raft storage. You may have to re-initialize your cluster if you can’t get it running as is though. Once you have a working cluster again, we could use the pprof endpoint to look at where your memory is being used over time.

If you’re not already taking regular raft snapshots, that would probably be a good idea too.
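Something along these lines would do it (the path and filename are arbitrary):

# save a snapshot of the raft data, run against the active node
$ vault operator raft snapshot save /backups/vault-$(date +%F).snap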


I would be happy to have any assistance. Accessing sys/raw seems like something I could do in recovery mode. Is that my first step?

What info would be helpful to post here?

sys/raw is automatically enabled for you in recovery mode, yes, but it sounds like you can’t actually start Vault at present, and I don’t imagine that would change with recovery mode. Can you copy the data to another system with more memory and start Vault there?

Can do. What’s my first step once I’m in recovery mode? I’ve done this once before, a month or so ago… I could see a bunch of entries, but I wasn’t sure how to understand what they were; only serial numbers of stored items.

Well you don’t really need recovery mode. The main purpose of recovery mode is so that if Vault won’t start, you can delete or edit problematic storage entries to fix that, like people have done in the past in Consul directly when using the Consul storage backend. In your case, once you bring it up elsewhere with more memory, it should come up normally, regardless of whether or not you’re in recovery mode.

So I suggest you start without recovery mode, adding the config option to enable sys/raw (link in my first post). That should enable you to take a raft snapshot. You could then try moving aside your existing data dir on your pi, starting a new Vault instance, and restoring the snapshot. Conceivably what’s happening is that garbage has accumulated in the boltdb file, making it too big to memory-map, so starting from a clean slate and restoring the snapshot may allow it to start up normally.
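Roughly, the sequence I have in mind (paths, service names, and the -force flag are assumptions on my part; -force should only be needed if the fresh instance was initialized with different keys):

# 1. on the machine with more memory, once Vault is up and unsealed
$ vault operator raft snapshot save vault-backup.snap

# 2. back on the pi: stop Vault and move the bloated data dir aside
$ sudo systemctl stop vault
$ sudo mv /opt/vault/data /opt/vault/data.old

# 3. start a fresh instance, initialize/unseal it, then restore the snapshot
$ sudo systemctl start vault
$ vault operator raft snapshot restore -force vault-backup.snap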

As to exploring the data with sys/raw: we want to figure out what live data you have. It’s almost certainly going to live within your mounts and/or within the expiration system. So I would try e.g.

# find out how many entries are in the lease expiration store
$ vault list sys/raw/sys/expire/id

# get the logical/<uuid> mount path of your pki engine
$ vault read -format=json sys/raw/core/mounts | jq

# see where the pki data is and how much of it there is
$ vault list sys/raw/logical/yourPKImountUUID/revoked
$ vault list sys/raw/logical/yourPKImountUUID/crl
$ vault list sys/raw/logical/yourPKImountUUID/certs

I have copied the Vault data and config to my laptop and am trying to adjust the config so that it can run locally. The node still wants to point to the original IP no matter what VAULT_… vars I set or -address flags I add to the commands.

Any advice on how to reset the agent’s expectations about which nodes are out there and what its current address is?

Hmm, I haven’t played with Raft enough to run into this problem, though I’m not terribly surprised based on other things I’ve seen in the code. Does it fail to start up? What’s the actual error?

No, it starts fine; it’s just trying to reach out to other nodes from the cluster, and when I use the CLI commands, it attempts to talk to the previous IP for the node whose data I am using. Address overrides won’t change the IP address it tries to resolve to, and I get a permission denied against that IP, which doesn’t have an active Vault agent running.

If I try to add the disable_clustering attribute to my config, then I get a segfault:

panic: runtime error: invalid memory address or nil pointer dereference

Maybe this is something for an issue on github. :wink:


That panic should be fixed in 1.4.1, see https://github.com/hashicorp/vault/pull/8784.

I am using 1.4.1, sadly.

You’re right. Taking it over here if anyone else wants to follow along. I’m probably going to just make backups of all the Raft databases and re-init using the DynamoDB storage backend. But it would be cool to learn how to get in there and root around. I’m sure it’s a critical skill to have once I have nodes in production.

Hey. So. After a long while, it looks like I’ve run into this problem again. After the last round of this issue, my solution was just to buy new Raspberry Pis with 8GB of RAM. At this point, there’s not a 16GB version I can upgrade to and I should probably figure out how to resolve this issue.

What’s the best way to delete 70,000-ish certificates? Ideally it would be great to get some information about them and understand where this explosion of certs came from (all my consul-template TTLs are set to 24hr and I’ve run the tidy command), but my Raspberry Pis are barely hanging in there. I am unable to get the vault list sys/raw/logical/<id>/certs command to return anything, as it blows past the context deadline.

Just getting the first 100 results in the Vault UI was a bit of work. I’m not seeing any way in the API docs or the CLI help to even paginate results.

Anyone who’s had to work this problem and might have some advice, I’d be grateful.

After a little bit of investigating, I’m guessing that restarting the systemd job for consul-template on a schedule may have caused multiple listeners to live in memory and duplicate certificates to be issued.

I’ve switched to a crontab config where I run the certificate-generation configs for consul-template with the -once flag on that same schedule.
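For reference, the crontab entries now look roughly like this (the path and schedule are just what I happen to use):

# render the agent certs once a night instead of leaving consul-template resident
0 3 * * * /usr/local/bin/consul-template -config=/etc/consul-template.d/consul-certs.hcl -once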

That still leaves me with the task of emptying out the certificate list. Having read certs from page 1 and page 7000-ish, they all appear to be from the last few days. The only thing that has changed in that time is updating to Consul 1.9.0.

Should probably check the consul-template repo for updates.

I’m considering just turning off the pki_int mount and redeploying the mount config to wipe the cert list. Not sure if that will work.

Any thoughts / input / advice would be appreciated

Hi @dehuszar,

First off, unless you’re making use of a CRL, you should consider turning off lease generation for certs: PKI - Secrets Engines - HTTP API | Vault | HashiCorp Developer
That should prevent this problem from happening again in future.
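Concretely, that’s the generate_lease parameter on your PKI role; something like this (the role name and the other parameters here are placeholders, and since writing a role replaces its definition, you’d include the role’s current settings too):

# re-write the issuing role with lease generation off (role name is hypothetical);
# keep the role’s existing parameters, since this write replaces the definition
$ vault write pki_int/roles/consul-agent generate_lease=false ttl=24h allowed_domains=consul.internal allow_subdomains=true

Note this only affects newly issued certs; the existing leases still need to be revoked or tidied.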

Re the vault list timing out: note that you can set VAULT_CLIENT_TIMEOUT and the listener timeouts (TCP - Listeners - Configuration | Vault | HashiCorp Developer) higher, which should allow you to work around that.
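For example (the values are arbitrary, and only the timeout-related listener fields are shown; your existing listener settings stay as they are):

# client side: let the CLI wait longer than the default 60s
$ export VAULT_CLIENT_TIMEOUT=600s

# server side: raise the listener timeouts in the Vault config
listener "tcp" {
  address           = "0.0.0.0:8200"
  http_read_timeout = "600s"
  http_idle_timeout = "10m"
}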

I’m considering just turning off the pki_int mount and redeploying the mount config to wipe the cert list. Not sure if that will work.

That sounds like you’re hoping you can unmount the pki secrets engine to clean up the certs and leases. That will be slow, as Vault will try to revoke all the secrets in a mount before unmounting. It may not work given your memory issues, since you say the Pis are nearly falling over already. Still, it’s worth a shot, as the remaining options are going to be harder.
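If you do go that route, the unmount itself is a one-liner (mount path taken from your posts):

# unmounting triggers revocation of all leases under the mount, which may take a long time
$ vault secrets disable pki_int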

Two remaining options I can see if that fails:

(1) use recovery mode (a rough sketch follows below), which will start Vault in a way that you can access its storage, without it starting all of its subsystems. That may relieve the memory pressure enough that you can delete the leases using the sys/raw API, which will require some understanding of how leases are represented in storage.

(2) take a snapshot, restore it on a beefier machine with more memory, and do the unmount or revoke-prefix there, then take a snapshot and restore that on the Pi cluster.
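For reference, option (1) would start roughly like this (the config path is a placeholder; in recovery mode you authenticate with a recovery token, generated via vault operator generate-root with its recovery-token option, and the exact lease paths under sys/raw will take some exploring):

# start the node in recovery mode against its existing config and data
$ vault server -recovery -config=/etc/vault.d/vault.hcl

# then, with VAULT_TOKEN set to the recovery token, inspect and delete lease entries
$ vault list sys/raw/sys/expire/id
$ vault delete sys/raw/sys/expire/id/<lease-path-from-the-list-above>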

But I’d try increasing the request/client timeouts before exploring these more extreme options.

This is great info, many thanks.

Not sure exactly what caused this. I think I either misunderstood what the consul-template workflow should have been for pumping out certificate and key pairs for node agents, or abused it. Because the cert and key need to be output in the same pass before the agent config gets reloaded, I was using systemd to keep consul-template alive and the

exec {
  command = "systemctl reload consul"
}

to reload the configuration after render.

This usually causes the consul-template process to exit upon completion, and a cron job restarts it at the designated time. But after having this setup run just fine for several months (since at least July), this problem has only cropped up just now, and from the brief sampling I’ve taken across the many thousand pages of certificates, they were all created on December 1st and expire on the 31st (30 days is my longest TTL for most everything in the system now that I’ve gotten consul-template to “work”).

Not sure what the one thing was (I’m guessing some kind of consul-template loop got formed somehow), but I’ve been doing this in my spare time and haven’t gotten metrics wired up to Prometheus/Grafana/Alertmanager just yet. So that’s probably my next move once I’ve made it out of this particular hole. I think disabling the systemd service config and using the cron job to pump out templates at the appropriate time via the -once flag should prevent any future cert explosions -- if my hypothesis is correct, anyway.

Unfortunately, I don’t have 3 beefier machines to run the cluster on, so my hope is that I figure out where my error was, remediate it, and with my limited needs in my homelab the 8GB pis should keep me going for a little while longer.

I did increase the http_idle_timeout and http_read_timeout. I’ll do the client timeout next. That sounds like just the thing.

My goal for remediation was just to buy myself enough bandwidth to revoke a batch of 5-10k-ish certificate serial numbers (I’ve gathered the serial numbers in use by the clusters as an exceptions list), run the tidy command, and see if that gives me enough of an opening to remove the rest.
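The tidy call I have in mind looks roughly like this (the safety_buffer value is just a guess on my part):

# prune expired certs and revocation entries from the cert store
$ vault write pki_int/tidy tidy_cert_store=true tidy_revoked_certs=true safety_buffer=1h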

If the client timeout setting doesn’t do it, I’ll try recovery mode and try turning off the CRL. I just set it up because the learn guides told me to. I don’t fully appreciate what I would want it for just yet, so I’ll do some reading in the meantime.

Again, I really appreciate your thoughts on the matter. Many many thanks!

Is it possible to revoke via prefixes using partial serial numbers, or does it only work with full path segments, e.g. vault lease revoke -prefix pki_int/30-? I ran this on several serial numbers that start with two-digit combinations which don’t intersect with my current cluster’s cert serials, and it didn’t throw an error (it returns: All revocation operations queued successfully!), but I’m still seeing the certs there and it doesn’t look like my cert list page count has changed much.

Expounding on this a little further, I’m not sure whether I’m getting the paths right in my CLI commands or the memory overages are clouding the outcome, but if I read a cert like so:

vault read pki_int/cert/05-87-be-5c-c5-16-37-60-72-8d-1d-4b-28-2c-66-9e-8e-f4-bd-34

I get back a cert with 0 as its revocation time.
I then try to revoke it with:

vault lease revoke pki_int/05-87-be-5c-c5-16-37-60-72-8d-1d-4b-28-2c-66-9e-8e-f4-bd-34

(or pki_int/cert/[…], or secret/pki_int/cert/[…], and a few other variations to boot), and re-reading the cert still shows the revocation time as 0, in spite of getting the expected “All revocation operations queued successfully!”. Going to the UI and revoking it… no problem. The next read from the CLI shows a revocation timestamp.

What path is expected here for a pki lease?

Going to try to mirror the REST request the UI is doing and do it via curl, but it seems like this could be better documented (or I totally missed something).
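For what it’s worth, the UI appears to hit the PKI revoke endpoint rather than the lease API, so the CLI equivalent I’m going to try is roughly this (I’m not sure whether it wants the hyphenated or colon-separated serial form):

# revoke by serial number via the PKI secrets engine, not the lease system
$ vault write pki_int/revoke serial_number=05-87-be-5c-c5-16-37-60-72-8d-1d-4b-28-2c-66-9e-8e-f4-bd-34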

A few updates after having attempted to chip away at this from a few different directions:

  • The cluster is unable to enter recovery mode. Not sure what the blocker is, but I am using AWS auto-unseal, and when I get to the step where I have to unseal to get my recovery token, once all three keys have been entered, after some delay I get Error posting unseal key: context deadline exceeded back. I’ve tried this on two different nodes and gotten the same results.
  • I am unable to create a recovery snapshot. It just hangs silently forever. Never times out or anything.
  • Whatever is happening caused so much disk thrashing that one of my Vault server Pis’ SD cards basically melted down and started throwing IO errors. Luckily I was able to get enough of the important bits off it to spin up a new version of the node on a fresh SD card, but I can’t get it to join the cluster. It returns the error * failed to join raft cluster: failed to join any raft leader node. Of course, if I do a vault operator raft list-peers, the other nodes see that the new node did join, but the joining node is not of the same opinion, and any attempt to unseal it returns errors that it has not yet been initialized. (The join commands I’m running are sketched after this list.)
  • Having left recovery mode on the remaining nodes, neither sees the other as a peer, though they both claim to be an active node on the same cluster ID. Attempts to manually join one to the other return joined: true, but there’s no change to the peer list. The nodes’ committed and applied index numbers are quite a bit off from each other.
  • I’m trying to see if I can at least read from the KV and save the trees all to JSON or something so I can start over again, but so far it’s not looking great.
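For reference, the join attempts on the rebuilt node look roughly like this (the leader address is a placeholder):

# on the rebuilt node: point it at the current leader's API address
$ vault operator raft join https://10.0.0.11:8200

# on an existing node: check what the cluster thinks the peer set is
$ vault operator raft list-peers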

Would love any further advice or comment