Vault HA cluster w/Raft storage outage on ARMhf :: Out of memory

I’ve been setting up a Hashi-stack on a cluster of 9 Raspberry Pis. Each node has a Vault, Consul, or Nomad agent running in server mode, with the other two products in client mode. All have Consul-Template issuing Nomad and Consul TLS certs from Vault. Vault hasn’t been stable enough to switch to short-lived certs, so they are long-lived for the moment.
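For reference, each agent’s cert is rendered with a consul-template stanza roughly like the one below; the PKI mount path, role name, common name, TTL, and reload command are placeholders rather than my exact config:

# Rough sketch of a per-agent consul-template config; mount path, role,
# common_name, ttl, and reload command are placeholders, not exact values.
vault {
  address = "https://127.0.0.1:8200"
  # the token comes from VAULT_TOKEN in the service's environment
}

template {
  contents    = "{{ with secret \"pki/issue/consul-agent\" \"common_name=node1.node.consul\" \"ttl=720h\" }}{{ .Data.certificate }}{{ end }}"
  destination = "/etc/consul.d/tls/consul.crt"
  command     = "systemctl reload consul"
}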

I have had my Vault cluster crash and burn previously, and I was just getting it back up and running. After having it run smoothly for about a week, this one crashed and burned as well.

The original failure was around gRPC sync errors. This time, the error I get is: Error initializing storage of type raft: cannot allocate memory

Looking at top to see system resource consumption, I never saw much RAM being consumed, but the 4 CPU cores were getting hit pretty hard.

As of this moment, I can’t get any of the Vault nodes running, which also means the Nomad and Consul nodes aren’t really able to do much either as they rely on Vault for fresh certs for the agents themselves and the needs of Nomad jobs.

One consistent thing I’ve noticed between the previous broken cluster and this one is that vault.db is ~1GB in size. I’m not well versed in everything that gets stored in the vault.db file, but I hadn’t migrated any secrets back into the new setup. The only thing this Vault service was set up to do in its short life was generate the aforementioned TLS certs. It’s possible that a lot of unrevoked, 24hr certs for the various agents and jobs are what is filling up the vault.db file, but other than that I have no idea what would make it that large. And I only had 6 small services running, which have had no issue running independently of the HashiStack on these rpis.

My ulimit -l value is unlimited, so I don’t think the problem is a lack of available locked memory, setting aside the fact that the Raspberry Pi 4s only have 4GB of RAM, which is the stated minimum for running Vault.
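For what it’s worth, this is how I’ve been checking the locked-memory situation; the setcap line is only what I’d run if the capability were missing, which it isn’t here:

# Locked-memory limit for the user running Vault
ulimit -l

# Confirm the vault binary has the IPC_LOCK capability
getcap $(readlink -f $(which vault))

# If it were missing, this would grant it (not needed in my case)
sudo setcap cap_ipc_lock=+ep $(readlink -f $(which vault))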

What’s unclear is whether that minimum goes up when using local raft storage rather than Consul, where those burdens might otherwise be shifted to the Consul cluster.

Without getting too far ahead of myself in terms of what could be happening here, would using an external storage backend like DynamoDB alleviate some of the RAM requirements for the node?
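i.e. swapping the raft stanza for something roughly like this; the table name and region are just placeholders:

# Hypothetical storage stanza; table and region are placeholders,
# and AWS credentials would come from the environment
storage "dynamodb" {
  ha_enabled = "true"
  region     = "us-east-1"
  table      = "vault-data"
}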

Any thoughts on what I could try to get the cluster back up and running?

Thanks in advance,
Sam

Hi dehuszar,

Cool setup, I’m doing something similar at home, though I’m still in the process of incorporating Vault.

I haven’t played with raft integrated storage too much yet, but I understand that the boltdb files used for local storage can grow pretty large even without there being much data stored in Vault. You can see what’s actually there by enabling sys/raw and doing list/read queries against it.
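Enabling it is just a server config option plus a restart, something like:

# In the Vault server config; requires a restart, and the endpoint is root-protected
raw_storage_endpoint = true

# Then you can poke at storage directly, e.g. list the per-mount storage prefixes
vault list sys/raw/logical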

The boltdb file is mmap’d, so conceivably that could be related to your “cannot allocate memory” error, though if it’s a 1GB file and you have 4GB, that seems like a stretch.

Yes, an external storage engine would no doubt reduce your memory usage. What you’re doing isn’t so outlandish though, so if you’re willing I’d be happy to work with you to try to figure out what’s going wrong with raft storage. You may have to re-initialize your cluster if you can’t get it running as is though. Once you have a working cluster again, we could use the pprof endpoint to look at where your memory is being used over time.
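For the pprof piece, roughly something like this against the active node; the filename is just an example, and the token needs sudo capability on sys/pprof:

# Grab a heap profile from the running node
curl -sS --header "X-Vault-Token: $VAULT_TOKEN" \
     "$VAULT_ADDR/v1/sys/pprof/heap" -o heap.prof

# Inspect it with the Go tooling
go tool pprof -top heap.prof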

If you’re not already taking regular raft snapshots, that would probably be a good idea too.
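e.g. something like this, from any machine that can reach the cluster:

# Save a snapshot of the integrated storage data; needs a token with
# access to sys/storage/raft/snapshot
vault operator raft snapshot save vault-$(date +%F).snap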

I would be happy to have any assistance. Accessing sys/raw seems like something I could do in recovery mode. Is that my first step?

What info would be helpful to post here?

sys/raw is automatically enabled for you in recovery mode, yes, but it sounds like you can’t actually start Vault at present, and I don’t imagine that would change with recovery mode. Can you copy the data to another system with more memory and start Vault there?
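Something along these lines should do it; the paths are just examples, adjust to wherever your raft storage path and config actually live:

# With vault stopped on the pi, copy the raft data dir and config over
sudo systemctl stop vault
rsync -a /opt/vault/data/ you@laptop:/tmp/vault-data/
rsync -a /etc/vault.d/ you@laptop:/tmp/vault.d/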

Can do. What’s my first step once I’m in recovery mode? I’ve done this once before, a month or so ago… I could see a bunch of entries, but I wasn’t sure how to make sense of them; only the serial numbers of stored items were shown.

Well you don’t really need recovery mode. The main purpose of recovery mode is so that if Vault won’t start, you can delete or edit problematic storage entries to fix that, like people have done in the past in Consul directly when using the Consul storage backend. In your case, once you bring it up elsewhere with more memory, it should come up normally, regardless of whether or not you’re in recovery mode.

So I suggest you start without recovery mode, adding the config option to enable sys/raw (link in my first post). That should enable you to take a raft snapshot. You could then try moving aside your existing data dir on your pi, starting a new Vault instance, and restoring the snapshot. Conceivably what’s happening is that garbage has accumulated in the boltdb file, making it too big to memory-map, so starting from a clean slate and restoring the snapshot may allow it to start up normally.
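Roughly, the flow I have in mind is below; the data path is just an example (use whatever your storage "raft" stanza points at), and note that restoring a snapshot into a freshly initialized cluster needs -force, after which you unseal with the keys from the original cluster:

# 1. On the bigger machine, with sys/raw enabled and Vault unsealed, save a snapshot
vault operator raft snapshot save before-cleanup.snap

# 2. On the pi, stop Vault and move the old data dir aside (path is an example)
sudo systemctl stop vault
sudo mv /opt/vault/data /opt/vault/data.old
sudo mkdir /opt/vault/data && sudo chown vault: /opt/vault/data

# 3. Start Vault fresh, init and unseal with the new keys, then restore
sudo systemctl start vault
vault operator init
vault operator unseal
vault operator raft snapshot restore -force before-cleanup.snap
# after the restore, unseal again with the keys from the original cluster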

As to exploring the data with sys/raw: we want to figure out what live data you have. It’s almost certainly going to live within your mounts and/or within the expiration system. So I would try e.g.

# find out how many entries are in the lease expiration store
vault list sys/raw/sys/expire/id

# get the logical/<uuid> storage prefix of your pki mount
vault read -format=json sys/raw/core/mounts | jq

# see where the pki data is and how much of it there is
vault list sys/raw/logical/yourPKImountUUID/revoked
vault list sys/raw/logical/yourPKImountUUID/crl
vault list sys/raw/logical/yourPKImountUUID/certs
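If it does turn out to be mostly expired or revoked cert storage, the PKI tidy endpoint can clean a lot of that up once you have a working Vault again; roughly:

# Ask the pki backend to clean up expired certs and revocation entries
# (the mount path "pki" and the safety_buffer value are just examples)
vault write pki/tidy tidy_cert_store=true tidy_revoked_certs=true safety_buffer=72h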

I have copied the Vault data and config out to my laptop and am trying to adjust the config so that it can run locally. The node still wants to point to the original IP no matter what VAULT_… environment variables I set or -address flags I add to the commands.

Any advice on how to reset the agent’s expectations about which nodes are out there, and what its current address is?
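For reference, this is roughly the config I’m trying on the laptop; the paths and addresses are adjusted for local use, and TLS is off just for this test:

# Local test config on the laptop (paths/addresses adjusted, TLS disabled for the test)
storage "raft" {
  path    = "/tmp/vault-data"
  node_id = "pi-node-1"
}

listener "tcp" {
  address     = "127.0.0.1:8200"
  tls_disable = true
}

api_addr     = "http://127.0.0.1:8200"
cluster_addr = "https://127.0.0.1:8201"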

Hmm, I haven’t played with Raft enough to run into this problem, though I’m not terribly surprised based on other things I’ve seen in the code. Does it fail to start up? What’s the actual error?

No, it starts fine; it’s just trying to reach out to the other nodes from the cluster, and when I use the CLI commands, it attempts to talk to the previous IP of the node whose data I am using. Address overrides won’t change the IP address it tries to resolve, and I get a permission denied error against that IP, which doesn’t have an active Vault agent running.

If I try to add the disable_clustering attribute to my config, then I get a segfault:

panic: runtime error: invalid memory address or nil pointer dereference

Maybe this is something for an issue on GitHub. :wink:

That panic should be fixed in 1.4.1, see https://github.com/hashicorp/vault/pull/8784.

I am using 1.4.1, sadly.

You’re right. Taking it over here if anyone else wants to follow along. I’m probably going to just make backups of all the raft databases and re-init using the DynamoDB storage backend. But it would be cool to learn how to get in there and root around. I’m sure it’s a critical skill to have once I have nodes in production.
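For the backups, I’m just planning to stop Vault on each node and archive the raft data dir, roughly:

# On each pi: stop Vault and archive the raft data dir before re-initializing
# (the data path is from my config; adjust as needed)
sudo systemctl stop vault
sudo tar czf ~/vault-raft-$(hostname)-$(date +%F).tar.gz /opt/vault/data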