Vault HA cluster with Raft storage outage on ARMhf :: Out of memory

I’ve been setting up a HashiCorp stack on a cluster of 9 Raspberry Pis. Each node runs one of Vault, Consul, or Nomad in server mode, with the other two products in client mode. All of them use Consul Template to issue Nomad and Consul TLS certs from Vault. Vault hasn’t been stable enough to switch to short-lived certs, so they are long-lived for the moment.

I have had my Vault cluster crash and burn previously, and I had just gotten it back up and running. After it ran smoothly for about a week, this one crashed and burned as well.

The original errors I got were around gRPC sync failures. This time, the error I get is: “Error initializing storage of type raft: cannot allocate memory”

Looking at top to see system resource consumption, I never saw much RAM being consumed, but the 4 CPU cores were getting hit pretty hard.

As of this moment, I can’t get any of the Vault nodes running, which also means the Nomad and Consul nodes aren’t really able to do much either, as they rely on Vault for fresh certs, both for the agents themselves and for the needs of Nomad jobs.

One consistent thing I’ve noticed between the previous broken cluster and this one is that the vault.db file is ~1GB in size. I’m not well versed in what gets stored in the vault.db file, but I hadn’t migrated any secrets back into the new setup. The only thing this Vault service was set up to do in its short life was generate the aforementioned TLS certs. It’s possible that a lot of unrevoked, 24-hour certs for the various agents and jobs are what is filling up the vault.db file, but other than that, I have no idea what would make it that large. And I only had 6 small services running, which have had no issue running independently of the HashiStack on these Raspberry Pis.
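If unrevoked certs are the culprit, it occurs to me that the PKI engine’s tidy operation might prune them; something like this (the “pki” mount path and the 72h safety buffer are assumptions on my part — adjust to the actual mount and the longest cert lifetime):

```shell
# Prune expired certs and stale revocation entries from the PKI store.
# "pki" is an assumed mount path; safety_buffer should exceed the
# longest-lived cert you issue.
vault write pki/tidy \
    tidy_cert_store=true \
    tidy_revoked_certs=true \
    safety_buffer=72h
```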

My ulimit -l value is unlimited, so I don’t think it’s a lack of available locked memory, with the exception that the Raspberry Pi 4s only have 4GB of RAM, which is the stated minimum for running Vault.

What’s unclear is whether that minimum goes up when using local raft storage rather than Consul, where those burdens might otherwise be shifted to the Consul cluster.

Without getting too far ahead of myself in terms of what could be happening here, would using an external storage plugin like DynamoDB alleviate some of the RAM requirements for the node?

Any thoughts on what I could try to get the cluster back up and running?

Thanks in advance,

Hi dehuszar,

Cool setup, I’m doing something similar at home, though I’m still in the process of incorporating Vault.

I haven’t played with raft integrated storage too much yet, but I understand that the boltdb files used for local storage can grow pretty large even without there being much data stored in Vault. You can see what’s actually there by enabling sys/raw and doing list/read queries against it.

The boltdb file is mmap’d, so conceivably that could be related to your “cannot allocate memory” error, though if it’s a 1GB file and you have 4GB that seems like a stretch.
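If you want to see how much of that 1GB is live data versus accumulated free pages, the bbolt CLI that ships with the bbolt library can report stats on the file while Vault is stopped. A sketch, assuming a Go toolchain is available and guessing at your data path:

```shell
# Install the bbolt inspection CLI (part of the bbolt project)
go install go.etcd.io/bbolt/cmd/bbolt@latest

# With Vault stopped (bbolt needs an exclusive lock on the file),
# report page usage: the in-use vs. free page counts show how much
# of the file is garbage.
bbolt stats /opt/vault/data/vault.db
</imports>
```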

Yes, an external storage engine would no doubt reduce your memory usage. What you’re doing isn’t so outlandish though, so if you’re willing I’d be happy to work with you to try to figure out what’s going wrong with raft storage. You may have to re-initialize your cluster if you can’t get it running as is though. Once you have a working cluster again, we could use the pprof endpoint to look at where your memory is being used over time.
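For reference, the pprof endpoints live under sys/pprof and require a token with sudo capability on that path. A sketch of pulling a heap profile once the cluster is back (the env vars and file names here are just placeholders):

```shell
# Grab a heap profile from the active node; the token needs
# sudo capability on sys/pprof
curl --header "X-Vault-Token: $VAULT_TOKEN" \
    "$VAULT_ADDR/v1/sys/pprof/heap" > heap.prof

# Inspect the top memory consumers with the Go toolchain
go tool pprof -top heap.prof
```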

If you’re not already taking regular raft snapshots, that would probably be a good idea too.
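Taking a snapshot is a one-liner against the leader, so something like this in a cron job would do (the output path is an assumption):

```shell
# Save a point-in-time snapshot of the raft data from the active node
vault operator raft snapshot save /backups/vault-$(date +%F).snap
```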


I would be happy to have any assistance. Accessing sys/raw seems like something I could do in recovery mode. Is that my first step?

What info would be helpful to post here?

sys/raw is automatically enabled for you in recovery mode, yes, but it sounds like you can’t actually start Vault at present, and I don’t imagine that would change with recovery mode. Can you copy the data to another system with more memory and start Vault there?

Can do. What’s my first step once I’m in recovery mode? I’ve done this once before, a month or so ago… I could see a bunch of entries, but I wasn’t sure how to understand what they were. Only serial numbers of items stored.

Well you don’t really need recovery mode. The main purpose of recovery mode is so that if Vault won’t start, you can delete or edit problematic storage entries to fix that, like people have done in the past in Consul directly when using the Consul storage backend. In your case, once you bring it up elsewhere with more memory, it should come up normally, regardless of whether or not you’re in recovery mode.

So I suggest you start without recovery mode, adding the config option to enable sys/raw (link in my first post). That should enable you to take a raft snapshot. You could then try moving aside your existing data dir on your pi, starting a new Vault instance, and restoring the snapshot. Conceivably what’s happening is that garbage has accumulated in the boltdb file, making it too big to memory-map, so starting from a clean slate and restoring the snapshot may allow it to start up normally.
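The move-aside-and-restore flow I have in mind looks roughly like this (the service name and data path are assumptions; and if the snapshot comes from a cluster with different unseal keys, restore needs -force and you’d then unseal with the original cluster’s keys):

```shell
# Stop Vault and move the bloated data directory out of the way
sudo systemctl stop vault
sudo mv /opt/vault/data /opt/vault/data.old

# Start fresh, then initialize and unseal the new instance
sudo systemctl start vault
vault operator init
vault operator unseal

# Restore the snapshot taken from the old cluster; -force is needed
# when the snapshot's unseal keys differ from the new cluster's
vault operator raft snapshot restore -force old-cluster.snap
```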

As to exploring the data with sys/raw: we want to figure out what live data you have. It’s almost certainly going to live within your mounts and/or within the expiration system. So I would try e.g.

# find out how many entries are in the lease expiration store
vault list sys/raw/sys/expire/id

# get the logical/uuid mount path of your PKI engine
vault read -format=json sys/raw/core/mounts | jq

# see where the PKI data is and how much of it there is
vault list sys/raw/logical/yourPKImountUUID/revoked
vault list sys/raw/logical/yourPKImountUUID/crl
vault list sys/raw/logical/yourPKImountUUID/certs

I have copied the Vault data and config out to my laptop and am trying to adjust the config so that it can run locally. The node still wants to point to the original IP no matter what VAULT_… vars I set or -address flags I add to the commands.

Any advice on how to reset the agent’s expectations about which nodes are out there, and what its current address is?
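From what I can tell reading the docs, raft stores the peer addresses in the data itself, and the documented way to override them is a peers.json file dropped in the raft subdirectory of the data dir, which Vault ingests on startup to reset the peer set. A sketch (the node ID and address are placeholders for my actual values):

```json
[
  {
    "id": "raft-node-1",
    "address": "127.0.0.1:8201",
    "non_voter": false
  }
]
```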

Hmm, I haven’t played with Raft enough to run into this problem, though I’m not terribly surprised based on other things I’ve seen in the code. Does it fail to start up? What’s the actual error?

No, it starts fine; it’s just trying to reach out to the other nodes from the cluster, and when I use the CLI commands, it attempts to talk to the previous IP of the node whose data I am using. Address overrides won’t change the IP address it tries to resolve, and I get a permission denied error against that IP, which doesn’t have an active Vault agent running.

If I add the disable_clustering attribute to my config, I get a segfault:

panic: runtime error: invalid memory address or nil pointer dereference

Maybe this is something for an issue on github. :wink:


That panic should be fixed in 1.4.1, see

I am using 1.4.1, sadly.

You’re right. Taking it over here if anyone else wants to follow along. I’m probably going to just make backups of all the raft databases and re-init using the DynamoDB storage backend. But it would be cool to learn how to get in there and root around; I’m sure it’s a critical skill to have once I have nodes in production.