I may have misunderstood how recovery works with raft storage.
Here is my setup:
- vault version: 1.4.0 (also tried with 1.4.1)
- 3 node HA vault cluster, all nodes unsealed
- storage is raft integrated storage (recently migrated from etcd)
- several snapshots have been taken
Issue: if I reboot all nodes simultaneously, I can’t get my cluster back into a working state. The only options available to each node are to create a new raft cluster or to join an existing one.
Initialising a new raft cluster creates new keys, so it seems I can’t use my snapshots. There is also no existing raft cluster to join, since I intentionally took it down.
How does someone recover from this kind of failure?
An update… So the issue I had wasn’t raft related, but rather docker-swarm related. I had been using local volume mounts in my docker-compose file, which are not persisted when the stack is taken down. My fix was to use a bind mount from the host machine to the container with the correct uid/gid owner on that folder.
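For anyone hitting the same thing, here is a rough sketch of preparing the host folder for the bind mount. The host path, the container path /vault/data, and the uid/gid 100:1000 are assumptions on my part (check what user your image actually runs as), not values from my real setup.

```shell
#!/bin/sh
set -e

# All four values are hypothetical defaults; override them for your setup.
HOST_DIR="${HOST_DIR:-$PWD/vault-data}"         # host folder that survives stack removal
CONTAINER_DIR="${CONTAINER_DIR:-/vault/data}"   # assumed raft path inside the container
VAULT_UID="${VAULT_UID:-100}"                   # assumed uid of the vault user in the image
VAULT_GID="${VAULT_GID:-1000}"                  # assumed gid of the vault group in the image

# Create the host folder that will back the bind mount.
mkdir -p "$HOST_DIR"

# The container user must own the folder; chown needs root.
if [ "$(id -u)" -eq 0 ]; then
  chown "$VAULT_UID:$VAULT_GID" "$HOST_DIR"
else
  echo "not root: run 'sudo chown $VAULT_UID:$VAULT_GID $HOST_DIR' yourself" >&2
fi

# Value to put under 'volumes:' for the vault service in the compose file.
VOLUME_SPEC="$HOST_DIR:$CONTAINER_DIR"
echo "compose volumes entry: $VOLUME_SPEC"
```

The printed host:container pair goes in the service's volumes list in place of the named local volume.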
After I got this working, I tested a number of DR scenarios. It seems the easiest way to recover from a full-scale cluster failure is by seeding a new cluster using the output of a vault operator migrate run.
A bit more information, since it took me a while to figure this out from various github issues and documents. (main github thread is here: https://github.com/hashicorp/vault/issues/5683)
As of Vault 1.4.1, the only way to get a consistent backup of vault (using raft integrated storage) is via the migrate command. My steps are:
- Take down a vault node in the cluster.
- Run the vault operator migrate command.
  - Note: if using source=raft, destination=s3, the s3 backup is uncompressed.
- Bring up the vault node and unseal it.
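The migrate step above could look roughly like the sketch below. The raft path, bucket name, and region are placeholders I made up, and I am assuming AWS credentials come from the environment. The script only prints the vault command unless DRY_RUN=0 is set, and stopping and restarting the node is left out because it depends on how you deploy.

```shell
#!/bin/sh
set -e

DRY_RUN="${DRY_RUN:-1}"                          # 1 = only print commands (default)
MIGRATE_CONFIG="${MIGRATE_CONFIG:-./migrate.hcl}"
RAFT_PATH="${RAFT_PATH:-/vault/data}"            # hypothetical raft data directory
S3_BUCKET="${S3_BUCKET:-my-vault-backup}"        # hypothetical bucket name
S3_REGION="${S3_REGION:-us-east-1}"              # hypothetical region

# Write the migration config: raft as the source, s3 as the destination.
cat > "$MIGRATE_CONFIG" <<EOF
storage_source "raft" {
  path = "$RAFT_PATH"
}

storage_destination "s3" {
  bucket = "$S3_BUCKET"
  region = "$S3_REGION"
}
EOF

# Print the command in dry-run mode, execute it otherwise.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "would run: $*"
  else
    "$@"
  fi
}

# The vault node on this machine must already be stopped at this point.
run vault operator migrate -config="$MIGRATE_CONFIG"
```

Swapping the two stanzas (s3 as source, another backend as destination) would be the starting point for seeding a new cluster from the bucket.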
Using s3 has the nice bonus that you can point a new vault server directly to the s3 bucket you used for your backup. You could also migrate from the s3 bucket to another storage destination and seed a new cluster.
You can’t seed a new cluster using the snapshots from the vault operator raft snapshot save command. But I think you could use snapshot in combination with migrate, i.e., an occasional migrate (e.g., whenever you rekey your cluster), followed by regular snapshots, which don’t require taking a node down.
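The regular snapshot half of that idea could be a small script run from cron, something like the sketch below. The snapshot directory and file naming are made up, and it assumes VAULT_ADDR and VAULT_TOKEN are already set in the environment. As written it only prints the command unless DRY_RUN=0.

```shell
#!/bin/sh
set -e

DRY_RUN="${DRY_RUN:-1}"                      # 1 = only print commands (default)
SNAPSHOT_DIR="${SNAPSHOT_DIR:-./snapshots}"  # hypothetical output directory

# Timestamped file name so successive snapshots do not overwrite each other.
STAMP="$(date -u +%Y%m%dT%H%M%SZ)"
SNAPSHOT_FILE="$SNAPSHOT_DIR/vault-raft-$STAMP.snap"

mkdir -p "$SNAPSHOT_DIR"

# Print the command in dry-run mode, execute it otherwise.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "would run: $*"
  else
    "$@"
  fi
}

# Runs against a live, unsealed cluster; no node has to be taken down.
run vault operator raft snapshot save "$SNAPSHOT_FILE"
```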
(Please correct me if anything I’ve said is wrong or not sensible).
You’ll force the snapshot to restore into the new cluster, then you can use your existing unseal keys. This might help someone who comes across this in the future: Backup - Restore
I got round to testing this today (using a snapshot from Vault 1.4.1 and a new cluster using Vault 1.6.1). My steps were:
- install a fresh copy of vault on a new machine (I used HA raft storage).
- vault operator init (with the same number of key-shares/key-threshold as the snapshot).
- vault operator unseal, to fully unseal vault using the newly generated keys.
- vault login <root-token>, using the newly generated token.
- vault operator raft snapshot restore --force mysnapshot.tar.gz
- vault operator unseal, to fully unseal vault using the snapshot’s keys.
- vault login, using any previous method that worked on the old vault.
This worked fine. The key for me was to ensure I did the vault operator init with a matching key structure.
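Here are the restore steps above as one sketch. The 5/3 key-share values are placeholders (use whatever the old cluster had). The unseal and login commands prompt interactively, so this is more of a checklist than an unattended script, and by default it only prints each command (set DRY_RUN=0 to actually run them).

```shell
#!/bin/sh
set -e

DRY_RUN="${DRY_RUN:-1}"                            # 1 = only print commands (default)
KEY_SHARES="${KEY_SHARES:-5}"                      # placeholder: must match the old cluster
KEY_THRESHOLD="${KEY_THRESHOLD:-3}"                # placeholder: must match the old cluster
SNAPSHOT_FILE="${SNAPSHOT_FILE:-mysnapshot.tar.gz}"

# Print the command in dry-run mode, execute it otherwise.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "would run: $*"
  else
    "$@"
  fi
}

# Call 'vault operator unseal' once per required key share.
# Each call prompts for one unseal key.
unseal_fully() {
  i=0
  while [ "$i" -lt "$KEY_THRESHOLD" ]; do
    run vault operator unseal
    i=$((i + 1))
  done
}

restore_from_snapshot() {
  # Fresh cluster with the same key structure as the snapshot's cluster.
  run vault operator init -key-shares="$KEY_SHARES" -key-threshold="$KEY_THRESHOLD"
  unseal_fully      # enter the NEW unseal keys printed by init
  run vault login   # enter the NEW root token when prompted
  run vault operator raft snapshot restore -force "$SNAPSHOT_FILE"
  unseal_fully      # enter the OLD unseal keys that belong to the snapshot
  run vault login   # enter a token that was valid on the old cluster
}

restore_from_snapshot
```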