Restoring raft snapshot on test cluster broke production cluster

Hi all,

Last Friday I ran into a weird, somewhat understandable, but unfortunate situation: I broke the production Vault while importing a raft snapshot of the production Vault into a Vault test setup.

I am not entirely sure what happened, but I will try to explain, and I hope somebody can confirm or clarify whether what happened is expected and intended.

I have a production Vault cluster with 3 members.

I provisioned a fresh Vault cluster in the same network, also with 3 members.

Maybe I should have initialized just 1 node and imported the raft snapshot there; however, I started with a fresh cluster of 3 nodes.

Both clusters live in the same network, i.e. there is no firewall or routing blocking communication between the 2 clusters.

After the production Vault raft snapshot was imported into the Vault test setup, I noticed the old production raft membership configuration was still visible in

$ vault operator raft list-peers
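
The output looked roughly like this, still listing the production members (node IDs and addresses here are placeholders):

Node            Address           State       Voter
----            -------           -----       -----
vault-prod-1    10.0.1.11:8201    leader      true
vault-prod-2    10.0.1.12:8201    follower    true
vault-prod-3    10.0.1.13:8201    follower    true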

I deleted the respective members and rejoined the new test nodes; after that, the raft membership was re-formed, showing the IPs of the test nodes.
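
Concretely, that was done with the standard raft operator commands, roughly as follows (node IDs and addresses are placeholders):

$ vault operator raft remove-peer vault-prod-1
$ vault operator raft remove-peer vault-prod-2
$ vault operator raft remove-peer vault-prod-3

and then, on each test node that had to rejoin:

$ vault operator raft join https://vault-test-1.example.com:8200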

Unfortunately, when I arrived at work the next morning, it seemed the production cluster was in a broken state.

Some of the production follower nodes could no longer talk with the leader and were logging errors along the lines of raft communication not being allowed for this node ID, with a different node ID being expected.

I know that the TLS setup used for raft is only used during bootstrapping of the raft cluster; after that, there should be an internal re-keying exchange handled by raft itself.

It feels like there was a re-exchange of keys with the production Vault system after the raft import on the test cluster, and that this broke the raft cluster state of the production cluster.

I have no complete explanation for this, but something like this must have happened.

Can anybody confirm or enlighten me a bit more?

Thanks!

Your question interested me, as I’m likely to be running Raft in production and restoring test snapshots myself in the future.

I played around in a test environment trying to replicate what happened, but wasn’t able to do so.

However, it did remind me of some odd behaviour I have observed in the past:

The active node of a Vault cluster is recorded in data that is part of the Vault storage.

As a result, if you restore a backup into a new, functioning cluster that is still able to contact the previous cluster’s active node, I have observed the newly restored cluster erroneously forwarding requests back to the original cluster!

I am uncertain if this is actually what happened in your scenario, but it’s one possibility to consider.
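
One way to check for this is to ask a node of the restored cluster which leader it currently believes in, e.g. via the sys/leader endpoint:

$ vault read sys/leader

If leader_address / leader_cluster_address point at nodes of the original cluster rather than the restored one, you are seeing the behaviour described above.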


After a bit more experimentation, I did manage to reproduce the case I talked about above … but only for 15 seconds or so. As soon as the Raft heartbeat timer expires, a new election is triggered and moves the active node information to point to the right cluster. Still, this might explain what you saw, if it occurred just a few seconds after the snapshot restore.

Thanks for your reply and thanks for looking into it!

So I guess the best way to reproduce it is:

Create two HA clusters

Create a snapshot of the first cluster

Import the snapshot on the second cluster and check whether it breaks the cluster state of the first cluster.
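
As a rough command-level sketch of those steps (the snapshot file name is a placeholder, and depending on the seal configuration the restore may need the -force flag and the first cluster’s unseal/recovery keys):

On the active node of the first cluster:

$ vault operator raft snapshot save cluster1.snap

On the active node of the second cluster:

$ vault operator raft snapshot restore cluster1.snap

Then check the raft membership on the second cluster and the health of the first cluster:

$ vault operator raft list-peers
$ vault status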

Will see if I can find a moment to try to reproduce it as well.