Hi all,
Last Friday I ran into a weird, sort of understandable, but unfortunate situation: I broke the production vault while importing a raft snapshot of it into a vault test setup.
I am not entirely sure what happened, but I will try to explain, and I hope somebody can confirm or clarify whether what happened is expected and intended.
I have a production vault with 3 members.
I provisioned a fresh vault in the same network also with 3 members.
Maybe I should have initialized just 1 node and imported the raft snapshot there; however, I started with a fresh cluster of 3 nodes.
Both clusters live in the same network, i.e. there is no firewall or routing blocking communication between the 2 clusters.
After the production vault raft snapshot was imported into the vault test setup, I noticed the old production raft membership configuration was visible in
$ vault operator raft list-peers
I removed the respective members and rejoined the new test nodes.
After that, the raft membership was re-formed and displayed the IPs of the test nodes.
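In command form, the steps described above would look roughly like the sketch below. The snapshot file name, node IDs and leader address are placeholders invented for illustration, not values from my environment, and the use of -force on the restore is an assumption (it is normally required when the snapshot comes from a different cluster). The script only prints the commands unless RUN=1 is set.

```shell
#!/bin/sh
# Placeholders (hypothetical values, for illustration only).
SNAPSHOT="prod.snap"
OLD_PEERS="prod-node-1 prod-node-2 prod-node-3"
TEST_LEADER="https://test-node-1:8200"

# Print each command; execute it only when RUN=1 is set.
run() {
  echo "+ $*"
  if [ "${RUN:-0}" = "1" ]; then "$@"; fi
}

# On the test leader: restore the production snapshot.
# -force skips the check that the unseal keys match the snapshot.
run vault operator raft snapshot restore -force "$SNAPSHOT"

# The restored data still lists the production peers.
run vault operator raft list-peers

# Remove the stale production members by node ID.
for id in $OLD_PEERS; do
  run vault operator raft remove-peer "$id"
done

# On each remaining test node: join the test leader.
run vault operator raft join "$TEST_LEADER"
```

Running it without RUN=1 gives a dry run, which makes it easy to double check the node IDs and target address before anything touches a cluster.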
Unfortunately, when I arrived at work the next morning, the production cluster appeared to be in a broken state.
Some of the production follower nodes could no longer talk to the leader and were reporting errors about raft communication not being allowed for this node ID, with another one expected.
As far as I know, the TLS setup used for raft is only used during bootstrapping of the raft cluster. After that, raft should handle the rekeying exchange internally.
It feels like the test cluster re-exchanged keys with the production vault system after the raft import, and this broke the raft state of the production cluster.
I have no complete explanation for this, but something like this must have happened.
Can anybody confirm this or enlighten me a bit more?
Thanks!