Concerning Inconsistency In Raft and Vault Storage Among All Nodes

In production I run a 5-node Vault cluster in Kubernetes using Vault’s Raft integrated storage. Recently I ran into leader fluctuation, and out of nowhere I see that one of the 5 nodes has hardly any data in its raft.db and vault.db. I have been trying to understand the implications of that.

In a test cluster, I have recreated this behavior: the original leader has more data in its vault.db and raft.db than the followers. What’s confusing is that when I step that leader down and a follower with less data becomes leader, I still see all of the data that the original leader had in the Vault UI. Yet I can see very clearly on the node that its files hold less data.

Questions:

  1. How can followers have less data than the leader in vault.db yet still serve fully accurate data to the Web UI when they become leader? I.e., it seems like they would only have a fraction of the data, given that their vault.db is a fraction of the size of the original leader’s.
  2. Is the expectation that the leader and all followers have the same amount of data in both vault.db and raft.db?

Thanks for any help you can provide-- this is truly perplexing and I can’t tell if it’s a critical production issue.

Hi! I saw you posted this on Gitter too. Interesting question. Hashi support can probably provide a more thorough answer, but I wonder whether what you’re seeing is just Raft’s efforts to efficiently use storage space:

Obviously, it would be undesirable to allow a replicated log to grow in an unbounded fashion. Raft provides a mechanism by which the current state is snapshotted and the log is compacted.

From: Integrated Storage | Vault | HashiCorp Developer

I don’t pretend to understand all the workings of Raft, though!
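
To make the snapshot-and-compact idea from that quote concrete, here is a minimal toy sketch in Go. It is not Vault's or the Raft library's actual code; the `node`, `entry`, `apply`, and `compact` names are hypothetical and exist only for illustration. It assumes that a snapshot captures the applied state, so the log entries it covers can be dropped, and it shows two nodes ending up with very different log lengths while holding identical applied state.

```go
package main

import "fmt"

// entry is one replicated write in a toy Raft-style log (hypothetical type).
type entry struct {
	index uint64
	key   string
	value string
}

// node is a toy model of one cluster member: the applied key/value state
// plus whatever log entries it still retains.
type node struct {
	state         map[string]string // applied state
	log           []entry           // retained log entries
	snapshotIndex uint64            // last log index covered by a snapshot
}

func newNode() *node {
	return &node{state: map[string]string{}}
}

// apply appends an entry to the log and applies it to the state.
func (n *node) apply(e entry) {
	n.log = append(n.log, e)
	n.state[e.key] = e.value
}

// compact models a snapshot: the state already reflects every retained
// entry, so the log can be dropped up to its last index.
func (n *node) compact() {
	if len(n.log) == 0 {
		return
	}
	n.snapshotIndex = n.log[len(n.log)-1].index
	n.log = nil
}

// sameState reports whether two nodes hold identical applied state.
func sameState(a, b *node) bool {
	if len(a.state) != len(b.state) {
		return false
	}
	for k, v := range a.state {
		if w, ok := b.state[k]; !ok || w != v {
			return false
		}
	}
	return true
}

func main() {
	leader, follower := newNode(), newNode()
	for i := uint64(1); i <= 1000; i++ {
		e := entry{index: i, key: fmt.Sprintf("secret/%d", i%10), value: fmt.Sprint(i)}
		leader.apply(e)
		follower.apply(e)
	}
	follower.compact() // in this toy run, only the follower has snapshotted

	fmt.Println("leader log entries:  ", len(leader.log))
	fmt.Println("follower log entries:", len(follower.log))
	fmt.Println("same applied state:  ", sameState(leader, follower))
}
```

In this model the amount of retained log says nothing about whether a node has the full data set, which would be consistent with a smaller follower still serving everything after it becomes leader.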

jlj7 might be right, but at a guess, what you’re seeing is due to the fact that the Raft data is stored in BoltDB data files, which are prone to containing a lot of “garbage”. That is, a 100MB bolt file might contain only 10MB of active data and 90MB of freed space that may be overwritten in the future. BoltDB doesn’t aggressively return that space to the filesystem because doing so would require expensive disk operations; better to waste a bit of disk space than to slow everything down. So file size alone doesn’t tell you how much live data a node holds.
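
Here is a rough toy model in Go of why a file like that never shrinks. This is not BoltDB's real implementation; `pageFile`, `put`, `del`, and the one-key-per-page layout are simplifying assumptions made up for illustration. The only behavior it tries to capture is that freed pages go onto a free list and get reused, while the file itself keeps its high-water-mark size.

```go
package main

import "fmt"

// pageSize is an assumed page size for the toy model.
const pageSize = 4096

// pageFile is a toy model of a page-based data file that reuses freed
// pages but never shrinks (hypothetical type, one key per page).
type pageFile struct {
	pages int            // total pages ever allocated in the file
	free  []int          // page IDs available for reuse
	keys  map[string]int // key -> page ID holding its value
}

func newPageFile() *pageFile {
	return &pageFile{keys: map[string]int{}}
}

// put stores a key in one page, reusing a free page when one exists and
// growing the file only when the free list is empty.
func (f *pageFile) put(key string) {
	if _, ok := f.keys[key]; ok {
		return // key already stored; nothing to allocate
	}
	var id int
	if n := len(f.free); n > 0 {
		id = f.free[n-1]
		f.free = f.free[:n-1]
	} else {
		id = f.pages
		f.pages++
	}
	f.keys[key] = id
}

// del moves the key's page to the free list; the file keeps its size.
func (f *pageFile) del(key string) {
	if id, ok := f.keys[key]; ok {
		delete(f.keys, key)
		f.free = append(f.free, id)
	}
}

// fileBytes is the on-disk size; liveBytes is the space holding live keys.
func (f *pageFile) fileBytes() int { return f.pages * pageSize }
func (f *pageFile) liveBytes() int { return len(f.keys) * pageSize }

func main() {
	f := newPageFile()
	for i := 0; i < 100; i++ {
		f.put(fmt.Sprintf("k%d", i))
	}
	for i := 0; i < 90; i++ {
		f.del(fmt.Sprintf("k%d", i))
	}
	fmt.Println("file bytes:", f.fileBytes(), "live bytes:", f.liveBytes())

	f.put("new") // lands in a freed page, so the file does not grow
	fmt.Println("file bytes after reuse:", f.fileBytes())
}
```

Under these assumptions, a node that has seen more churn (for example, a long-lived leader) ends up with a larger file than a node that was rebuilt more recently, even when both hold the same live data.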
