Attempt to modify Vault to run multi-master spoiled by internal sync.Maps

Hi Friends!

My colleagues made an interesting attempt to run Vault (the regular open-source version) in multi-master mode. They added small amendments so that whenever something is written to or removed from the internal cache or policy_cache, requests are sent to all other instances to erase the corresponding key from their caches too. For example, if something is updated on one instance, any cached values for that piece of data are removed on the other instances, and a later read on them fetches the data from the underlying storage. The storage is the same for all instances (they use etcd).
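
To make the mechanism concrete, here is a minimal sketch in Go of the idea as I understand it. It is not the actual patch and not Vault code: the types `node` and `sharedStorage` and all method names are made up for illustration, and the peer notification is a direct function call where the real thing sends a request over the network.

```go
package main

import (
	"fmt"
	"sync"
)

// sharedStorage is a hypothetical stand-in for the common backend (etcd in our case).
type sharedStorage struct {
	mu   sync.RWMutex
	data map[string]string
}

func (s *sharedStorage) get(key string) (string, bool) {
	s.mu.RLock()
	defer s.mu.RUnlock()
	v, ok := s.data[key]
	return v, ok
}

func (s *sharedStorage) put(key, value string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.data[key] = value
}

// node is a hypothetical stand-in for one Vault instance:
// a local cache in front of the shared storage, plus a list of peers.
type node struct {
	cache   sync.Map
	storage *sharedStorage
	peers   []*node
}

// Write stores the value, caches it locally and asks every peer
// to drop its cached copy of the key.
func (n *node) Write(key, value string) {
	n.storage.put(key, value)
	n.cache.Store(key, value)
	for _, p := range n.peers {
		p.evict(key) // assumption: a network request in the real setup
	}
}

// evict removes a key from the local cache only.
func (n *node) evict(key string) { n.cache.Delete(key) }

// Read serves from the local cache and falls back to shared storage on a miss.
func (n *node) Read(key string) (string, bool) {
	if v, ok := n.cache.Load(key); ok {
		return v.(string), true
	}
	v, ok := n.storage.get(key)
	if ok {
		n.cache.Store(key, v)
	}
	return v, ok
}

func main() {
	st := &sharedStorage{data: map[string]string{}}
	a := &node{storage: st}
	b := &node{storage: st}
	a.peers = []*node{b}
	b.peers = []*node{a}

	a.Write("secret/foo", "v1")
	v, _ := b.Read("secret/foo") // b now caches v1
	fmt.Println(v)

	a.Write("secret/foo", "v2") // evicts b's cached copy
	v, _ = b.Read("secret/foo") // b re-reads from storage
	fmt.Println(v)
}
```

This only works as long as every piece of per-instance state behaves like `cache` here, i.e. a miss leads to a storage read.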

I was testing their solution and noticed it fails a simple test like this:

  • using the root token, create a KV secret, a policy to read it, and a token with this policy on instance 1
  • the secret cannot be read with this new token on the second instance unless the policy is first read on that second instance! (At the same time, the secret can be read there with, say, the root token.)

After some debugging, the culprit turned out to be the policyTypeMap inside the policyStore:

On GetPolicy (or rather switchedGetPolicy), an attempt is made to fetch the “type” of the policy in question, but since the policy was not created or updated on this second instance, this auxiliary map doesn’t have the corresponding key.

In other words, there is a cache for policies, and alongside it there is a cache-like store for policy types, which, however, doesn’t fall back to looking up the type in storage on a “cache miss” (I made a hasty fix by adding such a fallback).
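
Roughly, the idea of the fallback looks like the sketch below. This is an illustration only and not Vault's real code: `typeCache`, `TypeStorage`, `LookupPolicyType`, `memStorage` and the `PolicyType` constants are names I invented here, and how the type is actually derived from storage is left as an assumption behind the interface.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// PolicyType is a hypothetical stand-in for the policy type enum.
type PolicyType int

const (
	PolicyTypeACL PolicyType = iota
	PolicyTypeOther
)

var ErrNotFound = errors.New("policy not found in storage")

// TypeStorage is a hypothetical view of the shared backend that can
// tell which type a stored policy has.
type TypeStorage interface {
	LookupPolicyType(name string) (PolicyType, error)
}

// typeCache is a sync.Map that consults storage on a miss
// instead of treating the miss as "policy unknown".
type typeCache struct {
	m       sync.Map
	storage TypeStorage
}

func (c *typeCache) Get(name string) (PolicyType, error) {
	if v, ok := c.m.Load(name); ok {
		return v.(PolicyType), nil
	}
	// Cache miss: the policy may have been created on another instance.
	t, err := c.storage.LookupPolicyType(name)
	if err != nil {
		return 0, err
	}
	c.m.Store(name, t)
	return t, nil
}

// Invalidate drops the entry so the next Get goes back to storage.
func (c *typeCache) Invalidate(name string) { c.m.Delete(name) }

// memStorage is an in-memory TypeStorage used only for this demo.
type memStorage struct {
	mu    sync.Mutex
	data  map[string]PolicyType
	reads int // number of storage lookups, to show the cache is used
}

func (s *memStorage) LookupPolicyType(name string) (PolicyType, error) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.reads++
	t, ok := s.data[name]
	if !ok {
		return 0, ErrNotFound
	}
	return t, nil
}

func main() {
	// The policy was written through some other instance, so it exists
	// only in shared storage, not in this instance's map.
	st := &memStorage{data: map[string]PolicyType{"kv-read": PolicyTypeACL}}
	c := &typeCache{storage: st}

	t, err := c.Get("kv-read")
	fmt.Println(t == PolicyTypeACL, err)

	_, err = c.Get("no-such-policy")
	fmt.Println(errors.Is(err, ErrNotFound))
}
```

With the multi-master amendments, `Invalidate` would also have to be called from the same peer notification that clears the main caches, otherwise a changed or deleted policy could leave a stale type behind.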

Apparently, caching was not meant to be altered the way my colleagues did it (even though it seemed fairly logical).

Now the question is: is there an easy way to get rid of such cases of “preserving internal state inside the instance besides the two main caches”? Using grep, I found other places where various maps are used… And it seems I can’t be 100% sure the solution is bullet-proof against other possible cases of caches “falling out of sync” with those internal maps.

Any other advice and suggestions are welcome. My colleagues’ goal was to increase the requests-per-second capacity of Vault, i.e. to scale it horizontally this way. (I know this can be achieved with the Enterprise version, but they simply missed the window to request its inclusion in the financial plans for the upcoming year, hence they are trying to come up with a solution based on the open-source version.)