Network latency requirement for Vault with Raft

According to the documentation, Vault requires less than 8ms of network latency between Vault nodes, but that is not possible for a Vault HA cluster spanning two zones/DCs.

What are the implications, or what would need to be considered, if latency between zones is, say, ~18ms?
Is that even a recommended deployment model?

That’s what performance replicas are for. They extend your infrastructure and reduce latency, but a replica is a separate cluster that shares your engines. All writes still end up going back to the primary cluster, but reads, leases, etc. are all handled locally at the PR.

Right, but that is an Enterprise feature. I should have mentioned that OSS is in scope here.

Then I doubt you can extend the cluster that far.

That is what I am trying to understand: if latency cannot be reduced, will the cluster be operational at all? Obviously reads/writes will be impacted, but what other internal functionality of Vault will be impacted?

Not only are reads and writes impacted, but also Raft stability. Raft requires less than 8ms latency to be stable.

If Raft is not stable, then the entire Vault cluster won’t be either.

I disagree that it’s right to describe this as a stability issue.

I’m not sure where that 8ms figure came from, but out of the box, Raft operates with a leader lease timeout of 2500ms and heartbeat and election timeouts of 5000ms - source: Server Performance | Consul by HashiCorp

(The same library is used in Vault and Consul)

As 8ms is much smaller than these timeouts, Raft should cope fine with latency larger than 8ms… but the increased latency is likely to reduce the achievable write throughput to the data store.
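As a rough illustration of both points, the sketch below compares an inter-node round-trip time against the timeouts quoted above and computes a crude ceiling on strictly serialized commits. It assumes the 8ms and 18ms figures are round-trip times and that each commit waits for one full round trip, and it ignores the batching and pipelining a real Raft implementation does, so treat the numbers as order-of-magnitude only. The function and constant names are made up for this example.

```python
# Timeouts quoted above (out-of-the-box values, in milliseconds).
TIMEOUTS_MS = {
    "leader_lease": 2500,
    "heartbeat_election": 5000,
}


def timeout_headroom(rtt_ms: float) -> float:
    """How many round trips fit inside the smallest quoted timeout."""
    return min(TIMEOUTS_MS.values()) / rtt_ms


def serialized_commit_ceiling(rtt_ms: float) -> float:
    """Crude upper bound on commits per second, assuming every commit
    waits for one full round trip and nothing is batched."""
    return 1000.0 / rtt_ms


for rtt in (8, 18):
    print(
        f"{rtt}ms RTT: timeout headroom x{timeout_headroom(rtt):.1f}, "
        f"ceiling ~{serialized_commit_ceiling(rtt):.1f} commits/s"
    )
```

Under these assumptions, 18ms still leaves more than a hundredfold margin against the leader lease timeout, while the serialized ceiling drops from 125 to roughly 56 commits per second.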

This could be anything from completely fine, in a fairly quiet Vault cluster, to a major performance problem for a Vault cluster that needs to serve busy traffic - but it shouldn’t cause the cluster to fail to operate at all.

The details are in the ‘write throughput’.

A lot of activity in Vault counts as a write operation, meaning some change to storage. E.g., a login request is a write request because a token is generated. A dynamic secret request is also a write request because a lease is generated.
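To make that concrete, here is a small sketch that tallies how many requests in a workload end up touching storage. The operation labels and the sample request mix are invented for illustration. Only the login and dynamic secret cases come from the description above, and treating a plain static secret read as a non-write is my assumption.

```python
# Which request types result in a storage write (hypothetical labels).
WRITES_TO_STORAGE = {
    "login": True,           # a token is generated
    "dynamic_secret": True,  # a lease is generated
    "static_read": False,    # assumption: a plain read changes nothing
}


def storage_writes_per_second(requests_per_second: dict) -> float:
    """Sum the rates of all request types that write to storage."""
    return sum(
        rate
        for op, rate in requests_per_second.items()
        if WRITES_TO_STORAGE[op]
    )


# Invented example mix, in requests per second.
workload = {"login": 20.0, "dynamic_secret": 15.0, "static_read": 200.0}
print(storage_writes_per_second(workload))
```

In this made-up mix, only 35 of the 235 requests per second are writes, and that is the number to weigh against whatever write throughput the inter-zone latency allows.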

With this in mind, I guess the latency requirement is more justified.