Recent Vault releases include a new Raft storage backend, which supports HA deployments and is officially supported by HashiCorp.
Is it time to change the reference architecture to use this backend as the preferred one for clustered deployments? My understanding is that we can achieve the same benefits as a Consul-backed deployment without the extra burden of an additional cluster to deploy and manage.
Is there a scenario where a Consul-backed deployment would still be a better choice?
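For reference, here is a minimal sketch of the integrated (Raft) storage configuration I have in mind; the paths, node ID and hostnames are hypothetical placeholders:

```hcl
# Sketch of a Vault server config using the integrated Raft storage backend
# (available since Vault 1.2). Paths, node IDs and hostnames are placeholders.
storage "raft" {
  path    = "/opt/vault/data"   # local directory where this node persists its Raft data
  node_id = "vault-node-1"      # unique identifier for this node within the cluster
}

listener "tcp" {
  address       = "0.0.0.0:8200"
  tls_cert_file = "/opt/vault/tls/vault.crt"
  tls_key_file  = "/opt/vault/tls/vault.key"
}

# Addresses the other nodes use to reach this one for API and cluster traffic.
api_addr     = "https://vault-node-1.example.internal:8200"
cluster_addr = "https://vault-node-1.example.internal:8201"
```

Additional nodes can then be joined to the leader with `vault operator raft join <leader-api-addr>`, so the whole cluster is just Vault talking to Vault, with no external storage tier.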
This is a great question, and something I’ve been wondering about myself since the feature was announced. I think it really depends on multiple factors and on the specific implementation in each environment.
I think I would prefer to use Consul as my backend if my Vault nodes were running as containers, or if they were VMs and I was using Consul to automate Vault’s clustering.
I would probably want to use Vault’s internal backend if my Vault nodes were long-running VMs that were static and locked down, and I needed to avoid any other architectural dependencies. This removes any dependency on Consul and keeps Vault separate from everything else, which is also a security benefit.
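For contrast, the Consul-backed option only really changes the storage stanza; each Vault node typically talks to a local Consul agent that is joined to the separate Consul cluster. A sketch, with the usual default address and path shown as placeholders:

```hcl
# Sketch of the equivalent storage stanza for a Consul-backed deployment.
# Vault talks to a local Consul agent; replication and HA coordination are
# handled by the separate Consul cluster.
storage "consul" {
  address = "127.0.0.1:8500"  # local Consul agent
  path    = "vault/"          # KV prefix under which Vault stores its data
}
```

The rest of the server configuration (listener, api_addr, cluster_addr) stays the same, so the choice is really about who owns replication and availability: Vault itself, or a Consul cluster you also have to operate.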
That’s exactly what I was thinking: when would I want something different than that, at least for production environments?
As I was discussing recently with people looking to adopt Vault, one should not underestimate how critical Vault becomes to your infrastructure once you start using it, so you should take every step to create a very, very stable environment.
Anything short of that risks creating availability issues that are even more critical to an organization than the security issues Vault tries to solve.
Please note that I’m not advocating against using Consul as a backend (or any other, BTW). I’m just wondering whether Raft should be the new reference now that we have this option, unless there’s a very good reason not to do so.
Raft isn’t fully ready for production yet. In 1.2 it was released as a technical preview. This is subject to change, but I believe the current plan is that it will be in beta in 1.3 and fully released in 1.4.
Is this still roadmapped to be a fully supported, non-beta feature in 1.4?
I haven’t yet seen any discussion about performance and tuning of the Raft backend. If we’re using Raft consensus to ensure consistency and manage cluster leadership, it seems like we should have some knobs exposed, along with general guidance on node counts, latency between nodes, and how those affect the time to consistency.
I’m looking forward to this, but it would be nice to be able to game-plan in advance.
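Until official guidance appears, the usual Raft arithmetic at least bounds the node-count question: a write has to be acknowledged by a quorum of floor(n/2) + 1 nodes, so 3 nodes tolerate 1 failure, 5 tolerate 2, and 7 tolerate 3, while each extra node is one more replica that has to acknowledge every write. That also means the round-trip time between the leader and the quorum effectively sets a floor on write latency, which is why keeping inter-node latency low probably matters more than raw node count.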
I too am very much in the same boat. I would like to avoid deploying another cluster such as Consul or etcd if Raft works well, in particular because I currently only have one AZ, so I am limited to node-level HA. On the other hand, the low latency within a single AZ seems perfect for Raft.
I was reading this article (a really good piece, BTW) and this caught my attention:
The large size is for production environments where there is a consistent high workload. That might be a large number of transactions, a large number of secrets, or a combination of the two.
It would be nice to define what a "consistent high workload" is in more concrete terms. I’m quite confident that the answer is "it depends", but a general rule of thumb would be helpful here. For instance: consider using the "large" instances when you have to sustain more than "X" KV secret requests per second.
I was also very interested in the numbers someone posted for another storage engine. It would be great to see a blog post with those values, and with the best way to do a performance test.
For now, I’m sticking with the officially supported backends (Consul and etcd). I’m looking forward to the 1.4 release to start moving our Vault cluster to the Raft backend.