Recent Vault releases include a new Raft storage backend, which supports HA deployments and is officially supported by HashiCorp.
Is it time to change the reference architecture to use this backend as the preferred one for clustered deployments? My understanding is that we can achieve the same benefits as a Consul-backed deployment without the extra burden of an additional cluster to deploy and manage.
Is there a scenario where a Consul-backed deployment would still be a better choice?
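For reference, here is a minimal sketch of the integrated (Raft) storage configuration I have in mind; the paths, node ID and hostnames are hypothetical placeholders:

```hcl
# Sketch of a Vault server config using the integrated Raft storage backend
# (available since Vault 1.2). Paths, node IDs and hostnames are placeholders.
storage "raft" {
  path    = "/opt/vault/data"   # local directory where this node persists its Raft data
  node_id = "vault-node-1"      # unique identifier for this node within the cluster
}

listener "tcp" {
  address       = "0.0.0.0:8200"
  tls_cert_file = "/opt/vault/tls/vault.crt"
  tls_key_file  = "/opt/vault/tls/vault.key"
}

# Addresses the other nodes use to reach this one for API and cluster traffic.
api_addr     = "https://vault-node-1.example.internal:8200"
cluster_addr = "https://vault-node-1.example.internal:8201"
```

Additional nodes can then be joined to the leader with `vault operator raft join <leader-api-addr>`, so the whole cluster is just Vault talking to Vault, with no external storage tier.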
This is a great question, and something I’ve been wondering about myself since the feature was announced. I think it really depends on multiple factors and on the specific implementation in each environment.
I think I would prefer to use Consul as my backend if my Vault nodes were running as containers, or if they were VMs and I was using Consul to automate Vault’s clustering.
I would probably want to use Vault’s internal backend if my Vault nodes were long-running VMs that were static and locked down, and I needed to avoid any other architectural dependencies. This removes any dependency on Consul and keeps Vault separate from everything else, which is also a security benefit.
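For contrast, the Consul-backed option only really changes the storage stanza; each Vault node typically talks to a local Consul agent that is joined to the separate Consul cluster. A sketch, with the usual default address and path shown as placeholders:

```hcl
# Sketch of the equivalent storage stanza for a Consul-backed deployment.
# Vault talks to a local Consul agent; replication and HA coordination are
# handled by the separate Consul cluster.
storage "consul" {
  address = "127.0.0.1:8500"  # local Consul agent
  path    = "vault/"          # KV prefix under which Vault stores its data
}
```

The rest of the server configuration (listener, api_addr, cluster_addr) stays the same, so the choice is really about who owns replication and availability: Vault itself, or a Consul cluster you also have to operate.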
That’s exactly what I was thinking: when would I want something different than that, at least for production environments?
As I was discussing recently with people looking to adopt Vault, one should not underestimate how critical Vault becomes to your infrastructure once you start using it, so you should take every step to create a very, very stable environment.
Anything short of that risks creating availability issues that are even more critical to an organization than the security issues Vault tries to solve.
Please note that I’m not advocating against using Consul as a backend (or any other, BTW). I’m just wondering whether Raft should be the new reference now that we have this option, unless there’s a very good reason not to do so.
Raft isn’t fully ready for production yet. In 1.2 it was released as a technical preview. This is subject to change, but I believe the current plan is that it will be in beta in 1.3 and fully released in 1.4.
Is this still roadmapped to be a fully supported, non-beta feature in 1.4?
I haven’t yet seen any discussion about performance and tuning of the Raft backend. If we’re using Raft consensus to ensure consistency and manage cluster leadership, it seems like we should have some knobs exposed, along with general guidance on node counts, latency between nodes, and how those affect the time to consistency.
I’m looking forward to this, but it would be nice to be able to game-plan in advance.
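Until official guidance appears, the usual Raft arithmetic at least bounds the node-count question: a write has to be acknowledged by a quorum of floor(n/2) + 1 nodes, so 3 nodes tolerate 1 failure, 5 tolerate 2, and 7 tolerate 3, while each extra node is one more replica that has to acknowledge every write. That also means the round-trip time between the leader and the quorum effectively sets a floor on write latency, which is why keeping inter-node latency low probably matters more than raw node count.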
I too am very much in the same boat. I would like to avoid deploying another cluster such as Consul or etcd if Raft works well, in particular because I currently only have one AZ, so I am limited to node-level HA. On the other hand, the low latency within a single AZ seems perfect for Raft.
I was reading this article (a really good piece, BTW) and this caught my attention:
The large size is for production environments where there is a consistent high workload. That might be a large number of transactions, a large number of secrets, or a combination of the two.
It would be nice to define what a "consistent high workload" is in more concrete terms. I’m quite confident that the answer is "it depends", but a general rule of thumb would be helpful here. For instance: consider using the "large" instances when you have to sustain more than "X" KV secret requests per second.
I was also very interested in the numbers someone posted for another storage engine. It would be great to see a blog post with those values, and with the best way to do a performance test.
For now, I’m sticking with the officially supported backends (Consul and etcd). I’m looking forward to the 1.4 release to start moving our Vault cluster to the Raft backend.