Currently, I have vault v1.0.3 configured with the KV1 secrets engine. We are planning to upgrade to the latest vault, v1.8.x.
There are a few things I would like your input and suggestions on:
- (Knowing that skipping versions works, as per the vault forum/discuss) Should I upgrade directly from v1.0.3 to v1.8.x and keep the KV1 secrets engine? I don’t know if this is possible.
- Or should I convert KV1 to KV2 on v1.0.3, test it out, and then upgrade to vault v1.8.x?
Also for upgrading KV1 to KV2, as per this how-to, I was thinking the following steps:
- Set up maintenance window
- To prepare for minimal downtime, duplicate all the existing ACL Rules (this is due to the path difference between KV1 and KV2, as per the documentation above).
- Issue command to enable kv-2.
- Test it out (from the CLI and from the application that is using it) using a new KV2 path.
- Once everything is OK, delete the old KV1 ACL Rules.
Let me know if the above is “good enough” to both upgrade to v1.8.x and migrate to KV2.
Your storage backend is probably more important during this upgrade than vault’s upgrade.
!!Back up your data!! COPY the data to a remote location!
For vault – what you can do is join new Vault instances at 1.8 to your cluster, then step-down the old one and have the 1.8 become leader … and shutdown your old instance.
If you’re upgrading your backend and vault then I would just set up a whole new instance of 1.8+storage … With that you can do a storage migration from the old instance to the new instance. Test the new one and, if everything works, shut down the old one.
kv → kv2 is simple and does not need to be part of your “upgrade”. It can be done at any time. I’d suggest waiting until your 1.8 instance is up and running and then running ‘vault kv enable-versioning’ on the mount, which will change it to kv-v2.
Thanks @aram for the input. Really appreciate that.
Currently, I have 3 Pods with vault v1.0.3. Based on your comment, if I want to upgrade vault to v1.8.x:
- Spin up 1 Pod with vault v1.8.
- Join that newly created vault v1.8 Pod to the pool as worker.
- Step down the old vault v1.0.3 Pods.
- From the above step, newly created vault v1.8 Pod will become a leader.
- Create 2 additional vault v1.8 Pods that will be workers.
Let me know if the above steps are incorrect.
For kv → kv2 since the new data/ path is introduced, I still have to duplicate all my ACL Rules by adding data/ path, right?
Don’t forget about your storage. That could be a much more complicated process - depending on what it is.
Yes, the steps of the vault upgrade are correct.
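As a rough sketch of the hand-over step, the snippet below only builds the commands and does not run them. The pod names are hypothetical, and it assumes the pods run on Kubernetes, that the new pod is already unsealed against the same storage, and that a valid token is available inside the pods. Note that if other old standbys are still running and unsealed, one of them may take over leadership instead of the new pod, so you may need to stop them first or repeat the step-down.

```python
from typing import List


def step_down_plan(new_pod: str, active_old_pod: str) -> List[str]:
    """Build (but do not run) the commands that hand leadership to new_pod."""
    if new_pod == active_old_pod:
        raise ValueError("new pod and old active pod must differ")
    return [
        # 1. The new pod should report Sealed=false and HA Mode=standby.
        f"kubectl exec {new_pod} -- vault status",
        # 2. Ask the current leader to give up active duty.
        f"kubectl exec {active_old_pod} -- vault operator step-down",
        # 3. Check whether the new pod now reports HA Mode=active.
        f"kubectl exec {new_pod} -- vault status",
    ]


if __name__ == "__main__":
    # Hypothetical pod names, for illustration only.
    for command in step_down_plan("vault-new-0", "vault-0"):
        print(command)
```

Printing the plan first makes it easy to review the order before running anything during the maintenance window.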
After you run ‘enable-versioning’ then yes, in your policies you need to replace ‘kv/’ with ‘kv/data/’ AND ‘kv/metadata/’.
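To make the path change concrete, here is a minimal sketch that maps one KV1 policy path to the KV2 paths that cover it. The function names are made up for illustration, and the split of capabilities is an assumption: it sends "list" to metadata/ and everything else to data/, which covers the common read/write case but not delete/, undelete/ or destroy/ paths. Review the generated policy before writing it to vault.

```python
from typing import Dict, List


def kv2_policy_paths(mount: str, v1_path: str, capabilities: List[str]) -> Dict[str, List[str]]:
    """Map a KV1 policy path under `mount` to its KV2 data/ and metadata/ paths."""
    mount = mount.strip("/")
    prefix = mount + "/"
    if not v1_path.startswith(prefix):
        raise ValueError(f"{v1_path!r} is not under mount {mount!r}")
    rest = v1_path[len(prefix):]
    # Assumption: listing goes to metadata/, all other capabilities to data/.
    data_caps = [c for c in capabilities if c != "list"]
    meta_caps = [c for c in capabilities if c == "list"]
    paths: Dict[str, List[str]] = {}
    if data_caps:
        paths[f"{mount}/data/{rest}"] = data_caps
    if meta_caps:
        paths[f"{mount}/metadata/{rest}"] = meta_caps
    return paths


def render_hcl(paths: Dict[str, List[str]]) -> str:
    """Render the mapped paths as vault policy HCL blocks."""
    blocks = []
    for path, caps in paths.items():
        cap_list = ", ".join(f'"{c}"' for c in caps)
        blocks.append(f'path "{path}" {{\n  capabilities = [{cap_list}]\n}}')
    return "\n\n".join(blocks)


if __name__ == "__main__":
    # Hypothetical KV1 rule: path "kv/app/*" with read and list.
    print(render_hcl(kv2_policy_paths("kv", "kv/app/*", ["read", "list"])))
```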
Not to be pedantic, but just to get the terminology right … the names are “leader” and “standby”, rather than “leader” and “worker”.
For backend storage, I am using Multi-regional GCP bucket.
@aram could you elaborate a bit on how the storage part could become a complicated process?
I am using OSS vault. I don’t think OSS vault supports Replication (Leader/Follower). When you mentioned Leader/Follower, were you referring to OSS or Enterprise Vault?
I’m going to assume you mean GCS (Google Cloud Storage)? In that case nevermind on the storage upgrades. I had assumed it was consul or a database since that’s the out of the box recommendation from Hashicorp.
As far as the leader/standby goes, I think you’re confusing HA (leader/standby) with replication. You always want n+1 vault instances available: one is the leader node, the rest are standby nodes. In OSS the standby nodes do not answer requests themselves; they forward client requests to the leader node, which is responsible for all reads, write operations, storage updates, etc. (standbys that answer read requests locally are performance standbys, which are an Enterprise feature). As far as I understand it, HA itself is included in the OSS license. (The minimum recommended is 3 nodes; 5 is better.)
DR = Disaster recovery is an Enterprise feature, used to fail over a primary instance to a DR site. It’s a cold-standby and does replicate the data, however it isn’t active and cannot respond to any queries.
Replication = as in performance replication, which is also an Enterprise feature and requires a separate instance license. These are again read-only nodes that can also manage their own client leases, and only need to go to the primary instance when an update happens. It reduces the load on the primary instance+leader node, shortens the latency, and provides a locally cached set of the secrets.
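If you want to see which role each node currently has, one option is the sys/health endpoint, which answers with a different HTTP status code per state. The sketch below uses the default codes; the helper names and the example address are made up, and it assumes the node is reachable over plain HTTP(S) without extra TLS setup.

```python
import urllib.error
import urllib.request

# Default status codes returned by GET /v1/sys/health.
HEALTH_STATES = {
    200: "initialized, unsealed, active (leader)",
    429: "unsealed, standby",
    472: "disaster recovery secondary (Enterprise)",
    473: "performance standby (Enterprise)",
    501: "not initialized",
    503: "sealed",
}


def describe_health(status_code: int) -> str:
    """Translate a sys/health status code into a readable node state."""
    return HEALTH_STATES.get(status_code, f"unexpected status {status_code}")


def check_node(base_url: str, timeout: float = 5.0) -> str:
    """Query one node's sys/health endpoint and describe its state."""
    url = base_url.rstrip("/") + "/v1/sys/health"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return describe_health(resp.status)
    except urllib.error.HTTPError as err:
        # urllib raises on non-2xx codes, which is how standby and sealed nodes answer.
        return describe_health(err.code)


if __name__ == "__main__":
    # Hypothetical address; point this at each of your pods in turn.
    print(check_node("http://127.0.0.1:8200"))
```

On OSS you should only ever see the leader, standby, not initialized, or sealed states; the two Enterprise codes are listed so the mapping lines up with the DR and performance replication descriptions above.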