Vault backend migration from Consul to Raft (K8S, Helm, KMS auto-unseal)

Hi everyone.

Currently we have a Kubernetes cluster running on AWS EKS, with a Vault OSS cluster running on it, using Consul as the storage backend. Vault was installed and is managed with the Helm chart, and is auto-unsealed with AWS KMS.

Now we need to migrate the Consul backend to Integrated Storage with Raft. As far as I know, there is no specific documentation about this procedure and honestly I have hundreds of doubts.

Has anyone done this before that can help? How can I migrate the backend and then reconcile the Helm configuration?

I have not done this before, but your question is interesting to me, as I think I might need to do something similar in the future…

As you have noticed, whilst Helm & Kubernetes make the initial deployment quite easy, they can make complex operations later on rather harder…

First, I tried helm upgrade from a Consul-storage deployment to a Raft-storage deployment, just to see what would happen…

Error: UPGRADE FAILED: cannot patch "vault" with kind StatefulSet: StatefulSet.apps "vault" is invalid: spec: Forbidden: updates to statefulset spec for fields other than 'replicas', 'template', 'updateStrategy', 'persistentVolumeClaimRetentionPolicy' and 'minReadySeconds' are forbidden

It seems the problem is the Helm chart wants to change volumeClaimTemplates so that PVCs are generated for the Vault StatefulSet … which makes sense … but also isn’t supported by Kubernetes.

Oh well… Vault storage migration requires downtime anyway, so the fact we have to delete the StatefulSet to replace it isn’t really making things worse.

Where things start getting trickier, is that we need somewhere to run the storage migration… meaning we need somewhere with access to the mounted Persistent Volume, whilst Vault is not running…

Trying to put all these constraints together, I came up with the following rough draft of a migration plan…

But before that - in the middle of the procedure, we’re going to need a way to actually run the migration, and when we do, we need:

  • Access to the Vault CLI binary
  • Access to the data volume
  • Vault server to NOT be running

I can’t see any way to make that happen using the existing server pods, since if you kill the server process, the pod will terminate.

That means we need to make our own “maintenance” pod definition, and if we’re using Helm anyway, we might as well create the “maintenance” pod using it too.

So… make sure you’re using a local copy of the Vault Helm chart so you can easily make modifications, and copy the templates/server-statefulset.yaml file to set up a new StatefulSet that will define our optional “maintenance” pod (a rough sketch of the end result follows the list below):

  • The metadata.name will need to be different to distinguish it, as will the spec.serviceName (add a suffix -maint?)
  • component: server will need to change to something like component: maintenance to set it apart (in both places it appears - the selector and the pod template labels)
  • Various other optional parts of the YAML might be applicable only to the running servers and not a maintenance pod, depending on what you have configured
  • The readinessProbe, livenessProbe, lifecycle, and the template code that renders the volumeClaimTemplates are not wanted for a maintenance pod
  • But we need to add in an explicit mention of the volume we want to mount instead to the volumes section:
        - name: data
          persistentVolumeClaim:
            claimName: data-vault-0
  • As well as deleting the args and changing the command so we run a dummy command instead of starting a real Vault server:
          command:
          - /usr/local/bin/docker-entrypoint.sh
          - sleep
          - 999d
  • And we’ll set spec.replicas to 0 so that we only have a maintenance pod when we manually scale up this StatefulSet.
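
With those tweaks, the rendered maintenance StatefulSet would end up roughly like the sketch below. This is untested and hedged: names like vault-maint, the image tag, and data-vault-0 all assume a chart release named vault, so adjust to your environment.

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: vault-maint                      # distinct from the real "vault" StatefulSet
spec:
  serviceName: vault-internal-maint
  replicas: 0                            # scale to 1 only during the migration window
  selector:
    matchLabels:
      app.kubernetes.io/name: vault
      component: maintenance
  template:
    metadata:
      labels:
        app.kubernetes.io/name: vault
        component: maintenance
    spec:
      volumes:
        - name: data
          persistentVolumeClaim:
            claimName: data-vault-0      # the first server's existing data PVC
      containers:
        - name: vault
          image: hashicorp/vault:1.12.0  # match whatever image your servers run
          command:
            - /usr/local/bin/docker-entrypoint.sh
            - sleep
            - 999d
          volumeMounts:
            - name: data
              mountPath: /vault/data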

With all of that prepared …

  1. Schedule planned downtime in advance
  2. Scale the Vault StatefulSet to zero replicas (the Vault service is now offline)
  3. Consider taking a backup just to be safe… although we’ll be leaving the old Consul pods in existence so in a way they are a backup themselves.
  4. Manually kubectl delete the StatefulSet, since we need to replace it
  5. helm upgrade the Vault chart to values that specify Raft storage
  6. Scale the new Vault StatefulSet to zero replicas, because once it has initialised the volumes, we need Vault not running to do the storage migration
  7. Scale the maintenance StatefulSet to 1
  8. kubectl exec -it podname -- sh into the maintenance pod
  9. In the maintenance pod interactive session, create a configuration file for vault operator migrate and run the migration. Before you start, you may want to wipe the initial contents of /vault/data/, which were created when the new server pod first started up and initialised an empty database. (See the sketch after this list for roughly what the migrate configuration might look like.)
  10. Scale the maintenance StatefulSet to 0, and the main StatefulSet back to your desired number of replicas
  11. Depending on the details of your configuration, it’s possible all your replicas find each other and replicate the migrated data to the other nodes, or perhaps some executions of vault operator raft join are needed.
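
For step 9, a minimal migrate.hcl sketch might look something like the following - the Consul address and path shown are placeholders, so substitute whatever your existing Consul storage stanza actually uses:

storage_source "consul" {
  address = "consul-server.consul.svc:8500"   # placeholder - your real Consul address
  path    = "vault/"                          # the path your Consul backend is configured with
}

storage_destination "raft" {
  path = "/vault/data"
}

cluster_addr = "https://vault-0.vault-internal:8201"

and then, inside the maintenance pod, run vault operator migrate -config=migrate.hcl while no Vault server is running.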

Not at all tested in full! I was pretty far into “thought experiment” territory by the end of typing all that. But, hopefully it’s a decent source of inspiration if you want to work through productionising something based on this.

It occurs to me that since volumeClaimTemplates can’t be easily updated, resizing the data volume later will be a hassle. Better make sure to plan suitable sizing carefully.


It’s interesting the way we’ve arrived at almost the same conclusions, except that you have a better idea about completely re-defining the StatefulSet.

I already have a lab environment where I will be able to test your suggestion. I’ll keep you posted about the results.

Thanks!

Given how much time has elapsed since your post, I doubt that this will be helpful to you, but for the benefit of anyone else facing the same problem, I’ll post about our approach.

We plan on using an initContainer within the StatefulSet’s Pod. The vault-helm Chart provides the ability to specify additional initContainers simply by setting the .Values.server.extraInitContainers value.
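
For illustration, the values override might look roughly like this - the container name, image tag, and the idea of mounting a migrate.hcl via server.extraVolumes (which the chart mounts under /vault/userconfig/<name>) are our own assumptions, and the override would be removed again once the migration has been done:

server:
  extraInitContainers:
    - name: storage-migrate
      image: hashicorp/vault:1.12.0        # match the server image
      command: ["/bin/sh", "-ec"]
      args:
        # one-off migration, run before the main Vault container starts
        - vault operator migrate -config=/vault/userconfig/migrate/migrate.hcl
      volumeMounts:
        - name: data                       # the chart's data volume
          mountPath: /vault/data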

I suppose that could work, and it means you only have to write container YAML instead of pod YAML. However, it feels a bit of a forced abstraction to me, and means you could suffer unexpected startup of the main Vault container if your initContainer exits.

IMO creating a separate StatefulSet is nicer.

@maxb I’ve been working on a migration script today based on the ideas in this comment. I ran a dry-run of the Helm chart and copied the server StatefulSet into a migrator.yaml file, changed the name and container args to something that loops indefinitely, and hard-coded the data volume claim name. Here is pretty much what I used:

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: vault-migrator
spec:
  serviceName: vault-internal
  podManagementPolicy: Parallel
  replicas: 1
  updateStrategy:
    type: OnDelete
  selector:
    matchLabels:
      app.kubernetes.io/name: vault
      app.kubernetes.io/instance: vault-migrator
      component: server
  template:
    metadata:
      labels:
        helm.sh/chart: vault-0.22.1
        app.kubernetes.io/name: vault
        app.kubernetes.io/instance: vault-migrator
        component: server
    spec:
      terminationGracePeriodSeconds: 10
      serviceAccountName: vault-migrator

      securityContext:
        runAsNonRoot: true
        runAsGroup: 1000
        runAsUser: 100
        fsGroup: 1000
      hostNetwork: false

      volumes:
        - name: data
          persistentVolumeClaim:
            claimName: data-vault-0
        - name: home
          emptyDir: {}

      containers:
        - name: vault
          resources:
            limits:
              cpu: 1000m
              memory: 1Gi
            requests:
              cpu: 0
              memory: 0

          image: my.image/vault:1.12.0
          imagePullPolicy: IfNotPresent
          command:
          - "/bin/sh"
          - "-ec"
          args:
          - |
            while true; do sleep 5; done
          securityContext:
            allowPrivilegeEscalation: false
          env:
            - name: HOST_IP
              valueFrom:
                fieldRef:
                  fieldPath: status.hostIP
            - name: POD_IP
              valueFrom:
                fieldRef:
                  fieldPath: status.podIP
            - name: VAULT_K8S_POD_NAME
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name
            - name: VAULT_K8S_NAMESPACE
              valueFrom:
                fieldRef:
                  fieldPath: metadata.namespace
            - name: VAULT_ADDR
              value: "https://127.0.0.1:8200"
            - name: VAULT_API_ADDR
              value: "https://$(POD_IP):8200"
            - name: SKIP_CHOWN
              value: "true"
            - name: SKIP_SETCAP
              value: "true"
            - name: HOSTNAME
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name
            - name: VAULT_CLUSTER_ADDR
              value: "https://$(HOSTNAME).vault-migrator-internal:8201"
            - name: HOME
              value: "/home/vault"
          volumeMounts:
            - name: data
              mountPath: /vault/raft
            - name: home
              mountPath: /home/vault
          ports:
            - containerPort: 8200
              name: https
            - containerPort: 8201
              name: https-internal
            - containerPort: 8202
              name: https-rep
      imagePullSecrets:
        - name: registry-creds

The basic operations were just like yours. I installed Vault with raft enabled. Scaled to 0. Applied the vault-migrator sts with 1 replica. Copied in a migrate.hcl file. Exec’d in to run ‘vault operator migrate’. Then scaled the migrator sts to 0, and scaled the raft sts to 1. I exec’d into the raft container and ‘vault operator raft list-peers’ showed the host I was on.
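
In shell terms, the sequence was roughly the following (a sketch - it assumes the release is named vault, the migrator StatefulSet shown above, and that you have a valid token before running list-peers):

kubectl scale statefulset vault --replicas=0
kubectl apply -f migrator.yaml
kubectl cp migrate.hcl vault-migrator-0:/home/vault/migrate.hcl
kubectl exec -it vault-migrator-0 -- vault operator migrate -config=/home/vault/migrate.hcl
kubectl scale statefulset vault-migrator --replicas=0
kubectl scale statefulset vault --replicas=1
kubectl exec -it vault-0 -- vault operator raft list-peers -tls-skip-verify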

The problem comes when I scale my Vault instance to 2 and run ‘vault operator raft join’. No matter what I add, it never shows up in the list of peers. That pod continuously outputs that it’s sealed and restarts.

Surely an error is being logged somewhere but I’m not finding it, or not looking in the right place. Any ideas or suggestions?

vault operator raft join starts the join process, but it won’t complete until the node being joined is subsequently unsealed. Only once unsealed does it have the necessary cryptographic material to prove to the existing cluster that it is allowed to join.

That’s odd, a pod should not be restarting just because it is sealed - logically it needs to stay running to receive unseal keys.

Okay that makes sense that the pod would need to be unsealed before it joins. I’m using GCP so I would expect the other pods to unseal automatically as well. But if they’re not migrated they would never have been initialized.

I’m wondering if I should bring up the raft cluster, join all the nodes using something like Shamir. Then scale to 0 and perform the migration from the helper. Then scale the helper to 0 and one of the real vault sts to 1. That one pod should work like it currently does. I can get one migrated successfully. Then when I scale to 2 or 3 … I can unseal the pods manually, and then join to raft and hope it syncs the raft data from the first that was migrated.

Think that would work? I’m worried there might be data corruption unless raft will handle the existing data from the temporary initialization.

If you are using GCP Cloud KMS auto-unseal, and have the relevant configuration in your Vault server configuration file or supplied via environment variables, then yes, they should be able to automatically unseal when joining. I assumed that wasn’t the case, as they weren’t joining.

I’m pretty sure migrating into an existing 3-node Raft cluster won’t go as you are hoping it would, I suspect vault operator migrate will either refuse to work, or it will forcibly kick all other nodes out of the cluster it is targeting as destination.

Instead, you need to go back to looking at fixing the migration process you were already using.

The StatefulSet definition you posted looks weird - it has random bits of corruption:

podManagementPolicvaultraftel ?!?!

However, in it I see a volume claim name which seems wrong, as you’d want to migrate into one of the PVCs that will be used by one of your actual final Vault nodes, i.e. data-vault-0 - otherwise, how will your data actually get into the final Raft cluster?

Ah, sorry about that messed-up StatefulSet - I think I fixed the corrupted pieces. I also changed the volume name to match the intention. It looked like the wrong PVC, but I’m working with the real Vault PVC, not one for the migrator. I removed the volumeClaimTemplate so this STS should only be using an existing volume.

Until I run a migration on the first replica, none of them auto-unseal with GCP. They just go into a crash loop outputting lines like this (all of the pods do this - when one is migrated it will unseal and function normally - the rest won’t and can’t be joined),

vault-1 vault 2022-12-20T17:12:15.744Z [WARN]  failed to unseal core: error="stored unseal keys are supported, but none were found"
vault-1 vault 2022-12-20T17:12:15.969Z [INFO]  core: security barrier not initialized
vault-1 vault 2022-12-20T17:12:15.969Z [INFO]  core.autoseal: seal configuration missing, but cannot check old path as core is sealed: seal_type=recovery

Update: I’m not sure if it’s relevant or not, but when I list peers I see the cluster_addr as the address.

/ $ vault operator raft list-peers -tls-skip-verify
Node                                    Address                                State     Voter
----                                    -------                                -----     -----
d14c8e2e-a816-9445-47a7-320aa33fc91a    vault-0.vault-internal:8201            leader    true

I initialized the vault-1 pod and it auto-unsealed with GCP. But I still can’t get it to join. I’m trying commands like the following but they never join the pool.

vault operator raft join -tls-skip-verify vault-1.vault-internal:8201
vault operator raft join -tls-skip-verify https://10.200.67.192:8200 # this is the pod IP for vault-1

I think I’m going to try adding a retry_join block in the storage config to see if auto-joining is any better.

I think I may have misunderstood how the join command works. I think I should be exec’ing into the pods that didn’t have the migration run on them, and running the join command there, pointing at the migrated pod. Something like this on each of the remaining pods:

vault operator raft join -tls-skip-verify https://vault-0.vault-internal:8200

I’m just not sure what the host is supposed to be there. I’ve also tried the pod’s IP address. I get errors like this:

Error joining the node to the Raft cluster: Post "https://10.200.84.233:8200/v1/sys/storage/raft/join": Gateway Timeout

I was looking at this tutorial (steps 10 and 11) hoping it might help. Vault Installation to Minikube via Helm with Integrated Storage | Vault | HashiCorp Developer

Part of the problem is I have to run that join command quickly because the pods restart every minute or so.

I can run these from other pods in the cluster and it gets the index.html fine - I don’t know what the gateway issue is about.

wget --no-check-certificate https://10.200.84.233:8200/ui/
wget --no-check-certificate https://vault-0.vault-internal:8200/ui/

Yes, indeed, join is a command you run for individual nodes, to tell them where to find the rest of the cluster.

You don’t have to exec in to each pod, but it can be a convenient way to do it. Alternatively you can send the command over HTTP to each pod, from outside the pod.
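
For example, something along these lines (adjust addresses and TLS verification to your environment; both commands are directed at the node that should join, telling it where an existing member is):

# point the CLI at the joining node (vault-1), passing the leader's API address
VAULT_ADDR=https://vault-1.vault-internal:8200 vault operator raft join -tls-skip-verify https://vault-0.vault-internal:8200

# or call the (unauthenticated) join endpoint on the joining node directly
curl -k -X POST \
  -d '{"leader_api_addr": "https://vault-0.vault-internal:8200"}' \
  https://vault-1.vault-internal:8200/v1/sys/storage/raft/join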

Actually, I prefer to just configure the HTTP addresses of each node directly in the Vault configuration file, in retry_join blocks - then they just try to join automatically:

storage "raft" {
  path = "/vault/data"
  retry_join {
    leader_api_addr = "http://vault-0.vault-internal:8200"
  }
  retry_join {
    leader_api_addr = "http://vault-1.vault-internal:8200"
  }
  retry_join {
    leader_api_addr = "http://vault-2.vault-internal:8200"
  }
}

A join command must be given a URL to the Vault API address of another node which is already an unsealed active member of the cluster to be joined.

Gateway Timeout is a weird error to be getting there… can your local Vault CLI (the one where you are running vault operator raft join) even reach the pod? Is it ending up talking to some proxy server instead, somehow?

I am convinced this is the real problem - the pods should not be restarting, and you need to figure out why they are and fix that, first.

I think this guide might help to do such a thing (it worked for me)

I’ve been thinking about the last two comments. I removed the liveness and readiness probes as mentioned by @fram.souza14 and that did indeed prevent the pods from restarting. Thank you for the link, that has lots of good info in it.
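
In case it helps anyone later, I believe the probes can also be toggled through the chart values rather than by editing the rendered manifests, roughly like this:

server:
  readinessProbe:
    enabled: false
  livenessProbe:
    enabled: false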

I added the retry_join blocks again as you suggested @maxb . It will probably work once the other 4 pods are unsealed.

What @fram.souza14 outlines is pretty much what I’ve been doing but a different route. I can get vault-0 to migrate successfully, unseal with GCP, and run successfully. That’s not a problem.

I think I’m down to: how can I unseal the last 4 pods? From vault-1 I just tried vault operator unseal -migrate -tls-skip-verify, but no migration seal is found:

/ $ vault operator unseal -migrate -tls-skip-verify
Unseal Key (will be hidden):
Error unsealing: Error making API request.

URL: PUT https://127.0.0.1:8200/v1/sys/unseal
Code: 500. Errors:

* can't perform a seal migration, no migration seal found

From some of the documentation and examples I’ve read, the transit seal is used to unseal the other pods. I’m going to look into this today and see if it helps.

Update: I’ve also been looking through one of the example scripts used in the official raft minikube tutorial, learn-vault-raft/cluster.sh at main · hashicorp/learn-vault-raft · GitHub. It looks like they bring up vault-0 and vault-1 with different configs. This idea has crossed my mind, but I’m not sure how to make it work with the STS. All the pods get the same configmap.

This is a command for a completely different kind of migration to what you are doing. You are migrating your storage from Consul to Raft, whereas this command is for seal migrations, e.g. if you were migrating from the gcpckms seal to the shamir seal.

The transit seal is only used if you are already using the transit seal, but you’re using gcpckms.

Do not follow that example, that is because their first Vault instance is a totally different cluster to the others.

That helps clarify several things. In case something stands out, here is my Vault config:

          ui                 = true
          disable_clustering = false
          disable_mlock      = true

          listener "tcp" {
            tls_disable     = false
            address         = "0.0.0.0:8200"
            cluster_address = "0.0.0.0:8201"
            tls_cert_file   = "/vault/userconfig/vault-ssl-certs/fullchain.pem"
            tls_key_file    = "/vault/userconfig/vault-ssl-certs/privkey.pem"
          }

          seal "gcpckms" {
            project     = "project"
            region      = "us"
            credentials = "/vault/userconfig/vault-gcp-key/gcp-key.json"
            key_ring    = "vault-keyring"
            crypto_key  = "vault"
          }

          storage "raft" {
            path = "/vault/raft"
            retry_join {
              leader_api_addr = "https://vault-0.vault-internal:8200"
            }
            retry_join {
              leader_api_addr = "https://vault-1.vault-internal:8200"
            }
            retry_join {
              leader_api_addr = "https://vault-2.vault-internal:8200"
            }
            retry_join {
              leader_api_addr = "https://vault-3.vault-internal:8200"
            }
            retry_join {
              leader_api_addr = "https://vault-4.vault-internal:8200"
            }
          }

          service_registration "kubernetes" {}

I have deployed the cluster with that configuration and successfully migrated vault-0. It sounds like I need to figure out how to get the last 4 pods unsealed before I can proceed. If they’re not migrated I just don’t understand how I can do that. None of them are initialized.

If I initialize them I can unseal them, but then they’re a standalone raft instance and won’t join to vault-0. It seems like a catch 22, but surely I’m overlooking something. I’m wondering if I need to provide certs in the retry_join block. But I can run wget --no-check-certificate https://vault-0.vault-internal:8200/ui/ from vault-1. I don’t have curl available, but I did check wget --no-check-certificate --post-data 'leader_api_addr=https://vault-0.vault-internal:8200' https://vault-0.vault-internal:8200/v1/sys/storage/raft/join from vault-1 and got a 400 bad request.

Here’s an example calling /sys/health on vault-0 from vault-1 to make sure we have network access.


~ $ hostname
vault-1
~ $ wget --no-check-certificate https://vault-0.vault-internal:8200/v1/sys/health
Connecting to vault-0.vault-internal:8200 (10.200.8.199:8200)
saving to 'health'
health               100% |**************************************************************************************************************************************************************|   295  0:00:00 ETA
'health' saved
~ $ cat health
{"initialized":true,"sealed":false,"standby":false,"performance_standby":false,"replication_performance_mode":"disabled","replication_dr_mode":"disabled","server_time_utc":161655415,"version":"1.12.0","cluster_name":"vault-cluster-b87d9bd7","cluster_id":"c049b162-5048-4425-68f7-3122a70c1138"}

The way it works is that once a join succeeds (either because of retry_join or vault operator raft join), the node attempting to join enters a kind of “provisionally initialised” state, where it has learnt the basic seal configuration from the node it wants to join.

From that state, it attempts to unseal, and finish the join.

You have mentioned wondering whether you need to provide certs in the retry_join block - and yes you do, if the cert being used for the Vault API is one that is not trusted by the default trusted CAs of the OS in the Vault container.

This would certainly prevent your nodes joining.

Documentation on option names in the retry_join block can be found at Integrated Storage - Storage Backends - Configuration | Vault | HashiCorp Developer

Do you think I need to redo the certificates? Right now they are self-signed using something like this,

openssl req -new -newkey rsa:4096 -nodes -keyout ./privkey.pem -out ./snakeoil.csr -subj "..."
openssl x509 -req -sha256 -days 730 -in ./snakeoil.csr -signkey ./privkey.pem -out ./fullchain.pem

These are what the listener has been configured with for quite some time. I thought I could reuse them, but I’m not sure what leader_ca_cert_file should be.

retry_join {
  leader_api_addr         = "https://vault-0.vault-internal:8200"
  leader_client_cert_file = "/vault/userconfig/vault-ssl-certs/fullchain.pem"
  leader_client_key_file  = "/vault/userconfig/vault-ssl-certs/privkey.pem"
}

I’m still seeing the same Gateway Timeout.

The leader_client_* family of options are only relevant if you’re doing TLS with client authentication, which is yet another layer of complexity on top of standard TLS. I don’t think you’re doing that?

IIUC the only setting you should be adding is leader_ca_cert_file pointing to your fullchain.pem.
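
So each retry_join block would end up something like this (assuming the same self-signed cert bundle is mounted at that path in every pod):

retry_join {
  leader_api_addr     = "https://vault-0.vault-internal:8200"
  leader_ca_cert_file = "/vault/userconfig/vault-ssl-certs/fullchain.pem"
}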