Vault restoration issue
Context:
We are running Vault in Kubernetes (GKE) and are trying to restore it from one namespace (vault) to another (vault-backup) in the same cluster using Helm.
The PVs (persistent volumes) were restored from the latest backup and bound to the PVCs used by the Vault instance deployed in the vault-backup namespace.
We have confirmed that the volumes are successfully mounted on all three StatefulSet pods.
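For reference, the binding and the mount can be double-checked with standard kubectl commands along these lines (namespace and pod names match our setup):

kubectl -n vault-backup get pvc
kubectl -n vault-backup exec vault-backup-0 -- ls /vault/data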
Environment:
Vault Server Version: 1.6.1
Server Operating System/Architecture: Kubernetes (1.19)
Describe the bug:
During the restore, the Vault pods try to use the stale peer information stored in the Raft configuration, which does not match the names of the new endpoints.
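For illustration, the stale membership is also visible from the Vault CLI; a command along these lines (run inside one of the new pods with a valid token) would be expected to list the old vault-N.vault-internal peers:

vault operator raft list-peers -tls-skip-verify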
Reproducing the bug:
Restore the PVs from the latest backup and bind them to PVCs in the new namespace:
velero-clone-50d5dac3-f743-4ee7-8f3b-f9ce4df7d51d 10Gi RWO Delete Bound vault-backup/data-vault-backup-1 standard 20h
velero-clone-822b8475-a13b-4b9c-a867-5edd16502249 10Gi RWO Delete Bound vault-backup/data-vault-backup-0 standard 20h
velero-clone-e0d7dcc6-6975-4abc-9212-822d27c945bc 10Gi RWO Delete Bound vault-backup/data-vault-backup-2 standard 20h
Deploy the Helm chart in the new namespace using a different release name. The chart configuration is standard and uses this Helm chart (an example install command is shown below).
Helm release name: vault-backup
Namespace: vault-backup
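For completeness, the deploy itself is a standard chart install, roughly:

helm install vault-backup hashicorp/vault -n vault-backup -f values.yaml

where values.yaml stands for our values file containing the HA block below.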
Helm chart config for the HA block:
ha:
  enabled: true
  raft:
    enabled: true
    config: |
      ui = true
      disable_mlock = true

      listener "tcp" {
        address = "[::]:8443"
        tls_disable = false
        tls_disable_client_certs = true
        tls_cert_file = "/vault/userconfig/vault-external-tls/tls.crt"
        tls_key_file = "/vault/userconfig/vault-external-tls/tls.key"
      }

      listener "tcp" {
        address = "[::]:8200"
        cluster_address = "[::]:8201"
        tls_disable = false
        tls_disable_client_certs = true
        tls_cert_file = "/vault/userconfig/vault-internal-tls/tls.crt"
        tls_key_file = "/vault/userconfig/vault-internal-tls/tls.key"
      }

      storage "raft" {
        path = "/vault/data"

        retry_join {
          leader_api_addr = "https://vault-backup-0.vault-backup-internal:8200"
          leader_client_cert_file = "/vault/userconfig/vault-internal-tls/tls.crt"
          leader_client_key_file = "/vault/userconfig/vault-internal-tls/tls.key"
          leader_ca_cert_file = "/vault/userconfig/vault-internal-tls/ca.crt"
        }

        retry_join {
          leader_api_addr = "https://vault-backup-1.vault-backup-internal:8200"
          leader_client_cert_file = "/vault/userconfig/vault-internal-tls/tls.crt"
          leader_client_key_file = "/vault/userconfig/vault-internal-tls/tls.key"
          leader_ca_cert_file = "/vault/userconfig/vault-internal-tls/ca.crt"
        }

        retry_join {
          leader_api_addr = "https://vault-backup-2.vault-backup-internal:8200"
          leader_client_cert_file = "/vault/userconfig/vault-internal-tls/tls.crt"
          leader_client_key_file = "/vault/userconfig/vault-internal-tls/tls.key"
          leader_ca_cert_file = "/vault/userconfig/vault-internal-tls/ca.crt"
        }
      }

      seal "gcpckms" {}

      telemetry {}

      service_registration "kubernetes" {}
Pod status:
vault-backup-0 1/1 Running 0 39m
vault-backup-1 1/1 Running 0 39m
vault-backup-2 1/1 Running 0 39m
vault-backup-agent-injector-77d776745-v2c2x 1/1 Running 0 19h
vault-init-69dd576895-57rms 1/1 Running 0 18h
Vault status:
This is the status from one of the pods:
vault status -tls-skip-verify
Key Value
--- -----
Recovery Seal Type shamir
Initialized true
Sealed false
Total Recovery Shares 1
Threshold 1
Version 1.6.1
Storage Type raft
Cluster Name vault-cluster-xyz
Cluster ID 1234
HA Enabled true
HA Cluster https://vault-0.vault-internal:8201
HA Mode standby
Active Node Address https://242.0.1.231:8200
Raft Committed Index 638221
Raft Applied Index 638221
The HA Cluster address here points to: vault-0.vault-internal
It should instead point to: vault-backup-0.vault-backup-internal
The Active Node Address is also a stale IP that no longer exists.
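The same stale entries can also be read back from the Raft configuration API; for example, something along these lines from inside a pod (token omitted):

curl -sk --header "X-Vault-Token: $VAULT_TOKEN" https://127.0.0.1:8200/v1/sys/storage/raft/configuration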
Pod logs:
The vault-backup-0 pod logs show:
2021-08-18T10:38:39.659Z [ERROR] storage.raft: failed to make requestVote RPC: target="{Voter c2fdb324-ff51-efaf-ec71-0ad82ace6bb0 vault-1.vault-internal:8201}" error="dial tcp: lookup vault-1.vault-internal on 243.0.0.10:53: no such host"
2021-08-18T10:38:39.674Z [ERROR] storage.raft: failed to make requestVote RPC: target="{Voter 51236593-00f3-4932-e37e-f4d035f0235b vault-2.vault-internal:8201}" error="dial tcp: lookup vault-2.vault-internal on 243.0.0.10:53: no such host"
Raft storage volume:
The mounted /vault/data volume keeps the old peer configuration, even though the pod has the correct environment variables (see the peers.json sketch after the dump below).
/vault/data $ grep -r 'vault-internal' *
raft/raft.db:/LastVoteCandvault-2.vault-internal:8201LastVoteTerm
raft/raft.db:/LastVoteCandvault-2.vault-internal:8201LastVoteTerm
raft/raft.db:/LastVoteCandvault-2.vault-internal:8201LastVoteTerm
vault.db:latest_confi C$c7d12e1a-ab43-32f6-39b3-c9c8d419b11dault-0.vault-internal:8201C$c2fdb324-ff51-efaf-ec71-0ad82ace6bb0ault-1.vault-internal:8201C$51236593-00f3-4932-e37e-f4d035f0235bault-2.vault-internal:8201latest_indexe&data
vault.db:latest_confi C$c7d12e1a-ab43-32f6-39b3-c9c8d419b11dault-0.vault-internal:8201C$c2fdb324-ff51-efaf-ec71-0ad82ace6bb0ault-1.vault-internal:8201C$51236593-00f3-4932-e37e-f4d035f0235bault-2.vault-internal:8201latest_indexe&data
...
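For reference, the documented peers.json recovery for integrated storage looks like the relevant mechanism here. A rough sketch of what we assume the file would contain in our case, written to each data volume while Vault is stopped (node IDs are taken from the vault.db dump above; the old-to-new address mapping and the exact file format for 1.6.x are assumptions on our side):

cat > /vault/data/raft/peers.json <<'EOF'
[
  { "id": "c7d12e1a-ab43-32f6-39b3-c9c8d419b11d", "address": "vault-backup-0.vault-backup-internal:8201" },
  { "id": "c2fdb324-ff51-efaf-ec71-0ad82ace6bb0", "address": "vault-backup-1.vault-backup-internal:8201" },
  { "id": "51236593-00f3-4932-e37e-f4d035f0235b", "address": "vault-backup-2.vault-backup-internal:8201" }
]
EOF

Whether this (or something else) is the supported way to rewrite the peer addresses is essentially what we are asking below.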
Questions:
What commands do we need to run to replace the old hostnames/service names with the new ones in our namespace?
Is this approach to restoration (into a new namespace) recommended, and has it been tested by anyone else?