Hi,
I am trying to take a snapshot of a live 3-node Vault cluster with Raft storage, and restore it onto a single DR node on a different IP address. It’s in a different data centre, and the data changes only rarely, so a static snapshot is fine.
However, I have got stuck on getting the DR instance to come up on its new IP address after restoring the snapshot, which has the old Raft peer IPs in it. I’ve been through some forum posts, which took me to:
- documentation for backup and restore of Vault · Issue #5683 · hashicorp/vault · GitHub
- Backup - Restore - #2 by mikegreen
- Raft snapshot restore issue - #5 by awidner
- How to recover from permanently lost quorum while using Raft integrated storage with Vault. – HashiCorp Help Center
The test environment is all built in VMs. My main cluster is on 10.0.0.104 / 108 / 109, and the DR node is 192.0.2.51. Steps done so far:
- Take snapshot on main active node 10.0.0.104 (node_id “vault-dev1”)
- Install Vault on the DR node, with a new vault-conf.hcl that uses its own IP address:
storage "raft" {
  path = "/opt/vault-dev/data"
  node_id = "vault-dev1"
}
cluster_addr = "https://192.0.2.51:18201"
api_addr = "https://192.0.2.51:18200"
disable_mlock = "true"
ui = "true"
listener "tcp" {
  address = "192.0.2.51:18200"
  tls_min_version = "tls10"
  tls_cert_file = "/opt/vault-dev/certificates/vault-dev1.cert"
  tls_key_file = "/opt/vault-dev/certificates/vault-dev1.key"
}
- To restore the snapshot, I need vault to be running, initialized and unsealed (with a temporary key)
[root@drvault vault-dev]# /opt/vault-dev/vault operator raft snapshot restore ~/202102041017.snap
Error installing the snapshot: Post "https://127.0.0.1:8200/v1/sys/storage/raft/snapshot": dial tcp 127.0.0.1:8200: connect: connection refused
[root@drvault vault-dev]# systemctl start vault-dev
[root@drvault vault-dev]# export VAULT_ADDR=https://192.0.2.51:18200
[root@drvault vault-dev]# /opt/vault-dev/vault operator raft snapshot restore ~/202102041017.snap
Error installing the snapshot: Post "https://192.0.2.51:18200/v1/sys/storage/raft/snapshot": x509: certificate is valid for 10.0.0.104, not 192.0.2.51
[root@drvault vault-dev]# export VAULT_SKIP_VERIFY=1
[root@drvault vault-dev]# /opt/vault-dev/vault operator raft snapshot restore ~/202102041017.snap
Error installing the snapshot: Error making API request.
URL: POST https://192.0.2.51:18200/v1/sys/storage/raft/snapshot
Code: 503. Errors:
* Vault is sealed
[root@drvault vault-dev]# /opt/vault-dev/vault operator init -key-shares=5 -key-threshold=2
... note the results
[root@drvault vault-dev]# /opt/vault-dev/vault operator unseal
Unseal Key (will be hidden):
...
[root@drvault vault-dev]# /opt/vault-dev/vault operator unseal
Unseal Key (will be hidden):
...
[root@drvault vault-dev]# /opt/vault-dev/vault operator raft snapshot restore ~/202102041017.snap
Error installing the snapshot: Error making API request.
URL: POST https://192.0.2.51:18200/v1/sys/storage/raft/snapshot
Code: 400. Errors:
* missing client token
[root@drvault vault-dev]# /opt/vault-dev/vault login
Token (will be hidden):
Success! You are now authenticated.
...
[root@drvault vault-dev]# /opt/vault-dev/vault operator raft snapshot restore ~/202102041017.snap
Error installing the snapshot: Error making API request.
URL: POST https://192.0.2.51:18200/v1/sys/storage/raft/snapshot
Code: 400. Errors:
* could not verify hash file, possibly the snapshot is using a different set of unseal keys; use the snapshot-force API to bypass this check
[root@drvault vault-dev]# /opt/vault-dev/vault operator raft snapshot restore --force ~/202102041017.snap
[root@drvault vault-dev]#
- So far, so good. Next, create a peers.json file:
[
  {
    "id": "vault1-dev",
    "address": "192.0.2.51:18201",
    "non_voter": false
  }
]
and restart the server. However, when I do, I find that Vault still tries to contact the original node 10.0.0.104, even though it has clearly picked up peers.json:
Feb 04 11:11:56 drvault systemd[1]: Started Vault secret store.
Feb 04 11:11:56 drvault vault[1043]: ==> Vault server configuration:
Feb 04 11:11:56 drvault vault[1043]: Api Address: https://192.0.2.51:18200
Feb 04 11:11:56 drvault vault[1043]: Cgo: disabled
Feb 04 11:11:56 drvault vault[1043]: Cluster Address: https://192.0.2.51:18201
Feb 04 11:11:56 drvault vault[1043]: Go Version: go1.15.7
Feb 04 11:11:56 drvault vault[1043]: Listener 1: tcp (addr: "192.0.2.51:18200", cluster address: "192.0.2.51:18201", max_request_duration: "1m30s", max_request_size: "33554432", tls: "enabled")
Feb 04 11:11:56 drvault vault[1043]: Log Level: info
Feb 04 11:11:56 drvault vault[1043]: Mlock: supported: true, enabled: false
Feb 04 11:11:56 drvault vault[1043]: Recovery Mode: false
Feb 04 11:11:56 drvault vault[1043]: Storage: raft (HA available)
Feb 04 11:11:56 drvault vault[1043]: Version: Vault v1.6.2
Feb 04 11:11:56 drvault vault[1043]: Version Sha: be65a227ef2e80f8588b3b13584b5c0d9238c1d7
Feb 04 11:11:56 drvault vault[1043]: ==> Vault server started! Log data will stream in below:
Feb 04 11:11:56 drvault vault[1043]: 2021-02-04T11:11:56.316Z [INFO] proxy environment: http_proxy= https_proxy= no_proxy=
Feb 04 11:11:56 drvault vault[1043]: 2021-02-04T11:11:56.350Z [INFO] storage.raft.snapshot: reaping snapshot: path=/opt/vault-dev/data/raft/snapshots/3-5057-1612436954856
Feb 04 11:15:04 drvault vault[1043]: 2021-02-04T11:15:04.358Z [INFO] core.cluster-listener.tcp: starting listener: listener_address=192.0.2.51:18201
Feb 04 11:15:04 drvault vault[1043]: 2021-02-04T11:15:04.358Z [INFO] core.cluster-listener: serving cluster requests: cluster_listen_address=192.0.2.51:18201
Feb 04 11:15:04 drvault vault[1043]: 2021-02-04T11:15:04.362Z [INFO] storage.raft: raft recovery initiated: recovery_file=peers.json
Feb 04 11:15:04 drvault vault[1043]: 2021-02-04T11:15:04.367Z [INFO] storage.raft: raft recovery found new config: config="{[{Voter vault1-dev 192.0.2.51:18201}]}"
Feb 04 11:15:04 drvault vault[1043]: 2021-02-04T11:15:04.390Z [INFO] storage.raft: raft recovery deleted peers.json
Feb 04 11:15:04 drvault vault[1043]: 2021-02-04T11:15:04.396Z [INFO] storage.raft: initial configuration: index=1 servers="[{Suffrage:Voter ID:vault1-dev Address:192.0.2.51:18201}]"
Feb 04 11:15:04 drvault vault[1043]: 2021-02-04T11:15:04.398Z [INFO] core: vault is unsealed
Feb 04 11:15:04 drvault vault[1043]: 2021-02-04T11:15:04.398Z [INFO] core: entering standby mode
Feb 04 11:15:04 drvault vault[1043]: 2021-02-04T11:15:04.398Z [INFO] storage.raft: entering follower state: follower="Node at 192.0.2.51:18201 [Follower]" leader=
Feb 04 11:15:13 drvault vault[1043]: 2021-02-04T11:15:13.682Z [WARN] storage.raft: not part of stable configuration, aborting election
Feb 04 11:15:15 drvault vault[1043]: 2021-02-04T11:15:15.906Z [ERROR] core: error during forwarded RPC request: error="rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 10.0.0.104:18201: connect: no route to host""
Feb 04 11:15:15 drvault vault[1043]: 2021-02-04T11:15:15.907Z [ERROR] core: forward request error: error="error during forwarding RPC request"
Feb 04 11:17:30 drvault vault[1043]: 2021-02-04T11:17:30.298Z [ERROR] storage.raft: failed to take snapshot: error="nothing new to snapshot"
As a result, the Raft cluster doesn’t come up; the node still thinks it needs to talk to 10.0.0.104:
[root@drvault vault-dev]# /opt/vault-dev/vault operator raft list-peers
Error reading the raft cluster configuration: Get "https://10.0.0.104:18200/v1/sys/storage/raft/configuration": dial tcp 10.0.0.104:18200: connect: no route to host
[root@drvault vault-dev]# /opt/vault-dev/vault status
Key                     Value
---                     -----
Seal Type               shamir
Initialized             true
Sealed                  false
Total Shares            5
Threshold               2
Version                 1.6.2
Storage Type            raft
Cluster Name            vault-cluster-8cdcefbe
Cluster ID              4dd2d930-121b-c897-57c2-7f4cfe983099
HA Enabled              true
HA Cluster              https://10.0.0.104:18201
HA Mode                 standby
Active Node Address     https://10.0.0.104:18200
Raft Committed Index    5060
Raft Applied Index      5060
It seems like I need to perform the “recover from permanently lost quorum” process whilst changing the stored IP address of the peer at the same time.
I did come across recovery mode, but couldn’t find any examples of how to use it, in particular which commands make use of /sys/raw. When running the server in recovery mode, all the commands I tried, including vault operator raft snapshot restore, give a 404 error. In any case, I’d prefer to perform disaster recovery using just the standard unseal keys, and not rely on having access to a recovery token which could have been misplaced.
Any clues as to where to go next?
Thanks in advance!
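P.S. In case anyone wants to reproduce the peers.json step, here is a minimal sketch of how I would script it. The function name write_peers_json is made up for illustration, and the raft directory is an assumption based on the snapshot path in the log above (/opt/vault-dev/data/raft). The id and address are simply whatever is passed in; in my case they are the values from the file shown earlier.

```shell
#!/bin/sh
# Sketch: write a single-voter peers.json for Raft recovery.
# write_peers_json is a hypothetical helper, not part of Vault.
# Usage: write_peers_json RAFT_DIR NODE_ID CLUSTER_ADDR
write_peers_json() {
    if [ "$#" -ne 3 ]; then
        echo "usage: write_peers_json RAFT_DIR NODE_ID CLUSTER_ADDR" >&2
        return 1
    fi
    raft_dir=$1      # assumed: the "raft" subdirectory of the storage path
    node_id=$2       # id to write into the file
    cluster_addr=$3  # host:port of the cluster address, without the scheme

    # One voter entry, same shape as the peers.json shown in the post.
    cat > "$raft_dir/peers.json" <<EOF
[
  {
    "id": "$node_id",
    "address": "$cluster_addr",
    "non_voter": false
  }
]
EOF
}
```

With Vault stopped, I would call it as write_peers_json /opt/vault-dev/data/raft vault1-dev 192.0.2.51:18201 and then run systemctl start vault-dev again.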