Data Migration + Seal Migration --> Raft Snapshot (Bug?)

I still think there is a bug in this.

A summary of what we have tried:

Ultimately, the goal we were trying to achieve is captured almost perfectly by the following brief blog post. Note: the author of the blog post uses Vault 1.6.0, while we are using 1.8.1. HashiCorp Vault - Raft Storage Snapshot Recovery - IT Insights Blog

Important Note: We initially tried using the open source vault snapshot agent (basically the same idea as the automated snapshot feature in Vault Enterprise), which takes snapshots at a given interval based on a policy and stores them somewhere (S3 in our case), so that you can restore from those snapshots at your leisure. This is the step-by-step guide we followed; the only difference is that we used our own S3 storage location instead of the MinIO Docker container described in the post. How to Backup a HashiCorp Vault Integrated Storage Cluster with MinIO | by Nicolas Ehrman | HashiCorp Solutions Engineering Blog | Medium

The production cluster the snapshots were taken from is a 3-node cluster with Raft storage that is auto-unsealed via transit. Important Note: this cluster is the product of not only a standard data migration from a Consul-backed legacy Vault cluster that was sunset recently, but also a seal migration from Shamir to auto-unsealing via a single transit node servicing the 3-node production Raft cluster.

^ We suspect this could be the root cause of the snapshot restoration issues and odd error messages. We were unable to find any information on how these migrations might affect the underlying Raft database, its ability to save snapshots (without corruption or other unexpected behavior), or the ability to restore those snapshots to a new “DR” or backup cluster in a separate region that does not share the history of data and seal migrations that “prod cluster A” went through.

  • In taking the snapshots, as well as restoring them (or attempting to), we followed the instructions in the documentation meticulously, including little gotchas such as passing the -force flag to the restore command, which effectively uses an entirely separate HTTP API endpoint for the restoration.

  • We have tried manually taking snapshots in a few different ways: locally from the leader node using the CLI, remotely from a local terminal with VAULT_ADDR pointed to the leader node IP, remotely from a local terminal with VAULT_ADDR pointed to the load balancer, and from the UI by viewing the Raft cluster and using the snapshot dropdown button.

  • We can’t restore this snapshot onto a DR cluster with the exact same Raft backend configuration, node configuration, etc., because there appears to be some sort of data corruption within the current prod cluster when it tries to write the snapshot.

  • Command to save snapshot: vault operator raft snapshot save FILENAME.snap

  • We copied the snapshot over to the new cluster in a variety of ways: directly over SCP, indirectly via S3 as an intermediary, and indirectly by downloading the snapshot to our local machines and then uploading it to the new host.

  • Command to restore snapshot: vault operator raft snapshot restore -force FILENAME.snap

  • Initial efforts produced an error message indicating that an unexpected EOF was encountered when trying to restore on the new cluster. This is not a very helpful error message for diagnostics.

  • If we retrieve the snapshot from production using the Raft Storage UI → Download Snapshot functionality as root, we cannot expand the snapshot from its .gz format (neither by double-clicking it in Finder nor by extracting it with the tar utility). We get different error messages depending on the extraction method. From the terminal with tar, we get the following screenshot:

  • Expanding that .gz archive does not show the SHA256SUM files; only state.bin and meta.json. The meta.json file looks incomplete.

  • From the GUI, we get a different, unfamiliar error that we have not been able to find much information about online (screenshot below):

We are incredibly thankful to anyone offering to help us out, and I’m happy to demonstrate some of this in a live demo if that would help. This is not an exhaustive list of all the permutations of the snapshot/restore process that we have tried, but it covers most of them.

Sincerely,
Cameron

Sounds like you may be hitting the issue that “Add code to api.RaftSnapshot to detect incomplete snapshots” by ncabatoff (Pull Request #12388, hashicorp/vault on GitHub; available in 1.8.3) is there to detect, whereby incomplete Raft snapshots are returned by the CLI without any error being reported. The cause is typically that your auto-unseal isn’t working reliably, e.g. you use transit auto-unseal and the token it uses has expired. A working seal is necessary to produce snapshots.
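
In the meantime, it may help to sanity-check a snapshot file before attempting a restore. The following is only a minimal sketch in Python, not anything taken from Vault itself. It assumes the snapshot is a gzip-compressed tar archive (as the .gz download mentioned above suggests) containing meta.json, state.bin, and a checksum file that is assumed here to be named SHA256SUMS, with one "<hex digest>  <file name>" entry per line. The function name inspect_snapshot is made up for illustration.

```python
import hashlib
import tarfile
import zlib

# Assumption: meta.json and state.bin are the names mentioned in the post;
# the checksum file name SHA256SUMS is a guess and may differ.
EXPECTED_MEMBERS = ("meta.json", "state.bin", "SHA256SUMS")


def inspect_snapshot(path):
    """Return a list of problems found in a gzip-compressed tar snapshot."""
    problems = []
    digests = {}    # member name -> hex SHA-256 of its fully read contents
    sums_text = ""  # decoded contents of the checksum file, if present

    try:
        with tarfile.open(path, mode="r:gz") as tar:
            for info in tar:
                stream = tar.extractfile(info)
                if stream is None:  # not a regular file, nothing to hash
                    continue
                hasher = hashlib.sha256()
                kept = []
                while True:
                    chunk = stream.read(1 << 20)  # read in 1 MiB pieces
                    if not chunk:
                        break
                    hasher.update(chunk)
                    if info.name == "SHA256SUMS":
                        kept.append(chunk)
                # Only members that were read to the end are recorded.
                digests[info.name] = hasher.hexdigest()
                if info.name == "SHA256SUMS":
                    sums_text = b"".join(kept).decode("utf-8", "replace")
    except (EOFError, OSError, tarfile.TarError, zlib.error) as exc:
        # A truncated download typically ends up here (unexpected EOF).
        problems.append(f"archive is truncated or unreadable: {exc!r}")

    for name in EXPECTED_MEMBERS:
        if name not in digests:
            problems.append(f"missing or unreadable member: {name}")

    # Assumption: each checksum line looks like "<hex digest>  <file name>".
    for line in sums_text.splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue
        expected, name = parts
        if name in digests and digests[name] != expected:
            problems.append(f"checksum mismatch: {name}")

    return problems
```

Calling inspect_snapshot("FILENAME.snap") and getting an empty list back means the archive could be read end to end and the recorded checksums matched. A truncated file should show up as an unreadable archive and/or missing members. Note that this only checks the archive container; it says nothing about whether Vault can actually decrypt and restore the data.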