My organization runs an HA Vault cluster in AWS on EC2 instances spread across three availability zones. We usually have no significant difficulty rolling the EC2 instances for patching or other updates, and when problems do arise we have been able to restore the cluster from our Raft snapshots, which we take every four hours.
Last night, however, our most recent Raft snapshots failed to restore. The command
vault operator raft snapshot restore -force vault-backup.snap
returned a "context deadline exceeded" error, and the vault-error.log file showed this:
[ERROR] storage.raft.snapshot: failed to close snapshot decompressor: error="client disconnected"
2021-07-21T23:36:09.985-0500 [ERROR] core: raft snapshot restore: failed to write snapshot: error="failed to read snapshot file: failed to read or write snapshot data: client disconnected"
We were eventually able to restore the cluster from a snapshot taken last month, and have been working to replace our missing data and configurations. At this time, we suspect that our recent snapshots are too large (the ones taken yesterday are about 2.5 GB), likely because of an excessive number of open leases.
We are focusing on streamlining our Vault performance, which should reduce the size of our snapshots going forward. That said, we have successfully restored from large snapshots in the past, and the snapshot that did work last night was about 1.9 GB.
Is there a maximum snapshot size that Raft can successfully restore from? If not, is it possible to change the configuration on our cluster to allow larger snapshots to be restored, so that the restore does not run into a "failed to read snapshot file: failed to read or write snapshot data: client disconnected" error?