Failed to Read Raft Snapshot File

My organization is running an HA Vault cluster in AWS, using EC2 instances across three availability zones. Usually we have no significant difficulty rolling the EC2 instances for patching or other updates, and when problems do arise we’ve been able to restore the cluster from our Raft snapshots, which we take every four hours.

Last night, however, our most recent Raft snapshots failed to restore. The command

vault operator raft snapshot restore -force vault-backup.snap

returned a "context deadline exceeded" error, and the vault-error.log file showed this:

[ERROR] storage.raft.snapshot: failed to close snapshot decompressor: error="client disconnected"
2021-07-21T23:36:09.985-0500 [ERROR] core: raft snapshot restore: failed to write snapshot: error="failed to read snapshot file: failed to read or write snapshot data: client disconnected"

We were eventually able to restore the cluster from a snapshot taken last month, and have been working to replace our missing data and configurations. At this time, we suspect the problem is that our recent snapshots are too large (the ones taken yesterday are about 2.5 GB), likely due to an excessive number of open leases.

We are working on streamlining our Vault usage to cut down the size of our snapshots going forward. That said, we’ve successfully restored from large snapshots in the past, and the snapshot that did work last night was about 1.9 GB.

Is there a maximum snapshot size that Raft can successfully restore from? If not, is it possible to change the configuration on our cluster to allow for larger snapshots to be restored, such that it won’t run into a "failed to read snapshot file: failed to read or write snapshot data: client disconnected" error?

Yeah, need to resolve that ASAP. That’s huge and will bite you later. But it seems you know that already, which is good.

How long after you start the command does the error show up? It is probably timing out.
Set VAULT_CLIENT_TIMEOUT to something higher than the 60-second default and retry.

Ah-ha! Yes, extending the VAULT_CLIENT_TIMEOUT let the command run long enough to fully process and restore the snapshot. Doubling it to 120 seconds was enough, though I probably could’ve set it to ~90 and still gotten it to complete.
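
For anyone landing here later, a minimal sketch of the fix in a shell session. The 120s value and the snapshot filename come from this thread. The restore_snapshot helper is a hypothetical wrapper added only to show how long the restore takes, so treat it as an illustration rather than the exact commands that were run.

```shell
# Raise the Vault CLI's client timeout (the default is 60s).
# 120s worked for the ~2.5 GB snapshot in this thread; larger
# snapshots may need more.
export VAULT_CLIENT_TIMEOUT=120s

# Hypothetical helper: run the restore and report how long it took,
# so you can see how close it came to the timeout.
restore_snapshot() {
  snapshot_file="$1"
  start=$(date +%s)
  vault operator raft snapshot restore -force "$snapshot_file"
  status=$?
  end=$(date +%s)
  echo "restore exited with status $status after $((end - start))s"
  return "$status"
}
```

With that in place, running restore_snapshot vault-backup.snap restores the snapshot and prints the elapsed time. If the elapsed time is close to the timeout, bump VAULT_CLIENT_TIMEOUT again before the snapshots grow any further.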

Thank you!