Failed to Read Raft Snapshot File

ppestinger · July 22, 2021, 3:43pm

My organization is running an HA Vault cluster in AWS, using EC2 instances across three availability zones. Usually we do not have any significant difficulties rolling the EC2 instances for patching or other updates, but when problems do arise we’ve been able to restore the cluster from our Raft snapshots, which we take every four hours.

Last night, however, our most recent Raft snapshots were unable to restore successfully. The vault operator raft snapshot restore -force vault-backup.snap command returned a context deadline exceeded error and checking the vault-error.log file showed this:

[ERROR] storage.raft.snapshot: failed to close snapshot decompressor: error="client disconnected"
2021-07-21T23:36:09.985-0500 [ERROR] core: raft snapshot restore: failed to write snapshot: error="failed to read snapshot file: failed to read or write snapshot data: client disconnected"

We were eventually able to restore the cluster, from a snapshot taken last month, and have been working to replace our missing data and configurations. At this time, we suspect that the problem is that our recent snapshots are too large (the ones taken yesterday are about 2.5 GB), likely due to an excessive number of open leases.

While we are focusing on getting our Vault performance streamlined, to hopefully cut down on the size of our snapshots going forward, we’ve been able to successfully restore from large snapshots in the past and the snapshot that did work last night was about 1.9 GB.

Is there a maximum snapshot size that Raft can successfully restore from? If not, is it possible to change the configuration on our cluster to allow for larger snapshots to be restored, such that it won’t run into a "failed to read snapshot file: failed to read or write snapshot data: client disconnected" error?

mikegreen · July 22, 2021, 4:05pm

Yeah, you need to resolve that ASAP. A snapshot that size is huge and will bite you later. But it seems you know that already, which is good.

How long is the error showing up after you start the command? It is probably timing out.
Change your VAULT_CLIENT_TIMEOUT to something higher and retry.

ppestinger · July 22, 2021, 6:52pm

Ah-ha! Yes, extending the VAULT_CLIENT_TIMEOUT let the command run long enough to fully process and restore the snapshot. Doubling it to 120 seconds was enough, though I probably could’ve set it to ~90 and still gotten it to complete.

Thank you!
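
For anyone who lands here with the same error, the fix above can be sketched as a small shell helper. This is only a sketch: restore_with_timeout is a hypothetical function name, not something Vault ships, and the 120s value is simply the one that worked in this thread (double the 60-second default). The helper sets VAULT_CLIENT_TIMEOUT for the restore command only and leaves the rest of the shell session untouched.

```shell
#!/bin/sh
# Hypothetical helper (not part of Vault): run a Raft snapshot restore with a
# longer client-side timeout than the default.
restore_with_timeout() {
    timeout="$1"    # e.g. 120s, the value that worked in this thread
    snapshot="$2"   # e.g. vault-backup.snap
    (
        # Export inside a subshell so the rest of the session keeps its own setting.
        export VAULT_CLIENT_TIMEOUT="$timeout"
        vault operator raft snapshot restore -force "$snapshot"
    )
}

# Usage (commented out so that sourcing this file does not touch a cluster):
# restore_with_timeout 120s vault-backup.snap
```

If the restore still ends with context deadline exceeded, the timeout is probably still too short for the snapshot size, so raise it further and retry.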
