"consul snapshot save" is proving to be unreliable

We have a 3-node cluster running Consul and Vault. I’ve got a script that runs every hour to take a snapshot of the Consul data for disaster recovery.
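For context, the script is essentially a thin wrapper around “consul snapshot save”. Roughly this sketch (the paths and file naming here are illustrative, not our exact values):

```shell
#!/bin/sh
# Hourly Consul snapshot for disaster recovery.
# SNAP_DIR and the file naming are illustrative; adjust to taste.
set -u

SNAP_DIR="${SNAP_DIR:-/var/backups/consul}"
TS=$(date -u +%Y%m%dT%H%M%SZ)         # UTC timestamp, e.g. 20201115T030000Z
SNAP_FILE="$SNAP_DIR/consul-$TS.snap"

if command -v consul >/dev/null 2>&1; then
    mkdir -p "$SNAP_DIR"
    # Talks to the local agent via the default HTTP address.
    consul snapshot save "$SNAP_FILE"
    # Verify the snapshot is readable before trusting it as a backup.
    consul snapshot inspect "$SNAP_FILE"
else
    echo "consul CLI not found; nothing to do" >&2
fi
```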

Recently, the “consul snapshot save” command has been failing intermittently. I upgraded Consul at the weekend to version 1.8.6 and made sure the script talks to the Consul agent running on the same host, but we’re still getting errors like this:

Error verifying snapshot file: failed to read snapshot file: failed to read or write snapshot data: unexpected EOF

Does it matter that I’m trying to use “consul snapshot save” on a follower rather than the leader?

Is there anything else I should be checking to stop errors like this from happening?

I think in your case you should use the -stale flag.
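By default “consul snapshot save” asks for a consistent snapshot, which only the current leader can provide, so a request sent to a follower gets forwarded over the network. Passing -stale lets the server you are actually talking to answer from its own state, at the cost of possibly missing the very latest writes, which is normally acceptable for a disaster-recovery backup. A minimal example (the guard just keeps the sketch safe to run on a machine without the consul binary):

```shell
# Allow the local (possibly non-leader) server to service the snapshot
# request itself instead of forwarding it to the leader.
if command -v consul >/dev/null 2>&1; then
    consul snapshot save -stale backup.snap
fi
```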

Two things that would be useful to know:

  • How big is the snapshot when it does successfully generate?
  • How long does it take to generate it?

Each snapshot is around 655 MB.

It took 2m 11s to create a snapshot.

Interestingly, using -stale caused it to take 3m 3s.

I’ve since discovered that our use of Vault has been generating a lot of AWS IAM auth tokens that were never revoked when we finished with them, and the accumulated leases are what caused the snapshots to grow to this size. Cleaning up the tokens has brought the backups down to single-digit-MB files.
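In case it helps anyone else, the stale leases can be inspected and revoked in bulk with the Vault CLI. This assumes the AWS secrets engine is mounted at the default aws/ path (adjust if yours differs); the guard just keeps the sketch runnable where vault is not installed:

```shell
if command -v vault >/dev/null 2>&1; then
    # List the outstanding lease IDs under the AWS secrets engine.
    vault list sys/leases/lookup/aws/creds/

    # Revoke every lease under the mount in one go.
    vault lease revoke -prefix aws/
fi
```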