Error taking snapshot

Hello, I recently started getting the following error when running a backup script that authenticates with an AppRole role-id and secret-id, and I am not sure what the problem is.

Error taking the snapshot: incomplete snapshot, unable to read SHA256SUMS.sealed file

It happens when I take a backup by running this command:

vault operator raft snapshot save /var/lib/vault/snapshots/backup.snapshot


Sounds like the user you’re running the snapshot command as doesn’t have read access to the directory where Vault is storing its data. Alternatively, it may not have write access to /var/lib/vault/snapshots.

Hi @rwilliams-devmon ,

This is a consequence of hashicorp/vault pull request #12388, “Add code to api.RaftSnapshot to detect incomplete snapshots” (by ncabatoff). You were probably already having these issues, but now we’re detecting them. Your autoseal is probably failing, maybe (guessing) because it’s transit and the token has expired?

The token for the backup is generated by an approle using a role-id and secret-id, so a new temp token gets created every time we run a backup. I verified that the token is generated and working correctly.
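For reference, the flow described here can be sketched roughly as follows. This is a minimal sketch and not the actual backup script: the function name `take_snapshot`, the `ROLE_ID` and `SECRET_ID` environment variables, and the `.partial` suffix are assumptions for illustration. It saves to a temporary name first so that a failed save never replaces the last good backup.

```shell
# Sketch: log in with AppRole, then save a Raft snapshot.
# ROLE_ID and SECRET_ID are assumed to be set in the environment.
take_snapshot() {
  local dest tmp token
  dest="${1:-/var/lib/vault/snapshots/backup.snapshot}"
  tmp="${dest}.partial"

  # Exchange the role-id/secret-id pair for a short-lived client token.
  token="$(vault write -field=token auth/approle/login \
    role_id="$ROLE_ID" secret_id="$SECRET_ID")" || {
    echo "AppRole login failed" >&2
    return 1
  }

  # A non-zero exit here includes the "incomplete snapshot" error.
  if ! VAULT_TOKEN="$token" vault operator raft snapshot save "$tmp"; then
    echo "snapshot save failed; discarding $tmp" >&2
    rm -f "$tmp"
    return 1
  fi

  # Only a successful save replaces the previous backup.
  mv "$tmp" "$dest"
  echo "snapshot written to $dest"
}
```

Calling `take_snapshot` with no argument uses the path from the original command; passing a path overrides it.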

Our transit auto-unseal is currently broken, though: the token is not being renewed after 32 days because its max TTL has been reached. Usually the Vault cluster fails when this happens, but thankfully it has not this time, which is weird.

What is the best way to set up transit auto-unseal? I keep getting it wrong, and whenever this happens our entire environment loses access to Vault credentials. There used to be a tutorial on HashiCorp Learn that walked you through the process, but it is no longer there.

Thanks for the response. When I look at the storage location, I see a new backup being stored in /var/lib/vault/snapshots.

If the token for the snapshot were the issue then the snapshot request would fail with a permissions error. The error you’re seeing is typically due to an autoseal issue - sounds like that’s the case here too.

Is this the tutorial you’re thinking of: “Auto-unseal using Transit Secrets Engine” on HashiCorp Learn?


Thanks for the response. I was referring to another tutorial that was on HashiCorp Learn, but it looks like it is no longer available. Issuing a new auto-unseal token fixed the issue. Thanks for the help!
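Since a fresh auto-unseal token fixed it, one way to avoid running into the max TTL again is to issue a periodic token, which has no max TTL as long as it keeps being renewed within its period. A rough sketch, to be run against the Vault that hosts the transit engine: the policy name `autounseal`, the 24h period, and both function names are assumptions, and `token_ttl` needs `jq`.

```shell
# Create a periodic orphan token for the transit seal.
# "autounseal" is a hypothetical policy granting access to the transit key.
new_unseal_token() {
  vault token create -orphan -policy="autounseal" -period=24h -field=token
}

# Print the remaining TTL (in seconds) of the token passed as $1.
token_ttl() {
  vault token lookup -format=json "$1" | jq -r '.data.ttl'
}
```

The transit seal renews its token automatically unless `disable_renewal` is set in the seal stanza, so a periodic token should stay valid for as long as the unsealed Vault keeps running. Checking `token_ttl` from a monitoring job would catch a token that has stopped renewing before the next restart does.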

Has this actually been resolved? I am getting the same error message on a test setup (5 nodes, integrated storage + external LB). About 4 out of 20 requests succeed, but which ones succeed appears to be random.

I was facing the same issue; however, I’m using auto-unseal with HA and Raft.

Per the comments above, I tried to re-key Vault per the instructions here but ran into another error around expired secrets.

While resolving the issues with expired secrets, I discovered that simply bringing down each of the nodes and allowing them to spin back up and auto-unseal resolved the issue - snapshot backup is now working as expected. The vault operator step-down command was super helpful for bringing down the active node.

As far as this error goes, I think Vault has some work to do. I’ve noted 3 mechanisms for creating snapshots. Only 1 flags the problem with the snapshot and fails; the other 2 produce corrupt binaries, which I only discovered when attempting to restore.

  1. vault cli - shows the error
  2. vault UI - creates broken binary
  3. curl - creates broken binary

Would hurt so bad to realise the backups you’ve been taking are corrupt when you go to restore!
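A cheap sanity check for snapshots pulled via curl or the UI is possible because the snapshot is a gzip-compressed tar archive, and the CLI error is about a missing SHA256SUMS.sealed entry. The sketch below only checks that the entry is present; it does not verify any checksums, so it is not a substitute for a test restore. Both function names are made up for illustration, and `VAULT_ADDR` and `VAULT_TOKEN` are assumed to be set.

```shell
# Download a snapshot over the HTTP API into the file named by $1.
fetch_snapshot() {
  curl --silent --fail --header "X-Vault-Token: ${VAULT_TOKEN}" \
    --output "$1" "${VAULT_ADDR}/v1/sys/storage/raft/snapshot"
}

# Return 0 only if the archive in $1 lists a SHA256SUMS.sealed entry.
snapshot_looks_complete() {
  local names
  # A file that is missing or not a gzip tar archive fails here.
  names="$(tar -tzf "$1" 2>/dev/null)" || return 1
  printf '%s\n' "$names" | grep -qxF 'SHA256SUMS.sealed'
}
```

A backup job could run `fetch_snapshot "$f" && snapshot_looks_complete "$f"` and alert when either step fails.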


Hi @andrew.klimovski ,

Could you file a GitHub issue for case (2) please? I don’t know how to make things better for curl, but we should be able to improve the UI.

Going through the change notes, it looks like this feature was rolled into release v1.9.0. However, my Vault instance is on v1.8.X and the CLI tool is on v1.9.2, which is likely why we weren’t seeing any issues via the UI or curl. Once we’ve upgraded, if the issue crops up again I’ll raise a ticket.

I was originally receiving this error due to one of the auto-unseal keys being expired.

Hmmm, if your auto-unseal “token” expires, you’d end up with a sealed instance. That’s probably a bigger deal than your backup not running. :)

There is a bug with snapshot save when running it from standby nodes.
This bug is tracked in hashicorp/vault issue #15258, “`vault operator raft snapshot save` and `restore` fail to handle redirection to the active node”.

The workaround at the moment is to run the snapshot against the active/leader node.
If you have Consul DNS set up, you can achieve this by setting:

VAULT_ADDR=https://active.vault.service.${dc}.consul:8200
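Without Consul DNS, the active node's address can also be looked up from any node through the unauthenticated `sys/leader` endpoint. A small sketch, assuming `curl` and `jq` are installed and that the reported `leader_address` is reachable from where the backup runs; the function name `active_addr` is hypothetical.

```shell
# Print the API address of the active node, as reported by the node at $1
# (defaults to the current VAULT_ADDR).
active_addr() {
  curl --silent --fail "${1:-$VAULT_ADDR}/v1/sys/leader" \
    | jq -r '.leader_address'
}

# Example use (not executed here):
#   VAULT_ADDR="$(active_addr)" vault operator raft snapshot save backup.snapshot
```

Resolving the address right before each backup keeps the job working after a leader election, at the cost of one extra request.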


I was just about to confirm this: when I run snapshot save on standby nodes, it says exactly “Error taking the snapshot: incomplete snapshot, unable to read SHA256SUMS.sealed file”, but when I run it on the leader, it works as it should.

On the leader node I get

Error taking the snapshot: Error making API request.

URL: GET http://127.0.0.1:8200/v1/sys/storage/raft/snapshot
Code: 403. Errors:

* permission denied

but on all the other nodes I get

Error taking the snapshot: incomplete snapshot, unable to read SHA256SUMS.sealed file

What is wrong here?

Hello

I would say that, on the leader node, you are using a token without permission to create snapshots, given this error:

Error taking the snapshot: Error making API request.

URL: GET http://127.0.0.1:8200/v1/sys/storage/raft/snapshot
Code: 403. Errors:

* permission denied

and regarding

Error taking the snapshot: incomplete snapshot, unable to read SHA256SUMS.sealed file

this seems to be the default error message you get when you try to create a snapshot on a standby node (not the leader node).
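If the 403 comes from a non-root token, the policy attached to that token needs read access to the snapshot endpoint shown in the error URL. A minimal sketch, where the policy name `raft-snapshot` and the function name are placeholders:

```shell
# Write a policy that allows taking Raft snapshots.
# $1 is the policy name (hypothetical default: raft-snapshot).
write_snapshot_policy() {
  vault policy write "${1:-raft-snapshot}" - <<'EOF'
path "sys/storage/raft/snapshot" {
  capabilities = ["read"]
}
EOF
}
```

The policy then has to be attached to whatever role or token the backup uses; how that is done depends on the auth method.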

I used the root token that was generated after installation.

Are you using -dev mode? (I see you use the http scheme.) Perhaps it does not work with -dev mode because that uses in-memory storage, not Raft.

It doesn’t matter whether you use the root token if the node is not the leader. Snapshots are available only on the leader.