Vault transit key versions lost/changed after service restart

Hello there,

We’re using multiple Vault servers with PostgreSQL as the storage backend and have run into a problem: after a Vault service restart, none of the transit key versions created before the restart can decrypt previously encrypted content. The versions were not lost, as we can still see the original number of key versions, but they are no longer usable.

The failed decryption error message is

Code: 400. Errors:

* cipher: message authentication failed

New versions of the transit key, created by rotating it after the restart, work as expected.

The service had been up for multiple months before we restarted it, if that is helpful in any way.
We first suspected auto-rotation or the minimum decryption version, but neither appears to be the cause. This is the redacted output for the transit key:

vault read transit/keys/flux-secrets
Key                       Value
---                       -----
allow_plaintext_backup    false
auto_rotate_period        0s
deletion_allowed          false
derived                   false
exportable                false
keys                      map[1:1658327089 10:1658364792 …]
latest_version            6508
min_available_version     0
min_decryption_version    1
min_encryption_version    0
name                      flux-secrets
supports_decryption       true
supports_derivation       true
supports_encryption       true
supports_signing          false
type                      aes256-gcm96

We are also unable to reproduce this in QA.
After creating over 6000 transit key versions and then restarting the Vault service, we can still use the original v1 version of the transit key to decrypt.
Maybe there’s a time component I’m missing here.

Any help would be greatly appreciated.

Best regards,

Is ha_enabled set to true in the storage configuration, so that the Vault servers correctly use the database to negotiate which one of them is active?

If not, each server keeps its own conflicting in-memory cache of the storage, and they overwrite each other's writes to the database.
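To illustrate why that would produce "message authentication failed" after a restart, here is a minimal toy model, not Vault's actual code. It assumes each server caches the whole key policy in memory and writes it back to storage as a single entry on rotation. The class names, the key name "demo", and the use of an HMAC tag in place of AES-GCM are all made up for the illustration.

```python
import hashlib
import hmac
import os


class SharedStorage:
    """Stands in for the shared PostgreSQL backend: one policy blob per key name."""

    def __init__(self):
        self.policies = {}


class Node:
    """Hypothetical model of a Vault server running without HA coordination."""

    def __init__(self, storage):
        self.storage = storage
        self.cache = {}  # key name -> {version: key material}

    def _policy(self, name):
        # Storage is read only on a cache miss; afterwards the cache is trusted.
        if name not in self.cache:
            self.cache[name] = dict(self.storage.policies.get(name, {}))
        return self.cache[name]

    def rotate(self, name):
        policy = self._policy(name)
        version = len(policy) + 1
        policy[version] = os.urandom(32)
        # Assumption: the whole policy is written back as a single entry,
        # replacing whatever another node stored in the meantime.
        self.storage.policies[name] = dict(policy)
        return version

    def encrypt(self, name, plaintext):
        policy = self._policy(name)
        version = max(policy)
        # An HMAC tag models the GCM authentication tag; nothing is really encrypted.
        tag = hmac.new(policy[version], plaintext, hashlib.sha256).digest()
        return version, plaintext, tag

    def decrypt(self, name, message):
        version, plaintext, tag = message
        key = self._policy(name)[version]
        expected = hmac.new(key, plaintext, hashlib.sha256).digest()
        if not hmac.compare_digest(tag, expected):
            raise ValueError("cipher: message authentication failed")
        return plaintext


def simulate():
    storage = SharedStorage()
    node_a, node_b = Node(storage), Node(storage)

    node_a.rotate("demo")                        # A creates v1
    node_b.encrypt("demo", b"warm-up")           # B caches the policy with v1 only
    node_a.rotate("demo")                        # A stores its own v2
    node_b.rotate("demo")                        # B stores a different v2, replacing A's
    message = node_a.encrypt("demo", b"secret")  # uses A's cached v2

    before = node_a.decrypt("demo", message)     # still works from A's cache

    restarted = Node(storage)                    # a restart drops the cache
    try:
        restarted.decrypt("demo", message)
        after = "ok"
    except ValueError as err:
        after = str(err)

    restarted.rotate("demo")                     # v3, created after the restart
    fresh = restarted.decrypt("demo", restarted.encrypt("demo", b"new"))
    return before, after, len(storage.policies["demo"]), fresh
```

In this model the version count in storage stays intact and versions created after the restart work, while a ciphertext produced under the overwritten version fails authentication once the cache is gone.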

Thank you, that was indeed the reason.
ha_enabled was not set in LIVE, although it was set in PRELIVE and QA.

We tried to reproduce the problem in QA by removing ha_enabled from the storage configuration, then restarting and even rebooting this Vault instance, but for some reason it still used the previous “correct” transit keys.

We then tested a service restart in LIVE with ha_enabled set, and the transit keys still worked afterwards.

Again thank you @maxb

Reproducing this scenario would require multiple Vault instances using the same storage at the same time. An access pattern in which the writes that create key versions are spread across these instances would likely be an effective trigger for stored keys being overwritten.
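A rough driver for that access pattern could look like the sketch below. It assumes two QA servers that share one PostgreSQL backend without ha_enabled, a transit engine mounted at transit/, and a token allowed to create, rotate, encrypt, and decrypt. The addresses and the key name are hypothetical placeholders. It only uses the transit HTTP endpoints for creating a key, rotating it, encrypting, and decrypting.

```python
import base64
import json
import urllib.error
import urllib.request

# Hypothetical addresses of two Vault servers sharing one PostgreSQL backend
# WITHOUT ha_enabled; replace with your own QA hosts.
NODES = ["http://vault-a.qa.example:8200", "http://vault-b.qa.example:8200"]
KEY = "repro-key"  # hypothetical throwaway transit key


def build_request(addr, token, path, payload=None):
    """Build a POST request for the transit engine, assumed mounted at transit/."""
    body = json.dumps(payload or {}).encode()
    return urllib.request.Request(
        f"{addr}/v1/transit/{path}",
        data=body,
        headers={"X-Vault-Token": token, "Content-Type": "application/json"},
        method="POST",
    )


def call(addr, token, path, payload=None):
    """Send the request and return the parsed JSON body (empty dict if none)."""
    with urllib.request.urlopen(build_request(addr, token, path, payload)) as resp:
        raw = resp.read()
    return json.loads(raw) if raw else {}


def reproduce(token, rounds=20):
    """Alternate rotations across the nodes and keep one ciphertext per round."""
    plaintext = base64.b64encode(b"canary").decode()
    call(NODES[0], token, f"keys/{KEY}")  # create the key on the first node
    samples = []
    for i in range(rounds):
        addr = NODES[i % len(NODES)]
        call(addr, token, f"keys/{KEY}/rotate")
        out = call(addr, token, f"encrypt/{KEY}", {"plaintext": plaintext})
        samples.append(out["data"]["ciphertext"])
    return samples


def verify(token, samples, addr=NODES[0]):
    """After restarting the servers, return the ciphertexts that no longer decrypt."""
    failed = []
    for ciphertext in samples:
        try:
            call(addr, token, f"decrypt/{KEY}", {"ciphertext": ciphertext})
        except urllib.error.HTTPError:
            failed.append(ciphertext)
    return failed
```

The idea would be to run reproduce() once, restart the Vault services, and then pass the saved ciphertexts to verify(). Whether this actually triggers the overwrite depends on how each server caches the key, so treat it as a starting point rather than a guaranteed reproduction.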