We have an environment with a mix of manually unsealed Vault servers (we call these the “master” Vaults) and a number of auto-unsealed Vaults that use the transit provider of the “master” Vault.
In our lab environment, the “masters” can no longer be manually unsealed. Luckily, this is our lab environment, where folks do POCs and test infrastructure/core changes.
As you may understand, having Vault suddenly be unable to unseal is a major concern, and something I want to get to the root cause of.
I’m trying to understand the steps involved in unsealing so I can debug where the unseal process has gone wrong. What steps does Vault actually perform to unseal? (i.e. fetch /vault/core/seal-config, gather keys, shamir.Combine() to form the master key, then decrypt ?? with that key and verify against value ?? to get the disk encryption keys(?), whereupon the Vault is declared successfully unsealed?)
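To make the "gather keys, then combine" step concrete, here is a minimal sketch of Shamir split/combine. It is a toy over a prime field, not Vault's code (Vault's shamir package works byte by byte over GF(2^8)), and the names `split`, `combine` and `PRIME` are made up for illustration. What it shows is that combining never fails by itself: a wrong share still yields a key, just a different one. If Vault behaves the same way, an "invalid key" error would only surface when that key is later used to decrypt something.

```python
import random

# Toy field modulus (a Mersenne prime). Assumption for illustration only;
# Vault's real implementation uses GF(2^8) arithmetic on each byte.
PRIME = 2**127 - 1

def split(secret, shares, threshold, rng=None):
    """Return `shares` points on a random polynomial of degree threshold-1
    whose constant term is the secret."""
    rng = rng or random.SystemRandom()
    coeffs = [secret] + [rng.randrange(1, PRIME) for _ in range(threshold - 1)]
    points = []
    for x in range(1, shares + 1):
        y = 0
        for c in reversed(coeffs):  # Horner's rule
            y = (y * x + c) % PRIME
        points.append((x, y))
    return points

def combine(points):
    """Lagrange-interpolate the polynomial at x = 0 to recover the secret."""
    secret = 0
    for i, (xi, yi) in enumerate(points):
        num, den = 1, 1
        for j, (xj, _) in enumerate(points):
            if i != j:
                num = (num * -xj) % PRIME
                den = (den * (xi - xj)) % PRIME
        # pow(den, -1, PRIME) is the modular inverse (Python 3.8+).
        secret = (secret + yi * num * pow(den, -1, PRIME)) % PRIME
    return secret

# 8 key holders with a threshold of 2, mirroring the lab described here.
secret = 123456789
shares = split(secret, shares=8, threshold=2)

# Any two genuine shares reconstruct the secret.
print(combine(shares[:2]) == secret)

# A share that is off by one still combines without any error,
# but the result is simply a different key.
tampered = (shares[1][0], (shares[1][1] + 1) % PRIME)
print(combine([shares[0], tampered]) == secret)
```

In this toy, nothing in `combine` can tell a good share from a bad one, so the only way to detect a wrong key is to try to use it.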
The 8 key holders in the lab are absolutely adamant that these are the correct keys and that they are the same ones used in February, when this Vault was last unsealed. We run other manually unsealed Vault servers in non-prod and prod environments managed by this same team that are still unsealing fine. I’m confident they’re telling the truth and that a rekey hasn’t happened.
Environment:
2x Vault 1.3.2 servers on VMs (recently upgraded directly from 1.1.2 in February)
3x etcd servers holding the Vault data
1 of 2 Vault servers is still running as active, with full access to the data and apparently working fine(?)
2nd Vault server is sealed and won’t unseal.
What we’ve done/tried:
Unsealing: The lab has a quorum of two. Upon entry of key 1, the log shows the expected “[DEBUG] core: unseal key supplied” then “[DEBUG] core: cannot unseal, not enough keys:”. Upon entry of the 2nd key, the “[DEBUG] core: unseal key supplied” message appears, but then nothing else shows up in the logs. The user sees “Unseal failed, invalid key” and the unseal count reverts to 0/2.
Rekeying: Also unable to reconstruct the master key, so rekey fails.
Checked etcd for corruption. None readily apparent. Revisions are comparable across nodes, all nodes report healthy.
Spun up a 2nd instance of the above environment in lab. It was deployed via terraform & puppet, so we’re confident it’s configured exactly the same. Restored an etcd snapshot from a non-production Vault with manual unsealing and verified that we can unseal it fine. Then wiped etcd and restored the lab etcd snapshot from the Vault with manual unsealing. It restores fine and etcd says it’s all consistent, but we get the same behavior as in the original lab environment: cannot unseal with the expected keys.
Replaced 1.3.2 Vault binary with 1.2.4 on the off chance it’s something to do with new stored shares functionality. No dice. Still no unsealing.
Checked we’re still using legacy unseal by checking the core/seal-config key: /vault/core/seal-config
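For the seal-config check, a small sketch of how the value could be inspected once it has been read out of etcd. The field names (`type`, `secret_shares`, `secret_threshold`, `stored_shares`) are my understanding of the JSON Vault stores there and should be verified against your own data. Treating `stored_shares` of 0 as "legacy" is likewise an assumption. The sample string is made up for illustration and is not the contents of our lab key, and `describe_seal_config` is a hypothetical helper name.

```python
import json

def describe_seal_config(raw):
    """Parse a seal-config JSON document and summarise the fields of interest."""
    cfg = json.loads(raw)
    return {
        "type": cfg.get("type"),
        "shares": cfg.get("secret_shares"),
        "threshold": cfg.get("secret_threshold"),
        # Assumption: stored_shares == 0 (or absent) indicates the legacy
        # Shamir layout, i.e. no stored shares in use.
        "legacy": cfg.get("stored_shares", 0) == 0,
    }

# Hypothetical sample value, NOT taken from the real cluster.
sample = '{"type": "shamir", "secret_shares": 8, "secret_threshold": 2, "stored_shares": 0}'
summary = describe_seal_config(sample)
print(summary)
```

Running the same check against the snapshot from the working non-production Vault and the one from the lab would show whether the two configs differ in threshold or stored shares.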