I’m running a 5-node Vault cluster on Kubernetes which is unsealed by a separate cluster using the transit method. It had been working well until today, when I got an alert stating I didn’t have at least 3 nodes. Sure enough, only one node is showing ready, and even that one is having problems. Logging into that node…
Key                      Value
---                      -----
Recovery Seal Type       shamir
Initialized              true
Sealed                   false
Total Recovery Shares    5
Threshold                3
Version                  1.8.5+ent
Storage Type             raft
Cluster Name             vault-cluster-1c47917a
Cluster ID               256b9c98-2e48-41d1-2c26-7c405be5fa17
HA Enabled               true
HA Cluster               n/a
HA Mode                  standby
Active Node Address      <none>
Raft Committed Index     1395606
Raft Applied Index       1395605
/ $ vault operator raft list-peers
Error reading the raft cluster configuration: Error making API request.
URL: GET http://127.0.0.1:8200/v1/sys/storage/raft/configuration
Code: 500. Errors:
* local node not active but active cluster node not found
The remote Vault with the auto-unseal transit engine is up and operational. A few questions:
What may have caused this? The pods have been up for 47 hours, which puts their last restart well outside the window when this went down.
Is there an easy and/or good way to bring this raft cluster back, or do I need to start over and restore from a backup?
Restore from backup is almost NEVER the answer. Unless there was a hardware issue or someone threw a wrench into the works, it should be recoverable.
I was going to say that the most common issue I’ve seen in my environment with this is that the token used to access the transit engine has expired. However, your one node says it’s unsealed, but I question that answer.
What is your backend for this cluster? Is this a PR (performance replication) cluster?
Since you’re on 1.8, also try running vault operator diagnose -config=... on your node to see if it can point anything out.
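If you still have the transit token value handy, a quick way to check whether it has expired is a token lookup against the transit cluster. A minimal sketch; the address is a placeholder for your environment:

VAULT_ADDR=https://transit-vault.example.com:8200 vault token lookup <your-transit-token>

The ttl and expire_time fields in the output show how long the token has left.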
My suggestion is:
1. Generate a new token on the transit engine cluster. [Just in case; it won’t hurt anything.]
2. Set the token (or remove the old one from the config if you have it hard coded).
3. Set your log_level="DEBUG" and restart the pods.
4. Check the logs, paying attention to the init and unseal portions specifically.
5. Check the cluster status.
6. If you’re still down, upload the logs to pastebin or somewhere and share the link here.
Excellent information. Here’s output from the “working” node.
/ $ vault operator diagnose -config=/vault/config/extraconfig-from-values.hcl
Vault v1.8.5+ent (10abdf02c159597fd916260e795c5dd480d4fb18)
Results:
[ failure ] Vault Diagnose
[ success ] Check Operating System
[ success ] Check Open File Limits: Open file limits are set to 1048576.
[ success ] Check Disk Usage: /vault/data usage ok.
[ success ] Check Disk Usage: /vault/config usage ok.
[ success ] Check Disk Usage: /home/vault usage ok.
[ success ] Check Disk Usage: /vault/file usage ok.
[ success ] Check Disk Usage: /etc/hosts usage ok.
[ success ] Check Disk Usage: /dev/termination-log usage ok.
[ success ] Check Disk Usage: /etc/hostname usage ok.
[ success ] Check Disk Usage: /etc/resolv.conf usage ok.
[ success ] Check Disk Usage: /vault/logs usage ok.
[ success ] Parse Configuration
[ failure ] Check Storage: Diagnose could not initialize storage backend.
[ failure ] Create Storage Backend: Error initializing storage of type raft: failed to create fsm: failed to open bolt file: timeout
As you can see from the output, I’m using raft integrated storage as the backend. I’m unclear how to perform step 2. According to the documentation (Auto-unseal using Transit Secrets Engine | Vault - HashiCorp Learn), the token is set when I run vault operator init, but the Vault is already initialized in this case.
Also, is there a way to list all active unseal tokens and see what their expiration is? I’d agree this is probably what happened. Does renewing these tokens require manual intervention or programming via the API, or is there a setting to handle this rotation automatically?
On the instance that has the seal, find the policy that has the right permissions to unseal; let’s assume it’s called autounseal-c1-policy. Then:
vault token create -orphan -no-default-policy -policy=autounseal-c1-policy
Set that token in your config (hard code it for now to get your cluster up and running). That’s it.
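For reference, a minimal sketch of what the transit seal stanza looks like with the token hard coded; the address, key_name, and mount_path values here are placeholders for your environment:

seal "transit" {
  address    = "https://transit-vault.example.com:8200"
  token      = "s.XXXXXXXXXXXXXXXX"   # the orphan token created above
  key_name   = "autounseal"
  mount_path = "transit/"
}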
For the future, my recommendation is to set up userpass or a client cert to authenticate a Vault agent. The agent can then get a token with the autounseal policy and renew it so that it doesn’t run out. Use a sink file in the agent so that the token is available externally, and Vault can use that token to auto-unseal.
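A rough sketch of the agent side, assuming cert auth; the address, mount path, and sink path are placeholders you’d adjust for your environment:

vault {
  address = "https://transit-vault.example.com:8200"
}

auto_auth {
  method "cert" {
    mount_path = "auth/cert"
  }

  sink "file" {
    config = {
      path = "/vault/token/seal-token"   # the renewed token lands here for the seal to consume
    }
  }
}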
This broke the cluster loose. Pods are back up and raft is happy. I’m going to look into user/pass combinations or client certificates instead of tokens to avoid this problem.
Now, it does open the next question, which may make sense as a separate thread: what is the recommended method of watching for, and potentially alerting on, expiring tokens? I’m assuming Prometheus has a metric related to this, but if there’s a better method I’m open to hearing it.
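One option in the meantime is to walk the token accessors and flag anything close to expiry. A hedged sketch, assuming jq is available and a token with sudo capability on auth/token/accessors; the 86400-second threshold is illustrative:

# list every token accessor, look up each one, and print tokens expiring within a day
for a in $(vault list -format=json auth/token/accessors | jq -r '.[]'); do
  ttl=$(vault write -format=json auth/token/lookup-accessor accessor="$a" | jq -r '.data.ttl')
  # a ttl of 0 means the token never expires, so skip those
  if [ "$ttl" -gt 0 ] && [ "$ttl" -lt 86400 ]; then
    echo "accessor $a expires in ${ttl}s"
  fi
done

You could run something like this on a schedule and feed the results to whatever alerting you already have.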