Vault HA Nodes Disagree

Hello all,

I’m having a very odd issue where, after a vault operator rekey operation, the nodes in the HA cluster disagree.

I’ve attempted to -cancel the operation and restart it, but nothing has worked. When we attempted the rekey, we used 9 shares with a threshold of 4. Even though 5 people (the current threshold) entered their keys, the rekey simply did not occur, and the progress of the operation with the same nonce reset itself to 0/5 after the 5 keys were submitted.

Here’s a screenshot of the same status command run directly against the lead node (not through the LB), where the Rekey Progress inexplicably resets itself.

And a screenshot of the status of each node in the HA cluster:

As you can see, the nodes in the same cluster disagree about their shares and threshold values, and none of them match what we were attempting, which was 9 shares and a threshold of 4.

I did a quick search of the logs for the /sys/rekey endpoints, and the only things that show up are the calls to /sys/rekey/backup from when I attempted to see what the current keys were.

When I started the re-key operation, my command was similar to the following:

vault operator rekey \
-init \
-key-shares=9 \
-key-threshold=4 \
-pgp-keys="key1.asc,key2.asc,key3.asc,key4.asc,key5.asc,key6.asc,key7.asc,key8.asc,key9.asc"
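For reference, the same rekey status the CLI reports can also be read over the HTTP API (GET /v1/sys/rekey/init), which is how I later compared nodes. A minimal Python sketch of pulling out the fields in question; the helper names are just illustrative, and the field names are the ones Vault documents for the rekey status response:

```python
import json
import urllib.request

def summarize_rekey_status(status):
    """Condense a GET /v1/sys/rekey/init response into one line.

    The fields used (started, t, n, progress, nonce) are part of
    Vault's documented rekey status response.
    """
    if not status.get("started"):
        return "no rekey in progress"
    return (f"nonce={status.get('nonce')} "
            f"shares={status.get('n')} threshold={status.get('t')} "
            f"progress={status.get('progress')}/{status.get('t')}")

def fetch_rekey_status(vault_addr):
    # Unauthenticated read of the rekey progress endpoint on one node.
    with urllib.request.urlopen(f"{vault_addr}/v1/sys/rekey/init") as resp:
        return json.load(resp)
```

fetch_rekey_status needs a live server, but summarize_rekey_status can be sanity-checked offline against a sample response.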

Does anyone have any clues as to what I could attempt here?

Hi Stuart,

So we have 10/5, 8/4, and 9/4 as key configurations across those screenshots and your request. I don’t know of any way that config could change other than via rekey requests, and I’m surprised you don’t see them in the audit logs. Perhaps some were issued on other servers and you’re using file audit logs, in which case you’d need to look in those other servers’ local logs?

I’m happy to help try and explain what happened, but I’d like to know the actual sequence of events, so let’s try and sort that out first.

Alright, let’s try and construct a timeline here for you.

  • Prior to April 20th, we had an 8 share / 4 threshold setup.
  • On April 20th, we conducted a rekey operation via the Load Balancer that we assumed was successful. At this time, we intended to set a 10 share / 5 threshold setup. The active node reported success and our status query seemed normal. The rest of the Vault endpoints were working normally.
  • On June 11, we conducted a rekey operation, initially via the Load Balancer. At this time, we intended to set a 9 share / 4 threshold setup. The Active node gave us a nonce from the init operation and the nonce was distributed to our operator pool.
  • Using the nonce value, we asked each operator to run vault operator rekey -nonce=<nonce> against the load balancer. I functioned as the lead during this time, sending the init and, after submitting my own key, checking status with vault operator rekey -nonce=<nonce> -status.
  • During the first attempt, everything seemed to progress normally, until the 5th key was submitted. Normally, when the final key is submitted, the last operator gets a response containing the encrypted keys, which are then shared back with the team. None of the operators ever got this response. Additionally, during the time when people were submitting keys, I noticed while watching the status that the Progress suddenly reset itself.
  • Because we could not get the first operation to work, we ran vault operator rekey -cancel via the CLI tool. For the second attempt, we simply restarted the process, and the same thing happened as in the first attempt.
  • However, during this time we set a watch on vault operator rekey -nonce=<nonce> -status and discovered that responses from Vault were becoming inconsistent. At one point the status suggested that the operator key share configuration had been updated to 8 shares and a threshold of 4, which prompted me to run vault operator rekey -backup-retrieve to see what the active keys were. That command gave me 10 keys in response, which seemed to indicate further inconsistency in the responses.
  • Suspecting some type of issue with the load balancer, given the inconsistent responses, we decided to skip it and hit the lead node directly. We set VAULT_ADDR to https://10.x.x.x:8200 and used the -tls-skip-verify flag in our commands, as TLS had been set up at the LB level. We ensured that each operator used the exact same command against the same node when submitting their keys, entering them interactively via the command-line tool.
  • Skipping the LB did not produce any different results. We still saw the same response from Vault, where the Progress reset itself to 0 after sufficient keys were submitted.
  • We attempted the direct pattern twice, using the -cancel operation between attempts and generating new nonce values each time.
  • After an hour of this, I called the rekey session off and began troubleshooting.
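To make the per-node comparison concrete, here is roughly the check we were doing by eye, expressed as an illustrative Python sketch; the node names and helper function are hypothetical, and the inputs are the status dicts returned by each node’s GET /v1/sys/rekey/init:

```python
def find_disagreements(statuses, fields=("n", "t", "nonce")):
    """Given {node_name: rekey_status_dict}, return the fields on
    which the nodes disagree, mapped to each node's reported value."""
    out = {}
    for field in fields:
        values = {node: s.get(field) for node, s in statuses.items()}
        if len(set(values.values())) > 1:
            out[field] = values
    return out
```

In a healthy cluster this should return an empty dict; in our case the shares and threshold fields diverged between nodes.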

We have set up the audit device and have a Splunk forwarder consuming it. When searching the Splunk index for Vault, I first did a blanket search for /sys/rekey, which returned results. However, it became clear that the only audit entries being reflected in the logs were on the /sys/rekey/backup endpoint. I could not see any requests or responses logged on the /sys/rekey/init, /sys/rekey/update, or /sys/rekey/verify endpoints, even though I sent CLI commands that should have hit all of them.
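As an aside, for anyone grepping raw file audit logs rather than Splunk: the file audit device writes one JSON object per line, with the request path under request.path (recorded without a leading slash, e.g. "sys/rekey/backup"). The equivalent filter can be sketched like this; the helper name is illustrative:

```python
import json

def filter_audit_lines(lines, path_prefix="sys/rekey"):
    """Yield parsed audit entries whose request path starts with
    the given prefix. Blank lines are skipped."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        entry = json.loads(line)
        path = entry.get("request", {}).get("path", "")
        if path.startswith(path_prefix):
            yield entry
```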

At this point we’re not sure of the health of our cluster. Perhaps adding nodes to the HA pool would allow us to retire the badly behaving ones. We’ve got a number of theories, but we’re unsure of tests that would allow us to move forward confidently.

Thanks for the detailed timeline. I have an explanation for the missing audit logs: only “authenticated” requests get audited, and by that we really mean “Vault token authenticated”. The rekey requests are authenticated via the unseal or recovery keys, and so aren’t getting audited. We recently added auditing to the generate-root endpoint, which was in the same boat. Would you mind opening another GH issue to ask for auditing of the rekey endpoints?

Now, as to your issue: I still don’t have a theory as to what’s happening, but I will dig in more in light of your timeline. Have you looked in the server logs too? Issues that occur during rekeying are the kind of thing that would likely result in an error log output.

I have created the GH issue as per your suggestion. Will report back after I have taken a look at the lead node’s logs.