New Vault node is not joining the existing cluster after OS reinstall and data wipe

Hi folks, interesting problem I’ve been banging my head against for the past few hours. We have a 3-node Vault cluster that’s been working fine for a long time, currently on Vault 1.14.0 with integrated storage.

Earlier today one of the nodes in the cluster started “acting funny” (co-worker’s words), and we discovered that a disk-space alert had never been sent: the machine had run itself to 100% disk usage. This, of course, makes Vault and everything else very unhappy.

After cleaning up some old logs we were still seeing high disk usage, and found that the vault.db file sat nice and fat at 49 GB. From past experience I remembered that starting Vault on a database file this big takes a very long time, so I decided to do some maintenance: clear out all of Vault’s data on that node and reconfigure raft storage to write to a newly attached volume where we have all the space we want (and where log files filling the root disk no longer cause terminal Vault angst).

I also upgraded that node to the most recent version of Vault (1.15.2). Then, on the existing cluster (now down to 2 nodes), I ran vault operator raft remove-peer to remove the dead peer so I could do a “clean” join.
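For reference, the CLI command is just a wrapper around the HTTP API (sys/storage/raft/remove-peer). Here’s a minimal Python sketch that builds — but deliberately does not send — the equivalent request, using our (redacted) leader address and dead node ID:

```python
import json

def remove_peer_request(vault_addr: str, server_id: str) -> tuple[str, bytes]:
    """Build the URL and JSON body for Vault's raft remove-peer API.

    Equivalent to: vault operator raft remove-peer <server_id>
    (POST /v1/sys/storage/raft/remove-peer). Nothing is sent here;
    this only shows the request shape.
    """
    url = f"{vault_addr}/v1/sys/storage/raft/remove-peer"
    body = json.dumps({"server_id": server_id}).encode()
    return url, body

# The addresses/IDs below are our redacted values, not real hosts.
url, body = remove_peer_request("http://10.x.x.11:8200", "vault-xyz-1")
print(url)
print(body.decode())
```

In practice you would send this as an authenticated POST (X-Vault-Token header) against the active node; the CLI did that for us.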

The new node was started and vault operator raft join was run, pointed at the current leader node in the cluster. This ran successfully, and in the log files we saw this:

core: failed to retry join raft cluster: retry=2s err="waiting for unseal keys to be supplied"

As expected, really. We entered the unseal keys and found the following in the log files:

core: attempting to join possible raft leader node: leader_addr=http://10.x.x.11:8200
core.cluster-listener: serving cluster requests: cluster_listen_address=10.x.x.9:8201
storage.raft: creating Raft: config="&raft.Config{ProtocolVersion:3, HeartbeatTimeout:15000000000, ElectionTimeout:15000000000, CommitTimeout:50000000, MaxAppendEntries:64, BatchApplyCh:true, ShutdownOnRemove:true, TrailingLogs:0x2800, SnapshotInterval:120000000000, SnapshotThreshold:0x2000, LeaderLeaseTimeout:2500000000, LocalID:\"vault-xyz-1\", NotifyCh:(chan<- bool)(0xc003802700), LogOutput:io.Writer(nil), LogLevel:\"DEBUG\", Logger:(*hclog.interceptLogger)(0xc00362bd40), NoSnapshotRestoreOnStart:true, skipStartup:false}"
storage.raft: initial configuration: index=1 servers="[{Suffrage:Voter ID:vault-abc-1 Address:10.x.x.10:8201} {Suffrage:Voter ID:vault-bcd-1 Address:10.x.x.11:8201} {Suffrage:Nonvoter ID:vault-xyz-1 Address:10.x.x.9:8201}]"
core: successfully joined the raft cluster: leader_addr=http://10.x.x.11:8200

So far, so good, you would imagine. The next message was a bit strange:

storage.raft: entering follower state: follower="Node at 10.x.x.9:8201 [Follower]" leader-address= leader-id=

There’s no leader-address or -id listed…

Then we got this one:

storage.raft: failed to get previous log: previous-index=628978498 last-index=1 error="log not found"

And that’s where it’s been sitting for the past 30 minutes. There is no apparent traffic to or from the server, so it doesn’t look like it’s copying the raft DB from the leader at this point.

vault operator raft list-peers shows the following:

Node            Address           State       Voter
----            -------           -----       -----
vault-abc-1    10.x.x.10:8201    follower    true
vault-bcd-1    10.x.x.11:8201    leader      true
vault-xyz-1    10.x.x.9:8201     follower    false

vault operator raft autopilot state shows:

Healthy:                         false
Failure Tolerance:               0
Leader:                          vault-bcd-1

      Name:              vault-abc-1
      Address:           10.x.x.10:8201
      Status:            voter
      Node Status:       alive
      Healthy:           true
      Last Contact:      2.778542043s
      Last Term:         1197
      Last Index:        628986081
      Version:           1.14.0
      Node Type:         voter

      Name:              vault-bcd-1
      Address:           10.x.x.11:8201
      Status:            leader
      Node Status:       alive
      Healthy:           true
      Last Contact:      0s
      Last Term:         1197
      Last Index:        628986104
      Version:           1.14.0
      Node Type:         voter

      Name:              vault-xyz-1
      Address:           10.x.x.9:8201
      Status:            non-voter
      Node Status:       alive
      Healthy:           false
      Last Contact:      16m7.9159257s
      Last Term:         0
      Last Index:        0
      Version:           1.14.0
      Node Type:         voter
The strange thing is that it reports the new node’s version as 1.14.0, while vault -version on that node shows 1.15.2; the new node also seems to have no interest at all in talking to the leader.
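To put the lag in concrete terms, here’s a quick sketch using the Last Index numbers from the autopilot output above. (The dict field names below are my own shorthand, not necessarily the exact keys you’d get from the JSON output of autopilot state.)

```python
# Index lag per node, with values copied from the autopilot output above.
# A healthy follower trails the leader by a handful of entries; the new
# node is at index 0, i.e. it has replicated nothing at all.
servers = {
    "vault-abc-1": {"last_index": 628986081, "healthy": True},
    "vault-bcd-1": {"last_index": 628986104, "healthy": True},   # leader
    "vault-xyz-1": {"last_index": 0,         "healthy": False},  # new node
}

leader_index = servers["vault-bcd-1"]["last_index"]
for name, info in servers.items():
    lag = leader_index - info["last_index"]
    print(f"{name}: lag={lag} healthy={info['healthy']}")
```

The existing follower is 23 entries behind (fine); the new node is the entire log behind, which matches the “log not found” error — it has nothing to append to and needs a full snapshot first.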

The new node is also still sealed; running the unseal again (with the unseal keys the cluster was originally initialized with) leads to:

Error unsealing: Error making API request.

URL: PUT http://10.x.x.9:8200/v1/sys/unseal
Code: 400. Errors:

* Vault is not initialized

So. What the hell do I do now? The idea was that if any node goes down, we can spin up a fresh node at any time and join it to the cluster. That doesn’t seem to be the case: we spun up a completely new instance, did the installation, ran the raft join followed by an unseal (which it does seem to accept), and it went right back into this weird fugue state where it just sits there, seemingly doing nothing.
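For anyone trying to follow along, this is how I’m reading the state of the joining node via sys/seal-status (the initialized and sealed fields are real fields in that endpoint’s response; the classification strings are my own):

```python
def classify(status: dict) -> str:
    """Interpret a (simplified) GET /v1/sys/seal-status response.

    A raft node that has joined but not yet received the leader's
    snapshot reports initialized=false, which is exactly the
    "Vault is not initialized" error we get on unseal.
    """
    if not status.get("initialized"):
        return "no raft data yet - still waiting for the leader's snapshot"
    if status.get("sealed"):
        return "has data but sealed - needs unseal keys"
    return "unsealed and serving"

# Sample response, reduced to the two fields used here; this mirrors
# what our new node currently reports.
sample = {"initialized": False, "sealed": True}
print(classify(sample))
```

So the 400 error is, as far as I can tell, consistent with a node that joined the raft config but never received the initial snapshot.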

A quick check with iptraf does show some traffic between the node and the leader (on port 8201), but it’s moving at <10 kB/s - which means that if the raft DB really is 40+ GB in size, I’ll be here for another year before it’s ever transferred.
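Back-of-envelope arithmetic on that transfer rate (assuming a flat 10 kB/s, which is being generous):

```python
# How long does a ~40 GB raft database take at ~10 kB/s?
db_bytes = 40e9   # ~40 GB (the vault.db on disk was actually ~49 GB)
rate = 10e3       # observed ceiling, bytes/sec
seconds = db_bytes / rate
print(f"{seconds / 86400:.0f} days")  # ≈ 46 days
```

Okay, not literally a year - but a month and a half to rejoin one node is still a complete non-starter.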

So. Questions are:

  1. How the hell does a raft DB get that big? I’m pretty sure the freelist fix was introduced well before 1.14 (although the actual raft DB file may date back to the first release that incorporated it).

  2. Why does the node not join? At least, it looks like it may be joining, but it would help if vault status showed something about this; for all intents and purposes we’re now in “is it going to work or not” limbo, and I’m down to 2 of 3 nodes operational, which makes me nervous.

  3. Is it at all possible to reload a node’s raft database from a snapshot? If so, how?