[SOLVED] Unable to rejoin cluster after remove-peer

I’m working through an upgrade from a single-node, filesystem-backed Vault to a 3-node Raft-backed cluster. Currently I’m just doing due diligence and stepping through things in an isolated lab.

After migrating my first test host, I added in a second node, joining it manually to the cluster.
I then removed the node using vault operator raft remove-peer and rebuilt the whole machine using a script. (This left the node ID and the NIC’s MAC address the same.)

Upon rebuild I attempted to manually join the second node to the cluster. It fails to (re)join the cluster, and I can’t figure out why or how to correct the problem.

I’ve verified connectivity between the two nodes (both can ping the other, and can connect to port 8200 via nc).

I’ve enabled debug logging on the running, unsealed first node, and on the one that won’t connect. Neither logs anything useful. The final message on the “joiner” is

Sep 16 13:30:21 dev-vault02 vault[88067]: 2021-09-16T09:30:21.614-0400 [INFO]  core: attempting to join possible raft leader node: leader_addr=https://dev-vault01.coldstorage.com:8200

Nothing after that. And nothing at the “joinee” at all.

I’ve seen some stuff that suggests maybe the node won’t be allowed to rejoin for 72 hours, but nothing conclusive. I don’t think it’s been quite 24 hours since I booted it. But that would be based on node-id, wouldn’t it? (I changed the node ID and attempted to join again, and still nothing.)

The wording here suggests maybe I’ve somehow capped the number of servers in the cluster to 1, but I don’t see any controls around max cluster size: https://www.vaultproject.io/docs/concepts/integrated-storage#removing-peers

Removing the peer will ensure the cluster stays at the desired size, and that quorum is maintained.

Can anyone explain what’s going on? Have I done something irreparable? Any help is appreciated!

Hard to say. Please include versions and config files with posts like this… lots of changes over time.

The join happens over the API port (8200) but the intra-cluster traffic is over 8201. Can they all talk on 8201?
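
If it helps, here is a rough sketch for checking both ports from each node in one go. The function name check_vault_ports is made up for this example, and it assumes an nc that supports -z and -w. Note that a refused connection on 8201 may only mean Vault on the target is not running or is still sealed, so treat the result as a hint rather than proof of a firewall problem.

```shell
# Hypothetical helper (not part of Vault): probe the API port (8200)
# and the cluster port (8201) on a given host.
check_vault_ports() {
  cvp_host="$1"
  cvp_rc=0
  for cvp_port in 8200 8201; do
    # -z: only test the connection, -w 3: give up after 3 seconds
    if nc -z -w 3 "$cvp_host" "$cvp_port"; then
      echo "$cvp_host:$cvp_port reachable"
    else
      echo "$cvp_host:$cvp_port NOT reachable"
      cvp_rc=1
    fi
  done
  # Non-zero if either port could not be reached
  return "$cvp_rc"
}
```

Run it on each node against the other, for example check_vault_ports 10.1.1.198 from dev-vault01.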

As for timing, autopilot has some timing stuff but shouldn’t be an issue here.
You might try changing the node_id to a new name.

What is the output of:
vault operator raft list-peers
and
vault operator raft autopilot state

As mentioned, I did change the node-id.

Sorry about the missing info. Version is vault 1.8.2.

Config files:
joinee-vault.hcl.txt (696 Bytes)
joiner-vault.hcl.txt (867 Bytes)

dev-vault02:~$ nc -zv 10.1.1.197 8201
Connection to 10.1.1.197 8201 port [tcp/*] succeeded!

and

dev-vault01:~$ nc -zv 10.1.1.198 8201
nc: connect to 10.1.1.198 port 8201 (tcp) failed: Connection refused

$ vault operator raft list-peers
Node                  Address            State     Voter
----                  -------            -----     -----
iid-focal-vault-01    10.1.1.197:8201    leader    true

$ vault operator raft autopilot state
Healthy:                      true
Failure Tolerance:            0
Leader:                       iid-focal-vault-01
Voters:
   iid-focal-vault-01
Servers:
   iid-focal-vault-01
      Name:            iid-focal-vault-01
      Address:         10.1.1.197:8201
      Status:          leader
      Node Status:     alive
      Healthy:         true
      Last Contact:    0s
      Last Term:       212
      Last Index:      7646

And I see that maybe that’s my problem (the refused connection to 8201 on 10.1.1.198)! I cut and pasted without noticing the details. Taking a closer look…

Hmm. I shut down vault and threw up an nc -l 8201 on 10.1.1.198, and was able to connect, so it’s not blocked. Is there something in that joiner-vault.hcl that looks wrong to you, and would mean it’s refusing connection from 10.1.1.197?

Hmmm. I’ve checked the audit logs now, and I see the sys/storage/raft/bootstrap/challenge and response at the leader. But there’s no follow-up /v1/sys/storage/raft/bootstrap/answer ever sent.

So either the response is headed off somewhere before reaching the new node, or the new node hangs somewhere after receiving the response but before attempting to respond (without hitting any failure condition that would cause it to log a message of any kind… I’ve kicked my logging up to trace now)
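
For anyone wanting to repeat that check, a minimal sketch of how I counted the bootstrap steps in the leader’s audit log. The function name and the log location are assumptions for illustration (use wherever your file audit device writes), and it relies on the request path appearing in clear text in the log, as it did for me.

```shell
# Hypothetical helper: count audit log entries for each raft bootstrap step.
# $1 is the path to the leader's audit log file (assumed, adjust to taste).
bootstrap_steps_seen() {
  bss_log="$1"
  for bss_step in challenge answer; do
    # grep -c prints 0 but exits non-zero when nothing matches, hence "|| true"
    bss_n=$(grep -c "sys/storage/raft/bootstrap/$bss_step" "$bss_log" || true)
    echo "$bss_step: $bss_n"
  done
}
```

A non-zero challenge count with an answer count of 0 matches what I was seeing: the joiner asked for a challenge and never answered it.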

Okay… figured it out. After tracing the code and finding this: https://github.com/hashicorp/vault/blob/e79b35287b64b84732a6108f5ee8c7821df5e487/vault/raft.go#L901-L916

For some reason I didn’t think you needed to unseal each instance individually :man_facepalming: The first run through, I thought it got the unseal information automatically from the leader while it was initializing the storage. But apparently I was wrong.

It was just waiting for the unseal key threshold to be met. After unsealing directly against the new node, it joins up just fine.

Hmm. You should have unsealed only the leader/first node. The joining node should be neither initialized nor unsealed before it joins (though if you have auto-unseal set, it will keep trying until it joins). Is that what you experienced?

Per the operator raft command documentation (Vault by HashiCorp):

If raft is used for storage, the node must be joined before unsealing and the leader-api-addr argument must be provided.

And to clarify for later readers: you join on 8200, but nodes need to be able to talk to each other on both 8200 and 8201.

Right. I had to join the new node, and then unseal it. I wasn’t doing that second step.
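
For later readers, the order that worked for me can be sketched roughly like this. The wrapper name join_then_unseal is made up, and it assumes VAULT_ADDR points at the new node (not the leader) and that you are using Shamir unseal keys rather than auto-unseal.

```shell
# Hypothetical wrapper: join the new node first, then unseal it.
# Assumes VAULT_ADDR is set to the NEW node's API address.
join_then_unseal() {
  jtu_leader_api_addr="$1"   # e.g. https://dev-vault01.coldstorage.com:8200

  # Step 1: join via the leader's API address (port 8200).
  vault operator raft join "$jtu_leader_api_addr" || return 1

  # Step 2: unseal the new node itself. This prompts for one unseal key;
  # run it again until the key threshold is met.
  vault operator unseal || return 1
}
```

Once the threshold is met on the new node, it should show up in vault operator raft list-peers on the leader.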

I initially had a retry_join clause in my config (you can see it commented out in the config files), which repeatedly retried joining the cluster but never said it was waiting to be unsealed (which would be nice to add, I think). After removing that clause and doing the join with vault operator raft join ... , the only information was that it looked like it had joined, yet the node didn’t appear as a peer in the output of vault operator raft list-peers.
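
Since the join itself gives no hint, one way to spot this state is to ask the new node directly. A small sketch, relying on the documented exit codes of vault status (0 unsealed, 2 sealed, 1 error); the function name is made up, and it assumes VAULT_ADDR points at the node you are checking.

```shell
# Hypothetical helper: translate the exit code of "vault status"
# into a one-word seal state for the node at VAULT_ADDR.
report_seal_state() {
  vault status >/dev/null 2>&1 && rss_rc=0 || rss_rc=$?
  case "$rss_rc" in
    0) echo "unsealed" ;;
    2) echo "sealed" ;;
    *) echo "error (exit $rss_rc)" ;;
  esac
}
```

If this prints sealed on a node that has already been joined, the join is most likely just waiting for vault operator unseal to be run against that node.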