I’m working through an upgrade from a single-node, filesystem-backed Vault to a 3-node Raft-backed cluster. Currently I’m just doing due diligence and stepping through things in an isolated lab.
After migrating my first test host, I added in a second node, joining it manually to the cluster.
I then kicked the node out using vault operator raft remove-peer, and then rebuilt the whole machine using a script. (This left the node ID and the MAC address of the NIC the same.)
Upon rebuild I attempted to manually join the second node to the cluster. It fails to (re)join the cluster, and I can’t figure out why or how to correct the problem.
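The sequence described above corresponds roughly to the commands below. This is a sketch rather than a transcript: the leader address is taken from the log line further down, while the NODE_ID value is a hypothetical placeholder for whatever node_id the second node's raft storage stanza uses. The run helper is also just an illustration device; it prints each command by default so the sketch can be read or executed without touching a cluster.

```shell
#!/bin/sh
# Dry-run by default: print each command instead of executing it.
# Set RUN=1 to actually run the commands against a lab cluster.
run() {
  if [ "${RUN:-0}" = "1" ]; then
    "$@"
  else
    echo "+ $*"
  fi
}

LEADER_ADDR="https://dev-vault01.coldstorage.com:8200"  # from the joiner's log line
NODE_ID="dev-vault02"  # ASSUMPTION: placeholder for the second node's raft node_id

# On the first (unsealed, active) node: check membership, then remove the peer.
run vault operator raft list-peers
run vault operator raft remove-peer "$NODE_ID"

# On the rebuilt second node: attempt to join the existing cluster.
run vault operator raft join "$LEADER_ADDR"

# Back on the first node: confirm whether the peer actually appears.
run vault operator raft list-peers
```

Running list-peers before and after makes it easier to tell whether the join request ever reached the first node.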
I’ve verified connectivity between the two nodes (each can ping the other, and each can connect to port 8200 via nc).
I’ve enabled debug logging on the running, unsealed first node, and on the one that won’t connect. Neither logs anything useful. The final message on the “joiner” is:
Sep 16 13:30:21 dev-vault02 vault[88067]: 2021-09-16T09:30:21.614-0400 [INFO] core: attempting to join possible raft leader node: leader_addr=https://dev-vault01.coldstorage.com:8200
Nothing after that, and nothing on the “joinee” at all.
I’ve seen some stuff that suggests maybe the node won’t be allowed to rejoin for 72 hours, but nothing conclusive. I don’t think it’s been quite 24 hours since I booted it. But that would be based on node-id, wouldn’t it? (I changed the node ID and attempted to join again, and still nothing.)
The wording in the docs suggests maybe I’ve somehow capped the number of servers in the cluster at 1, but I don’t see any controls around max cluster size: https://www.vaultproject.io/docs/concepts/integrated-storage#removing-peers
“Removing the peer will ensure the cluster stays at the desired size, and that quorum is maintained.”
Can anyone explain what’s going on? Have I done something irreparable? Any help is appreciated!