Vault node fails to join Raft cluster

I would like to get some insight into what is going on.
(Note: in the output below, our internal DNS domain has been replaced with example.com to protect the innocent.)

I have a 3-node Vault cluster using integrated Raft storage, on 3 distinct VMs, all configured in the same manner:

  • A single self-signed CA is used for Raft and stored on each node in /etc/vault.d/cert/raft/raft-ca.pem
  • Each node has a unique cert-file and key-file signed by the “raft-ca”
  • Each node has a cert and key for the listener “tcp” {} stanza.
  • That cert is signed by a corporate CA which is installed on the Linux node (/etc/ssl/certs…)
  • The cert has a CN of vault.service.example.com and a SAN for each node’s name
  • This cert/key is then placed in /etc/vault.d/cert/service/vault-service{-key,}.pem

Each node is listed in the “raft” stanza, with the leader_* CA/cert/key options set appropriately for each server.
(see below)

I had a working cluster but was in the process of changing host names, and all certs are created specifying host names. At one point I “removed” vault3 from the peers, then cleared out that peer’s data directory, updated all configs, and restarted the cluster. I end up constantly receiving the following errors on vault3:

Aug 23 12:17:12 vault3 vault[131267]: 2024-08-23T12:17:12.653Z [INFO] core: attempting to join possible raft leader node: leader_addr=https://vault1.node.example.com:8200
Aug 23 12:17:12 vault3 vault[131267]: 2024-08-23T12:17:12.653Z [INFO] core: attempting to join possible raft leader node: leader_addr=https://vault2.node.example.com:8200
Aug 23 12:17:12 vault3 vault[131267]: 2024-08-23T12:17:12.653Z [INFO] core: attempting to join possible raft leader node: leader_addr=https://vault3.node.example.com:8200
Aug 23 12:17:12 vault3 vault[131267]: 2024-08-23T12:17:12.657Z [ERROR] core: failed to get raft challenge: leader_addr=https://vault3.node.example.com:8200 error="error during raft bootstrap init call: Put \"https://vault3.node.example.com:8200/v1/sys/storage/raft/bootstrap/challenge\": tls: failed to verify certificate: x509: certificate signed by unknown authority"
Aug 23 12:17:12 vault3 vault[131267]: 2024-08-23T12:17:12.658Z [INFO] http: TLS handshake error from 10.10.1.39:52308: remote error: tls: bad certificate
Aug 23 12:17:12 vault3 vault[131267]: 2024-08-23T12:17:12.661Z [ERROR] core: failed to get raft challenge: leader_addr=https://vault1.node.example.com:8200 error="error during raft bootstrap init call: Put \"https://vault1.node.example.com:8200/v1/sys/storage/raft/bootstrap/challenge\": tls: failed to verify certificate: x509: certificate signed by unknown authority"
Aug 23 12:17:12 vault3 vault[131267]: 2024-08-23T12:17:12.661Z [ERROR] core: failed to get raft challenge: leader_addr=https://vault2.node.example.com:8200 error="error during raft bootstrap init call: Put \"https://vault2.node.example.com:8200/v1/sys/storage/raft/bootstrap/challenge\": tls: failed to verify certificate: x509: certificate signed by unknown authority"
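
To make the pattern in these lines easier to see, here is a small Python sketch that groups the [ERROR] entries by leader_addr and pulls out the x509 reason. The sample text is abbreviated from the log above (timestamps, PIDs, and the long URL inside error= are trimmed), and the function and variable names are made up for illustration.

```python
import re
from collections import defaultdict

# Abbreviated from the journal output above; "..." marks text trimmed for brevity.
SAMPLE_LOG = """\
[INFO] core: attempting to join possible raft leader node: leader_addr=https://vault1.node.example.com:8200
[ERROR] core: failed to get raft challenge: leader_addr=https://vault3.node.example.com:8200 error="... x509: certificate signed by unknown authority"
[ERROR] core: failed to get raft challenge: leader_addr=https://vault1.node.example.com:8200 error="... x509: certificate signed by unknown authority"
[ERROR] core: failed to get raft challenge: leader_addr=https://vault2.node.example.com:8200 error="... x509: certificate signed by unknown authority"
"""

# Capture the leader address and the raw error text of each [ERROR] line.
ERROR_LINE = re.compile(r"\[ERROR\].*?leader_addr=(\S+)\s+error=(.*)$")
# Capture the x509 reason up to the closing quote (straight or curly).
X509_REASON = re.compile(r"x509: ([^\"\u201c\u201d]+)")


def summarize_join_errors(log_text):
    """Map each leader_addr to the set of x509 failure reasons logged for it."""
    reasons = defaultdict(set)
    for line in log_text.splitlines():
        match = ERROR_LINE.search(line)
        if not match:
            continue  # skip [INFO] lines and anything without leader_addr/error
        addr, error_text = match.groups()
        reason = X509_REASON.search(error_text)
        reasons[addr].add(reason.group(1).strip() if reason else "other")
    return dict(reasons)


if __name__ == "__main__":
    for addr, why in sorted(summarize_join_errors(SAMPLE_LOG).items()):
        print(addr, "->", ", ".join(sorted(why)))
```

Run against the full journal, this shows whether every candidate leader (including vault3 itself) fails for the same reason or whether some fail differently.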

vault1 and vault2 log the expected corresponding errors for the failure on the remote side.

Here is what I know:

  • vault1 and vault2 are maintaining the cluster, with either 1 or 2 as leader. I can repeatedly restart vault1 or vault2 and the leader role will switch.
  • Accessing Vault from the browser UI works; I can create secrets, and they survive across restarts.
  • vault1, vault2, and vault3 have the same config file; the only differences are each node’s unique ID and its unique raft cert/key (the CA is common).

My understanding is that the first attempt to join Raft uses what I call the “service” address, which appears above as “leader_addr” on port 8200. That is the same address defined in the listener “tcp” {} stanza, so each node is using the same cert/key, with the cert carrying a SAN for each of the 3 nodes and signed by the corporate CA installed in the Linux host’s CA directory.

  • From every node (vault1, vault2, vault3) I can use openssl s_client and curl -v, and the certificate is accepted and verified. The md5sum of each cert and key is the same on all 3 nodes.
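
One caveat with these checks: openssl and curl validate against the system trust store unless told otherwise. The Python sketch below repeats the handshake against each listener twice, once with the system store and once trusting only the raft CA file. It assumes (not confirmed anywhere in this post) that the join request validates the leader's listener certificate against the CA named in retry_join (leader_ca_cert_file) rather than the system store, and that the CA file name uses a plain hyphen. The function name is illustrative.

```python
import socket
import ssl

# Path and host names are taken from the setup described above (assumed spelling: raft-ca.pem).
RAFT_CA = "/etc/vault.d/cert/raft/raft-ca.pem"
NODES = [
    "vault1.node.example.com",
    "vault2.node.example.com",
    "vault3.node.example.com",
]


def check_listener(host, port=8200, ca_file=None, timeout=5.0):
    """Try a TLS handshake with host:port; return (ok, detail).

    ca_file=None trusts the system store; otherwise only ca_file is trusted.
    """
    try:
        # With cafile set, create_default_context does not load the system store.
        context = ssl.create_default_context(cafile=ca_file)
        with socket.create_connection((host, port), timeout=timeout) as sock:
            with context.wrap_socket(sock, server_hostname=host) as tls:
                return True, tls.version()
    except ssl.SSLCertVerificationError as exc:
        return False, f"verify failed: {exc.verify_message}"
    except OSError as exc:  # refused connection, DNS failure, missing CA file, ...
        return False, f"{type(exc).__name__}: {exc}"


if __name__ == "__main__":
    for host in NODES:
        for label, ca_file in (("system store", None), ("raft CA only", RAFT_CA)):
            ok, detail = check_listener(host, ca_file=ca_file)
            print(f"{host} [{label}]: {'OK' if ok else 'FAIL'} ({detail})")
```

If the "system store" rows pass while the "raft CA only" rows fail verification, that would match the "certificate signed by unknown authority" message, and the CA configured for the join would be the thing to look at.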

Yet I still get the above error: vault3 attempts to join the Raft cluster by walking through each of the systems configured in the retry_join {} sub-stanza of the storage “raft” {} stanza, looking for the leader, and fails as shown in the log messages above.

It does not matter whether the leader is vault1 or vault2; the errors are the same. In addition, vault1 and vault2 include vault3 in their attempts to contact a leader when they are restarted.

Further, vault3 will not unseal: it returns the error “vault is not initialized”, which was unexpected, as (I thought) it should want to join the cluster per its config.
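
For comparing the three nodes, Vault's unauthenticated GET /v1/sys/init endpoint reports whether a node considers itself initialized. The Python sketch below queries it on each node; as far as I understand, a node that has not yet joined the cluster is expected to report false. The function names are illustrative, and ca_file=None assumes the corporate CA is in the system trust store as described above.

```python
import json
import ssl
import urllib.request

NODES = [
    "vault1.node.example.com",
    "vault2.node.example.com",
    "vault3.node.example.com",
]


def parse_init_status(body):
    """Extract the 'initialized' flag from a /v1/sys/init response body."""
    return bool(json.loads(body)["initialized"])


def fetch_init_status(host, port=8200, ca_file=None, timeout=5.0):
    """Ask one node whether it is initialized (ca_file=None uses the system store)."""
    context = ssl.create_default_context(cafile=ca_file)
    url = f"https://{host}:{port}/v1/sys/init"
    with urllib.request.urlopen(url, context=context, timeout=timeout) as resp:
        return parse_init_status(resp.read())


if __name__ == "__main__":
    for host in NODES:
        try:
            print(host, "initialized =", fetch_init_status(host))
        except OSError as exc:  # URLError and TLS errors are OSError subclasses
            print(host, "unreachable:", exc)
```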

I have completely removed /opt/vault/data and tried to just join as a new system.

I could just delete /opt/vault/data on all 3 nodes and start over, but given the above errors, and given that all 3 nodes are configured the same and were installed in the same manner (via an Ansible script), it’s unclear that will work, and I want to understand the problem.

Any insight would be helpful. Otherwise, later today I’ll probably wipe all Vault data/configs from the 3 nodes, uninstall Vault, reinstall, and start all over. But that feels like “just hope”, and I’d prefer to understand what went wrong.