Raft rejoin problem after recovery mode

Under https://learn.hashicorp.com/vault/operations/raft-storage-aws#resume-normal-operations, Vault_3 is listed as the only member of the cluster.

I stopped and tried to force Vault_2 and Vault_4 to join the cluster, but they will not join Vault_3, the node where the recovery procedure was run.

Can you please clarify what needs to be done to return the cluster to its initial state of 3 nodes?

Cluster reset: When a node is brought up in recovery mode, it resets the list of cluster members. This means that when resuming normal operations, each node will need to rejoin the cluster.

I tried this but got the following errors back from Vault_2 and Vault_4:

Error joining the node to the raft cluster: Error making API request.

URL: POST http://127.0.0.1:8200/v1/sys/storage/raft/join
Code: 500. Errors:

* raft storage is already initialized

To follow up on this: quorum is lost and has not been recovered.

According to the operations wiki, a manual recovery should be possible by creating a raft/peers.json file; however, the format of this file is not described.

I was able to find it for Nomad; I suppose it is the same?

If it’s the same for Consul, too, I would suggest you are right.

For Raft protocol version 2 and earlier, this should be formatted as a JSON array containing the address and port of each Consul server in the cluster, like this:

["10.1.0.1:8300", "10.1.0.2:8300", "10.1.0.3:8300"]

For Raft protocol version 3 and later, this should be formatted as a JSON array containing the node ID, address:port, and suffrage information of each Consul server in the cluster, like this:

[
  {
    "id": "adf4238a-882b-9ddc-4a9d-5b6758e4159e",
    "address": "10.1.0.1:8300",
    "non_voter": false
  },
  {
    "id": "8b6dda82-3103-11e7-93ae-92361f002671",
    "address": "10.1.0.2:8300",
    "non_voter": false
  },
  {
    "id": "97e17742-3103-11e7-93ae-92361f002671",
    "address": "10.1.0.3:8300",
    "non_voter": false
  }
]
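
In case it helps, here is a minimal Python sketch that generates a file in the newer (Raft protocol version 3) format quoted above. The function name `build_peers` is made up for illustration, and the IDs and addresses are just the sample values from the Consul documentation. Whether Vault accepts exactly this layout is the assumption being discussed in this thread, so please check it against the Vault documentation before relying on it.

```python
import json


def build_peers(nodes):
    """Return peers.json text for a list of (node_id, address) pairs.

    Layout follows the Consul example quoted above (id, address, non_voter).
    Assumption: Vault's raft/peers.json uses the same keys.
    """
    entries = [
        {"id": node_id, "address": address, "non_voter": False}
        for node_id, address in nodes
    ]
    return json.dumps(entries, indent=2)


# Sample values copied from the Consul documentation, not from a real cluster.
peers_json = build_peers([
    ("adf4238a-882b-9ddc-4a9d-5b6758e4159e", "10.1.0.1:8300"),
    ("8b6dda82-3103-11e7-93ae-92361f002671", "10.1.0.2:8300"),
    ("97e17742-3103-11e7-93ae-92361f002671", "10.1.0.3:8300"),
])
print(peers_json)
```

The output can be written to raft/peers.json inside the node's storage directory while the node is stopped. Every node listed is treated as a voter here.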

@webmutation - I added a step to the Vault HA Cluster with Integrated Storage tutorial but was too lazy to add the same step to the Vault HA Cluster with Integrated Storage on AWS tutorial.

You need to clear out the old entries in the raft storage (/vault/vault_3) once vault_3 has been removed from the cluster, before it can successfully rejoin the cluster as a new member.

When vault_3 was removed from the cluster, it was disconnected from the leader; therefore, it no longer contains up-to-date data. For vault_3 to join the cluster successfully, Vault expects its raft storage to be empty so that the leader can properly replicate the current data.
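
The clean-up described above could be sketched roughly like this in Python. The helper name `reset_raft_dir` is hypothetical, and moving the old data aside instead of deleting it is my own precaution, not something the tutorial prescribes. It assumes the Vault process on that node is stopped first.

```python
import shutil
import time
from pathlib import Path


def reset_raft_dir(raft_path):
    """Move the old raft data aside and leave an empty directory in its place.

    Returns the backup path, or None if there was nothing to move.
    Assumption: Vault is not running on this node while this is called.
    """
    raft_dir = Path(raft_path)
    backup = None
    if raft_dir.exists():
        # Keep the stale data next to the original directory as a timestamped backup.
        backup = raft_dir.with_name(f"{raft_dir.name}.bak-{int(time.time())}")
        shutil.move(str(raft_dir), str(backup))
    # Recreate the directory empty so the leader can replicate current data into it.
    raft_dir.mkdir(parents=True)
    return backup


# Example call (path taken from the post above):
# reset_raft_dir("/vault/vault_3")
```

After that, start the node again and rejoin it with `vault operator raft join` pointed at the active node's API address.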

Hope this helps.


Hello,

I have created this guide to solve similar issues; it might be helpful in your case.

Martin


Uh, I don’t know the Help Center at all.