Raft rejoin problem after recovery mode

webmutation · June 22, 2020, 5:36pm

After following https://learn.hashicorp.com/vault/operations/raft-storage-aws#resume-normal-operations, Vault_3 is listed as the only member of the cluster.

I stopped and then tried to force Vault_2 and Vault_4 to join the cluster, but they will not join Vault_3, the node where the recovery procedure was run.

Can you please clarify what needs to be done to return the cluster to its initial state of 3 nodes?

Cluster reset: When a node is brought up in recovery mode, it resets the list of cluster members. This means that when resuming normal operations, each node will need to rejoin the cluster.

I tried this but got the following error back from Vault_2 and Vault_4:

Error joining the node to the raft cluster: Error making API request.

URL: POST http://127.0.0.1:8200/v1/sys/storage/raft/join
Code: 500. Errors:

* raft storage is already initialized

webmutation · June 30, 2020, 4:47pm

To follow up on this: quorum is lost and has not been recovered.

According to the operations wiki, a manual recovery should be possible by creating a raft/peers.json file; however, the format of this file is not described.

I was able to find it for Nomad. I suppose it is the same?

Wolfsrudel · June 30, 2020, 7:25pm

If it’s the same as for Consul, too, I would say you are right.

For Raft protocol version 2 and earlier, this should be formatted as a JSON array containing the address and port of each Consul server in the cluster, like this:

["10.1.0.1:8300", "10.1.0.2:8300", "10.1.0.3:8300"]

For Raft protocol version 3 and later, this should be formatted as a JSON array containing the node ID, address:port, and suffrage information of each Consul server in the cluster, like this:

[
  {
    "id": "adf4238a-882b-9ddc-4a9d-5b6758e4159e",
    "address": "10.1.0.1:8300",
    "non_voter": false
  },
  {
    "id": "8b6dda82-3103-11e7-93ae-92361f002671",
    "address": "10.1.0.2:8300",
    "non_voter": false
  },
  {
    "id": "97e17742-3103-11e7-93ae-92361f002671",
    "address": "10.1.0.3:8300",
    "non_voter": false
  }
]
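If you want to generate that file instead of typing it by hand, here is a minimal Python sketch. It assumes Vault accepts the same shape as the Raft protocol version 3 format quoted above, which this thread has not confirmed. The node IDs and addresses are made up, and `build_peers` is just a hypothetical helper name. Use each node's real raft node ID and cluster address.

```python
import json


def build_peers(nodes):
    """Turn (node_id, "host:port") pairs into peers.json entries.

    Assumes the version 3 layout quoted above: id, address, non_voter.
    """
    peers = []
    for node_id, address in nodes:
        # Catch obviously broken entries before they end up in the file.
        if not node_id or ":" not in address:
            raise ValueError(f"bad peer entry: {node_id!r} {address!r}")
        peers.append({"id": node_id, "address": address, "non_voter": False})
    return peers


if __name__ == "__main__":
    # Hypothetical node IDs and addresses, for illustration only.
    nodes = [
        ("vault_2", "10.1.0.2:8201"),
        ("vault_3", "10.1.0.3:8201"),
        ("vault_4", "10.1.0.4:8201"),
    ]
    print(json.dumps(build_peers(nodes), indent=2))
```

Redirect the output into raft/peers.json under the node's storage path while that node is stopped.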

yhyakuna · July 8, 2020, 9:24pm

@webmutation - I added a step to the Vault HA Cluster with Integrated Storage tutorial, but was too lazy to add the same step to the Vault HA Cluster with Integrated Storage on AWS tutorial.

You need to clear out the old entries in the raft storage (/vault/vault_3) once vault_3 has been removed from the cluster, before it can successfully rejoin the cluster as a new member.

When vault_3 was removed from the cluster, it was disconnected from the leader, so it no longer contains up-to-date data. To join the cluster successfully, Vault expects vault_3’s raft storage to be empty, so that the leader can properly replicate the current data.
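A rough sketch of that cleanup in Python, in case it is useful. Moving the old files into a backup folder instead of deleting them is my own precaution, not something the tutorial requires. `set_aside_raft_data` is a hypothetical helper, and the paths are only examples. Stop the Vault process on that node first, and keep the backup folder outside the raft directory.

```python
import shutil
import time
from pathlib import Path


def set_aside_raft_data(raft_dir, backup_root):
    """Move everything inside raft_dir into a new backup folder.

    Leaves raft_dir itself in place but empty, and returns the backup path.
    """
    raft_dir = Path(raft_dir)
    stamp = time.strftime("%Y%m%d-%H%M%S")
    backup = Path(backup_root) / f"{raft_dir.name}-{stamp}"
    # Fail loudly rather than mixing two backups in one folder.
    backup.mkdir(parents=True, exist_ok=False)
    # Snapshot the listing first, since we modify the directory while looping.
    for entry in list(raft_dir.iterdir()):
        shutil.move(str(entry), str(backup / entry.name))
    return backup


if __name__ == "__main__":
    # Example paths only; adjust to your own storage path.
    print(set_aside_raft_data("/vault/vault_3", "/vault/backup"))
```

Once the directory is empty, start Vault on that node again and run `vault operator raft join` against the leader's API address.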

Hope this helps.

martinhristov90 · July 10, 2020, 1:07pm

Hello,

I have created this guide to solve similar issues; it might be helpful in your case.

Martin

Wolfsrudel · July 10, 2020, 2:20pm

Uh, I don’t know the Help Center at all.
