I have a HashiCorp Vault test cluster of 3 nodes, and I have shut down 2 of them to simulate quorum loss. The Vault instances are running in Docker (Podman) containers.
I recovered the remaining node with a peers.json file.
Now I have started the other 2 nodes again, and I want to join them back to the cluster with this peers.json:
[
  {
    "id": "vlt902",
    "address": "ip-of-server:8201"
  },
  {
    "id": "vlt903",
    "address": "ip-of-server:8201"
  }
]
When I use the FQDN, I get the error "too many colons in address", so I use the IP address instead.
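For reference, this is roughly how I put the file in place before restarting a node; the /vault/data/raft path and the container name are assumptions based on my volume mount, and the non_voter field is what I understood the docs to describe for raft protocol version 3:

# sketch, not my exact commands: drop peers.json into the raft data directory
# (adjust /vault/data/raft and the container name to your own setup)
cat > /vault/data/raft/peers.json <<'EOF'
[
  { "id": "vlt902", "address": "ip-of-server:8201", "non_voter": false },
  { "id": "vlt903", "address": "ip-of-server:8201", "non_voter": false }
]
EOF
podman restart vault-vlt902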
But when I restart the container, I get the following error:
==> Vault server configuration:
Administrative Namespace:
Api Address: FQDN:8200
Cgo: disabled
Cluster Address: FQDN:8201
Environment Variables: HOME, HOSTNAME, NAME, PATH, VAULT_ADDR, VAULT_CACERT, VERSION, container
Go Version: go1.23.6
Listener 1: tcp (addr: "0.0.0.0:8200", cluster address: "0.0.0.0:8201", disable_request_limiter: "false", max_request_duration: "1m30s", max_request_size: "33554432", tls: "enabled")
Log Level: info
Mlock: supported: true, enabled: false
Recovery Mode: false
Storage: raft (HA available)
Version: Vault v1.19.0, built 2025-03-04T12:36:40Z
Version Sha: 7eeafb6160d60ede73c1d95566b0c8ea54f3cb5a
==> Vault server started! Log data will stream in below:
2025-04-01T12:58:20.402Z [INFO] proxy environment: http_proxy="" https_proxy="" no_proxy=""
2025-04-01T12:58:20.402Z [WARN] storage.raft.fsm: raft FSM db file has wider permissions than needed: needed=-rw------- existing=-rwxrwxrwx
2025-04-01T12:58:20.405Z [INFO] incrementing seal generation: generation=1
2025-04-01T12:58:20.405Z [INFO] core: Initializing version history cache for core
2025-04-01T12:58:20.405Z [INFO] events: Starting event system
2025-04-01T12:58:20.407Z [INFO] core: raft retry join initiated
2025-04-01T12:58:42.624Z [INFO] core.cluster-listener.tcp: starting listener: listener_address=0.0.0.0:8201
2025-04-01T12:58:42.625Z [INFO] core.cluster-listener: serving cluster requests: cluster_listen_address=[::]:8201
2025-04-01T12:58:42.625Z [INFO] storage.raft: raft recovery initiated: recovery_file=peers.json
2025-04-01T12:58:42.625Z [INFO] storage.raft: raft recovery found new config: config="{[{Voter vlt902 10.45.121.83:8201} {Voter vlt903 10.45.121.84:8201}]}"
2025-04-01T12:58:42.626Z [INFO] storage.raft: snapshot restore progress: id=bolt-snapshot last-index=254027 last-term=312 size-in-bytes=0 read-bytes=0 percent-complete="NaN%"
2025-04-01T12:58:42.628Z [INFO] storage.raft: raft recovery deleted peers.json
2025-04-01T12:58:42.628Z [INFO] storage.raft: creating Raft: config="&raft.Config{ProtocolVersion:3, HeartbeatTimeout:15000000000, ElectionTimeout:15000000000, CommitTimeout:50000000, MaxAppendEntries:64, BatchApplyCh:true, ShutdownOnRemove:true, TrailingLogs:0x2800, SnapshotInterval:120000000000, SnapshotThreshold:0x2000, LeaderLeaseTimeout:2500000000, LocalID:"vlt902", NotifyCh:(chan<- bool)(0xc003592080), LogOutput:io.Writer(nil), LogLevel:"DEBUG", Logger:(*hclog.interceptLogger)(0xc003373380), NoSnapshotRestoreOnStart:true, PreVoteDisabled:false, skipStartup:false}"
2025-04-01T12:58:42.629Z [INFO] storage.raft: initial configuration: index=1 servers="[{Suffrage:Voter ID:vlt902 Address:10.45.121.83:8201} {Suffrage:Voter ID:vlt903 Address:10.45.121.84:8201}]"
2025-04-01T12:58:42.629Z [INFO] core: vault is unsealed
2025-04-01T12:58:42.629Z [INFO] storage.raft: entering follower state: follower="Node at FQDN:8201 [Follower]" leader-address= leader-id=
2025-04-01T12:58:42.629Z [INFO] core: entering standby mode
2025-04-01T12:58:57.658Z [WARN] storage.raft: heartbeat timeout reached, starting election: last-leader-addr= last-leader-id=
2025-04-01T12:58:57.659Z [INFO] storage.raft: entering candidate state: node="Node at FQDN:8201 [Candidate]" term=315
2025-04-01T12:58:57.845Z [ERROR] storage.raft: failed to make requestVote RPC: target="{Voter vlt903 10.45.121.84:8201}" error="read tcp 10.89.0.16:59564->10.45.121.84:8201: read: connection reset by peer" term=315
2025-04-01T12:59:07.190Z [WARN] storage.raft: Election timeout reached, restarting election
2025-04-01T12:59:07.190Z [INFO] storage.raft: entering candidate state: node="Node at FQDN:8201 [Candidate]" term=315
2025-04-01T12:59:07.374Z [ERROR] storage.raft: failed to make requestVote RPC: target="{Voter vlt903 10.45.121.84:8201}" error="read tcp 10.89.0.16:43784->10.45.121.84:8201: read: connection reset by peer" term=315
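One thing I still want to rule out is basic TCP reachability of the peer's cluster port from inside the container itself; a minimal sketch, assuming the container is named vault-vlt902 and that nc exists in the image:

# raw TCP connect from inside the vlt902 container to vlt903's cluster port
podman exec vault-vlt902 nc -vz 10.45.121.84 8201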
The ports are open in firewall-cmd (output below), so what am I missing here?
public (active)
  target: default
  icmp-block-inversion: no
  interfaces: ens33
  sources:
  services: cockpit dhcpv6-client ssh
  ports: 8200/tcp 8201/tcp
  protocols:
  forward: yes
  masquerade: no
  forward-ports:
  source-ports:
  icmp-blocks:
  rich rules:
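The other thing I plan to double-check is the Podman side of the networking, i.e. whether 8201 is actually published and listening on the host; a sketch with placeholder container names:

# confirm the cluster port is published from the container to the host
podman port vault-vlt903
# confirm something is listening on 8201 on the host side
ss -tlnp | grep 8201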