I have a 2-node Vault cluster that works properly:
root@e21fea7c1f99:# vault operator raft list-peers
Node                                    Address             State       Voter
----                                    -------             -----       -----
cf0dc62e-9193-bb64-19a4-5643b6f19517    172.16.0.38:8201    leader      true
8219a3b2-f140-46f4-992d-5a4cf0acf791    172.16.0.43:8201    follower    true
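For context, the second node was added with the standard raft join flow; a minimal sketch (the listener/TLS details of my real config are omitted):

# run on the second node, pointing at the first node's API address
vault operator raft join http://172.16.0.38:8200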
Once I kill the follower peer, the leader steps down:
2023-07-25T22:56:02.045Z [ERROR] storage.raft: failed to heartbeat to: peer=172.16.0.43:8201 backoff time=10ms error="dial tcp 172.16.0.43:8201: connect: connection refused"
2023-07-25T22:56:02.285Z [DEBUG] core.cluster-listener: creating rpc dialer: address=172.16.0.43:8201 alpn=raft_storage_v1 host=raft-82a6ee8c-995f-f593-02a9-7834fdd9478c
2023-07-25T22:56:02.286Z [ERROR] storage.raft: failed to appendEntries to: peer="{Voter 8219a3b2-f140-46f4-992d-5a4cf0acf791 172.16.0.43:8201}" error="dial tcp 172.16.0.43:8201: connect: connection refused"
2023-07-25T22:56:02.627Z [DEBUG] core.cluster-listener: creating rpc dialer: address=172.16.0.43:8201 alpn=raft_storage_v1 host=raft-82a6ee8c-995f-f593-02a9-7834fdd9478c
2023-07-25T22:56:02.627Z [ERROR] storage.raft: failed to heartbeat to: peer=172.16.0.43:8201 backoff time=20ms error="dial tcp 172.16.0.43:8201: connect: connection refused"
2023-07-25T22:56:02.864Z [WARN] storage.raft: failed to contact: server-id=8219a3b2-f140-46f4-992d-5a4cf0acf791 time=2.500850963s
2023-07-25T22:56:02.864Z [WARN] storage.raft: failed to contact quorum of nodes, stepping down
2023-07-25T22:56:02.864Z [INFO] storage.raft: entering follower state: follower="Node at 172.16.0.38:8201 [Follower]" leader-address= leader-id=
2023-07-25T22:56:02.865Z [WARN] core: leadership lost, stopping active operation
2023-07-25T22:56:02.865Z [INFO] core: pre-seal teardown starting
2023-07-25T22:56:02.865Z [DEBUG] storage.raft.autopilot: state update routine is now stopped
2023-07-25T22:56:02.865Z [DEBUG] storage.raft.autopilot: autopilot is now stopped
2023-07-25T22:56:03.365Z [INFO] core: stopping raft active node
2023-07-25T22:56:03.365Z [DEBUG] expiration: stop triggered
2023-07-25T22:56:03.365Z [TRACE] expiration.job-manager: terminating job manager...
2023-07-25T22:56:03.365Z [TRACE] expiration.job-manager: terminating dispatcher
2023-07-25T22:56:03.365Z [DEBUG] expiration: finished stopping
2023-07-25T22:56:03.366Z [INFO] rollback: stopping rollback manager
2023-07-25T22:56:03.366Z [INFO] core: pre-seal teardown complete
2023-07-25T22:56:03.366Z [ERROR] core: clearing leader advertisement failed: error="node is not the leader"
2023-07-25T22:56:03.366Z [ERROR] core: unlocking HA lock failed: error="node is not the leader"
2023-07-25T22:56:03.366Z [TRACE] core: found new active node information, refreshing
2023-07-25T22:56:03.404Z [DEBUG] core.cluster-listener: creating rpc dialer: address=172.16.0.43:8201 alpn=raft_storage_v1 host=raft-82a6ee8c-995f-f593-02a9-7834fdd9478c
2023-07-25T22:56:03.405Z [ERROR] storage.raft: failed to heartbeat to: peer=172.16.0.43:8201 backoff time=40ms error="dial tcp 172.16.0.43:8201: connect: connection refused"
2023-07-25T22:56:03.647Z [DEBUG] core.cluster-listener: creating rpc dialer: address=172.16.0.43:8201 alpn=raft_storage_v1 host=raft-82a6ee8c-995f-f593-02a9-7834fdd9478c
2023-07-25T22:56:03.648Z [ERROR] storage.raft: failed to appendEntries to: peer="{Voter 8219a3b2-f140-46f4-992d-5a4cf0acf791 172.16.0.43:8201}" error="dial tcp 172.16.0.43:8201: connect: connection refused"
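At this point the old leader treats itself as a standby. vault status, which only talks to the local node, should still respond and reflect that:

# vault status needs no active node; with HA enabled it reports the
# local node's role, which should now read standby instead of active
vault status | grep -i 'HA Mode'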
Now every cluster API call on the leader fails:
root@e21fea7c1f99:# vault operator members
Error making API request.
URL: GET http://172.16.0.38:8200/v1/sys/ha-status
Code: 500. Errors:
* local node not active but active cluster node not found
root@e21fea7c1f99:# vault operator raft list-peers
Error reading the raft cluster configuration: Error making API request.
URL: GET http://172.16.0.38:8200/v1/sys/storage/raft/configuration
Code: 500. Errors:
* local node not active but active cluster node not found
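The local health endpoint still answers, since it does not require an active node; by default it returns 200 for the active node, 429 for a standby, and 503 for a sealed node, so it can confirm the node has dropped to standby:

# print only the HTTP status code; expect 429 (standby) instead of 200 (active)
curl -s -o /dev/null -w '%{http_code}\n' http://172.16.0.38:8200/v1/sys/health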
Meanwhile, the log keeps repeating the following messages:
2023-07-25T22:59:20.573Z [TRACE] core: found new active node information, refreshing
2023-07-25T22:59:23.073Z [TRACE] core: found new active node information, refreshing
2023-07-25T22:59:25.574Z [TRACE] core: found new active node information, refreshing
2023-07-25T22:59:26.788Z [WARN] storage.raft: Election timeout reached, restarting election
2023-07-25T22:59:26.788Z [INFO] storage.raft: entering candidate state: node="Node at 172.16.0.38:8201 [Candidate]" term=325
2023-07-25T22:59:26.791Z [DEBUG] storage.raft: voting for self: term=325 id=cf0dc62e-9193-bb64-19a4-5643b6f19517
2023-07-25T22:59:26.795Z [DEBUG] storage.raft: asking for vote: term=325 from=8219a3b2-f140-46f4-992d-5a4cf0acf791 address=172.16.0.43:8201
2023-07-25T22:59:26.795Z [DEBUG] storage.raft: calculated votes needed: needed=2 term=325
2023-07-25T22:59:26.795Z [DEBUG] storage.raft: vote granted: from=cf0dc62e-9193-bb64-19a4-5643b6f19517 term=325 tally=1
2023-07-25T22:59:26.795Z [DEBUG] core.cluster-listener: creating rpc dialer: address=172.16.0.43:8201 alpn=raft_storage_v1 host=raft-82a6ee8c-995f-f593-02a9-7834fdd9478c
2023-07-25T22:59:26.795Z [ERROR] storage.raft: failed to make requestVote RPC: target="{Voter 8219a3b2-f140-46f4-992d-5a4cf0acf791 172.16.0.43:8201}" error="dial tcp 172.16.0.43:8201: connect: connection refused" term=325
2023-07-25T22:59:28.074Z [TRACE] core: found new active node information, refreshing
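The "calculated votes needed: needed=2" line matches Raft's quorum formula, floor(voters/2) + 1; a quick sanity check of the arithmetic:

# quorum = floor(voters/2) + 1 (shell arithmetic is integer division)
echo $(( 2/2 + 1 ))   # 2 voters -> quorum 2, so zero failures tolerated
echo $(( 3/2 + 1 ))   # 3 voters -> quorum 2, so one failure tolerated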
I don't understand:
- Why is a new election needed when it was the follower that died?
- Why do the leader's cluster APIs stop working?
I have looked through many previous issues, but none matched mine. I'd appreciate your help.