Hello everyone, I am testing the Vault cluster failover process and I am running into a strange issue.
I have 3 cluster nodes and 1 standalone node that I use for auto-unseal.
My environment runs on RHEL VMs.
The Vault config below is the same on all 3 nodes; the only differences are the values of
node_id, api_addr and cluster_addr, which point to each particular node (a Node 2 example follows the config).
disable_mlock = true

storage "raft" {
  path    = "/home/MyUser/vault/data"
  node_id = "vault-server-1"

  retry_join {
    leader_tls_servername   = "vault-server-1"
    leader_api_addr         = "https://vault-server-1:8200"
    leader_ca_cert_file     = "/home/MyUser/etc/vault-server-1.chain.pem"
    leader_client_cert_file = "/home/MyUser/etc/vault-server-1.crt"
    leader_client_key_file  = "/home/MyUser/etc/vault-server-1.private.key"
  }
  retry_join {
    leader_tls_servername   = "vault-server-2"
    leader_api_addr         = "https://vault-server-2:8200"
    leader_ca_cert_file     = "/home/MyUser/etc/vault-server-1.chain.pem"
    leader_client_cert_file = "/home/MyUser/etc/vault-server-1.crt"
    leader_client_key_file  = "/home/MyUser/etc/vault-server-1.private.key"
  }
  retry_join {
    leader_tls_servername   = "vault-server-3"
    leader_api_addr         = "https://vault-server-3:8200"
    leader_ca_cert_file     = "/home/MyUser/etc/vault-server-1.chain.pem"
    leader_client_cert_file = "/home/MyUser/etc/vault-server-1.crt"
    leader_client_key_file  = "/home/MyUser/etc/vault-server-1.private.key"
  }
}

seal "transit" {
  address    = "http://vault-server-0:8200"
  token      = "s.XXXXXXXXXXXXXXXX"
  key_name   = "unseal-key"
  mount_path = "transit"
}

listener "tcp" {
  address       = "0.0.0.0:8200"
  tls_cert_file = "/home/MyUser/etc/vault-server-1.crt"
  tls_key_file  = "/home/MyUser/etc/vault-server-1.private.key"
}

api_addr     = "https://vault-server-1:8200"
cluster_addr = "https://vault-server-1:8201"
ui           = true
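For example, on Node 2 the per-node values are (Node 3 is analogous; everything else is identical):

node_id      = "vault-server-2"
api_addr     = "https://vault-server-2:8200"
cluster_addr = "https://vault-server-2:8201"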
So I have my 3-node cluster up and running with raft local storage; all good, it syncs secrets across the nodes.
Now I want to test the scenario where one or two nodes in the cluster become unavailable.
First I shut down the leader node, let's say Node 1; leadership then passes to Node 2, and Node 3 goes into standby mode.
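For reference, this is how I take a node down and where I run the subsequent checks (a minimal sketch; the systemd unit name vault is an assumption, since I haven't shown my service setup):

# on Node 1 (assumed systemd unit name)
> sudo systemctl stop vault

# all subsequent CLI checks run against a surviving node
> export VAULT_ADDR="https://vault-server-2:8200"
> vault status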
> vault status
Key                      Value
---                      -----
Recovery Seal Type       shamir
Initialized              true
Sealed                   false
Total Recovery Shares    5
Threshold                3
Version                  1.9.3
Storage Type             raft
Cluster Name             vault-cluster-e11d29d8
Cluster ID               efed540d-9ae5-a071-7a52-8679847446cb
HA Enabled               true
HA Cluster               https://vault-server-2:8201
HA Mode                  active
Active Since             2022-04-05T15:47:38.781226601Z
Raft Committed Index     553
Raft Applied Index       553
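Judging by the raft errors in the log further down, the offline node still counts as a voter. I verify the membership like this (a sketch; it assumes my token is allowed to read the raft configuration):

> vault operator raft list-peers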
In this situation I am still able to get the secrets from Node 2 and Node 3; no issue there.
> vault kv get Test/MySecret/
======= Metadata =======
Key                Value
---                -----
created_time       2022-03-31T18:12:43.686143075Z
custom_metadata    <nil>
deletion_time      n/a
destroyed          false
version            3

======== Data ========
Key      Value
---      -----
USER1    Password1000
USER2    Password1000
USER3    Password1000
Then I shut down Node 3, the standby node, leaving only Node 2, the active leader.
And in this situation, when I try to get the secrets from the CLI, I get this strange response:
> vault kv get Test/MySecret
nil response from pre-flight request
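For what it's worth, the equivalent raw API request (a sketch; it assumes a KV v2 engine mounted at Test, which matches the Metadata/Data sections above) would be:

> curl --cacert /home/MyUser/etc/vault-server-1.chain.pem \
      --header "X-Vault-Token: $VAULT_TOKEN" \
      https://vault-server-2:8200/v1/Test/data/MySecret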
And this is what vault status shows me:
> vault status
Key                      Value
---                      -----
Recovery Seal Type       shamir
Initialized              true
Sealed                   false
Total Recovery Shares    5
Threshold                3
Version                  1.9.3
Storage Type             raft
Cluster Name             vault-cluster-e11d29d8
Cluster ID               efed540d-9ae5-a071-7a52-8679847446cb
HA Enabled               true
HA Cluster               https://vault-server-2:8201
HA Mode                  standby
Active Node Address      https://vault-server-2:8200
Raft Committed Index     553
Raft Applied Index       553
For some reason it becomes standby?
When I bring any of the other Vault services back up on the other nodes, Node 2 becomes active again and the get command retrieves the secrets just fine.
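Bringing a peer back is just the reverse of the shutdown above (same assumption about the systemd unit name):

# on Node 3, for example
> sudo systemctl start vault

# then, checked against Node 2, HA Mode flips back to active
> vault status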
With 2 nodes down, I see the following in the Vault log:
Apr 5 18:12:06 vault-server-2 vault: 2022-04-05T18:12:06.361+0200 [INFO] storage.raft: entering candidate state: node="Node at vault-server-2:8201 [Candidate]" term=31579
Apr 5 18:12:06 vault-server-2 vault: 2022-04-05T18:12:06.365+0200 [ERROR] storage.raft: failed to make requestVote RPC: target="{Voter vault-server-1 vault-server-1:8201}" error="dial tcp 22.241.188.149:8201: connect: connection refused"
Apr 5 18:12:06 vault-server-2 vault: 2022-04-05T18:12:06.368+0200 [ERROR] storage.raft: failed to make requestVote RPC: target="{Voter vault-server-3 vault-server-3:8201}" error="dial tcp 22.241.115.227:8201: connect: connection refused"
Apr 5 18:12:11 vault-server-2 vault: 2022-04-05T18:12:11.640+0200 [WARN] storage.raft: Election timeout reached, restarting election
Can anyone advise why the node that was active becomes standby when the other 2 nodes go down, and what this nil response means?
I can't find any explanation of this on Google or in the forums.
Thanks in advance.