Vault 1.9 Raft Cluster issue - leader down and follower shows failed to make requestVote RPC

I have a Vault 1.9 Raft cluster, two nodes.

vault01 status:

Key                      Value
---                      -----
Recovery Seal Type       shamir
Initialized              true
Sealed                   false
Total Recovery Shares    1
Threshold                1
Version                  1.9.0
Storage Type             raft
Cluster Name             vault-cluster-50898c55
Cluster ID               248e73f2-0a23-e66d-9070-0eb7a5cb49b5
HA Enabled               true
HA Cluster               https://vault01:8201
HA Mode                  active
Active Since             2022-01-04T19:19:09.496657694Z
Raft Committed Index     129
Raft Applied Index       129

vault02 status:
Key                      Value
---                      -----
Recovery Seal Type       shamir
Initialized              true
Sealed                   false
Total Recovery Shares    1
Threshold                1
Version                  1.9.0
Storage Type             raft
Cluster Name             vault-cluster-50898c55
Cluster ID               248e73f2-0a23-e66d-9070-0eb7a5cb49b5
HA Enabled               true
HA Cluster               https://vault01:8201
HA Mode                  standby
Active Node Address      https://vault01:8200
Raft Committed Index     130
Raft Applied Index       129

To simulate a failure, I stopped the Vault service on vault01.

vault01 status:
Error checking seal status: Get "https://127.0.0.1:8200/v1/sys/seal-status": dial tcp 127.0.0.1:8200: connect: connection refused

vault02 status:
Key                      Value
---                      -----
Recovery Seal Type       shamir
Initialized              true
Sealed                   false
Total Recovery Shares    1
Threshold                1
Version                  1.9.0
Storage Type             raft
Cluster Name             vault-cluster-50898c55
Cluster ID               248e73f2-0a23-e66d-9070-0eb7a5cb49b5
HA Enabled               true
HA Cluster               https://vault01:8201
HA Mode                  standby
Active Node Address      https://vault01:8200
Raft Committed Index     130
Raft Applied Index       129

The UI on vault02 shows a message:
This is a standby Vault node but can’t communicate with the active node via request forwarding. Sign in at the active node to use the Vault UI.

The vault02 journalctl logs show:
[WARN] storage.raft: Election timeout reached, restarting election
[INFO] storage.raft: entering candidate state: node="Node at vault02:8201 [Candidate]" term=395
[ERROR] storage.raft: failed to make requestVote RPC: target="{Voter vault01 https://127.0.0.1:8201}" error="dial tcp: address https://127.0.0.1:8201: too many colons in address"

“too many colons in address” looks like the error Go’s net.SplitHostPort returns when the raft module tries to split the cluster address into host and port (net/ipsock.go, around line 196 in the Go source).
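A quick Go sketch reproduces the same error outside Vault; net.Dial does the same host/port split that raft’s transport ends up doing:

package main

import (
	"fmt"
	"net"
)

func main() {
	// Dial expects a plain "host:port"; the scheme's extra colon breaks
	// the host/port split and produces the exact error from the raft logs.
	_, err := net.Dial("tcp", "https://127.0.0.1:8201")
	fmt.Println(err)
	// dial tcp: address https://127.0.0.1:8201: too many colons in address
}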
Tried wrapping the addresses in [ ] in the HCL config file (the way you would for an IPv6 literal), but got the same issue.
Anyone had this before?
Thanks.

I believe you need at least three nodes in a Raft cluster for HA to work properly. Odd numbers are preferred to maintain quorum: quorum is a majority, floor(N/2) + 1, so a two-node cluster needs both nodes up and losing either one loses quorum, while a three-node cluster can lose one node and still hold a majority.

Thanks Jeff, but what catches my attention is the error message about the cluster address:

[ERROR] storage.raft: failed to make requestVote RPC: target="{Voter vault01 https://127.0.0.1:8201/}" error="dial tcp: address https://127.0.0.1:8201: too many colons in address"

Can you provide the content of your config file?

I’m not seeing anything in your output above that jumps out at me as incorrect.

However, if you don’t have enough nodes to maintain quorum you will get errors, although the “too many colons” error does seem strange and I can’t say I’ve run into it before.

Sure, here is my vault.hcl file. Both nodes use the same configuration, changing only the node_id and the vm_name variable. The cert is a wildcard *.mydomain.com.


ui = true
disable_mlock = true
plugin_directory = "/opt/vault/plugins"

storage "raft" {
  path    = "/opt/vault/data"
  node_id = "vault01"
}

listener "tcp" {
  address         = "0.0.0.0:8200"
  cluster_address = "0.0.0.0:8201"
  tls_disable     = 0
  tls_cert_file   = "/opt/vault/tls/cert.com.crt"
  tls_key_file    = "/opt/vault/tls/cert.com.key"

  telemetry {
    unauthenticated_metrics_access = true
  }
}

api_addr     = "https://${vm_name}:8200"
cluster_addr = "https://${vm_name}:8201"

telemetry {
  disable_hostname          = true
  prometheus_retention_time = "24h"
}

seal "azurekeyvault" {
  client_id     = "${client_id}"
  client_secret = "${client_secret}"
  tenant_id     = "${tenant_id}"
  vault_name    = "${vault_name}"
  key_name      = "${key_name}"
}

I’m not seeing anything unusual there.
Historically we’ve explicitly set cluster_address to the node’s own IP and desired port (e.g. cluster_address = "10.0.0.11:8201" rather than 0.0.0.0) to mitigate some odd behavior in older versions of Vault where the port wasn’t set up correctly (not sure if that’s still a thing).

Perhaps that would be worth a try. I’d also be curious whether you see the same issue after adding a third node (though in this case you’d probably need to take down two nodes to simulate a failure).

I think the problem is that the ${} variables are not being replaced by your templating, so cluster_addr turns into https://:8201, and Go treats the empty host as the local system, i.e. https://127.0.0.1:8201, which is exactly the address in the error. Are these actual VMs? EC2? Like @jeffsanicola said, set the values to the node’s IP address and that should fix your errors. I also concur on the number of nodes: you can’t have an even number and expect to survive a failure; Raft won’t (or most likely won’t) be able to elect a new leader in that case.
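If the template really did come through unsubstituted, here is a quick sketch of how Go treats the resulting value (the URL below is just the unrendered cluster_addr from the config above):

package main

import (
	"fmt"
	"net/url"
)

func main() {
	// An unsubstituted ${vm_name} leaves cluster_addr with an empty host.
	// The URL still parses, and when the host part of a dial address is
	// empty (as in ":8201"), Go assumes the local system, i.e. 127.0.0.1.
	u, err := url.Parse("https://:8201")
	fmt.Println(u.Hostname(), err) // prints an empty hostname and <nil>
}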

Thanks @jeffsanicola and @aram. I’ve set up a third node and fixed some DNS references, and it looks good now: stopping vault01, vault02 takes over as leader and there are no more “too many colons” errors.
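For anyone who hits this later, vault operator raft list-peers is a quick way to check the peer set and see which node is currently leading.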

Thanks again.