Vault 1.9 Raft Cluster issue - leader down and follower shows failed to make requestVote RPC

I have a Vault 1.9 Raft cluster, two nodes.

vault01 status:

Key                      Value
---                      -----
Recovery Seal Type       shamir
Initialized              true
Sealed                   false
Total Recovery Shares    1
Threshold                1
Version                  1.9.0
Storage Type             raft
Cluster Name             vault-cluster-50898c55
Cluster ID               248e73f2-0a23-e66d-9070-0eb7a5cb49b5
HA Enabled               true
HA Cluster               https://vault01:8201
HA Mode                  active
Active Since             2022-01-04T19:19:09.496657694Z
Raft Committed Index     129
Raft Applied Index       129

vault02 status:
Key                      Value
---                      -----
Recovery Seal Type       shamir
Initialized              true
Sealed                   false
Total Recovery Shares    1
Threshold                1
Version                  1.9.0
Storage Type             raft
Cluster Name             vault-cluster-50898c55
Cluster ID               248e73f2-0a23-e66d-9070-0eb7a5cb49b5
HA Enabled               true
HA Cluster               https://vault01:8201
HA Mode                  standby
Active Node Address      https://vault01:8200
Raft Committed Index     130
Raft Applied Index       129

To simulate a failure, I stopped the Vault service on vault01.

vault01 status:
Error checking seal status: Get "https://127.0.0.1:8200/v1/sys/seal-status": dial tcp 127.0.0.1:8200: connect: connection refused

vault02 status:
Key                      Value
---                      -----
Recovery Seal Type       shamir
Initialized              true
Sealed                   false
Total Recovery Shares    1
Threshold                1
Version                  1.9.0
Storage Type             raft
Cluster Name             vault-cluster-50898c55
Cluster ID               248e73f2-0a23-e66d-9070-0eb7a5cb49b5
HA Enabled               true
HA Cluster               https://vault01:8201
HA Mode                  standby
Active Node Address      https://vault01:8200
Raft Committed Index     130
Raft Applied Index       129

The UI on vault02 shows a message:
This is a standby Vault node but can’t communicate with the active node via request forwarding. Sign in at the active node to use the Vault UI.

The vault02 journalctl logs show:
[WARN] storage.raft: Election timeout reached, restarting election
[INFO] storage.raft: entering candidate state: node="Node at vault02:8201 [Candidate]" term=395
[ERROR] storage.raft: failed to make requestVote RPC: target="{Voter vault01 https://127.0.0.1:8201}" error="dial tcp: address https://127.0.0.1:8201: too many colons in address"

“too many colons in address” looks like the error Go’s net.SplitHostPort returns when the raft module tries to split the cluster address into host and port (net/ipsock.go, around line 196 in the Go source).
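A quick Go sketch reproduces the same error outside Vault; net.Dial does the same host/port split that raft’s transport ends up doing:

package main

import (
	"fmt"
	"net"
)

func main() {
	// Dial expects a plain "host:port"; the scheme's extra colon breaks
	// the host/port split and produces the exact error from the raft logs.
	_, err := net.Dial("tcp", "https://127.0.0.1:8201")
	fmt.Println(err)
	// dial tcp: address https://127.0.0.1:8201: too many colons in address
}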
Tried wrapping the addresses in [ ] in the HCL config file (the way you would for an IPv6 literal), but got the same issue.
Anyone had this before?
Thanks.

I believe you need at least three nodes in a Raft cluster for HA to work properly. Odd numbers are preferred to maintain quorum: quorum is a majority, floor(N/2) + 1, so a two-node cluster needs both nodes up and losing either one loses quorum, while a three-node cluster can lose one node and still hold a majority.

Thanks Jeff, but what catches my attention is the error message about the cluster address:

[ERROR] storage.raft: failed to make requestVote RPC: target="{Voter vault01 https://127.0.0.1:8201/}" error="dial tcp: address https://127.0.0.1:8201: too many colons in address"

Can you provide the content of your config file?

I’m not seeing anything in your output above that jumps out at me as incorrect.

However, if you don’t have enough nodes to maintain quorum you will get errors, although the “too many colons” error does seem strange and I can’t say I’ve run into it before.

Sure, here is my vault.hcl file. Both nodes use the same configuration, changing only the node_id and the vm_name variable. The cert is a wildcard *.mydomain.com.


ui = true
disable_mlock = true
plugin_directory = "/opt/vault/plugins"

storage "raft" {
  path    = "/opt/vault/data"
  node_id = "vault01"
}

listener "tcp" {
  address         = "0.0.0.0:8200"
  cluster_address = "0.0.0.0:8201"
  tls_disable     = 0
  tls_cert_file   = "/opt/vault/tls/cert.com.crt"
  tls_key_file    = "/opt/vault/tls/cert.com.key"

  telemetry {
    unauthenticated_metrics_access = true
  }
}

api_addr     = "https://${vm_name}:8200"
cluster_addr = "https://${vm_name}:8201"

telemetry {
  disable_hostname          = true
  prometheus_retention_time = "24h"
}

seal "azurekeyvault" {
  client_id     = "${client_id}"
  client_secret = "${client_secret}"
  tenant_id     = "${tenant_id}"
  vault_name    = "${vault_name}"
  key_name      = "${key_name}"
}

I’m not seeing anything unusual there.
Historically we’ve explicitly set cluster_address to the node’s own IP and desired port (e.g. cluster_address = "10.0.0.11:8201" rather than 0.0.0.0) to mitigate some odd behavior in older versions of Vault where the port wasn’t set up correctly (not sure if that’s still a thing).

Perhaps that would be worth a try. I’d also be curious whether you see the same issue after adding a third node (though in this case you’d probably need to take down two nodes to simulate a failure).

I think the problem is that the ${} variables are not being replaced by your templating, so cluster_addr turns into https://:8201, and Go treats the empty host as the local system, i.e. https://127.0.0.1:8201, which is exactly the address in the error. Are these actual VMs? EC2? Like @jeffsanicola said, set the values to the node’s IP address and that should fix your errors. I also concur on the number of nodes: you can’t have an even number and expect to survive a failure; Raft won’t (or most likely won’t) be able to elect a new leader in that case.
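If the template really did come through unsubstituted, here is a quick sketch of how Go treats the resulting value (the URL below is just the unrendered cluster_addr from the config above):

package main

import (
	"fmt"
	"net/url"
)

func main() {
	// An unsubstituted ${vm_name} leaves cluster_addr with an empty host.
	// The URL still parses, and when the host part of a dial address is
	// empty (as in ":8201"), Go assumes the local system, i.e. 127.0.0.1.
	u, err := url.Parse("https://:8201")
	fmt.Println(u.Hostname(), err) // prints an empty hostname and <nil>
}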

Thanks @jeffsanicola and @aram. I’ve set up a third node and fixed some DNS references, and it looks good now: stopping vault01, vault02 takes over as leader and there are no more “too many colons” errors.
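For anyone who hits this later, vault operator raft list-peers is a quick way to check the peer set and see which node is currently leading.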

Thanks again.