Vault raft: Preventing server addition that would require removal of too many servers and cause cluster instability

HybridNeos · May 25, 2021, 4:39pm

Hello,

I am trying to set up a Vault HA raft cluster with three servers. I have initialized and unsealed the first server, and it shows up as the leader with “vault operator raft list-peers”. When I then try to join another server, it seems to work; I start to unseal, and then get the error message in the title. Any idea what the error message means?

I am attempting to use TLS and a certificate and can provide the config files if it would help. Thanks

mikegreen · May 25, 2021, 5:06pm

Are you following a guide?

Config + logs are always helpful

HybridNeos · May 25, 2021, 5:38pm

The best example I could find is the YouTube video “Vault 1.4 Integrated Storage Overview”, since it covers TLS and doesn’t involve auto-unsealing.

The pertinent error log is:

May 25 13:24:57 x105 vault[11797]: 2021-05-25T13:24:57.554-0400 [ERROR] core: failed to retry join raft cluster: retry=2s
May 25 13:24:59 x105 vault[11797]: 2021-05-25T13:24:59.555-0400 [INFO]  core: security barrier not initialized
May 25 13:24:59 x105 vault[11797]: 2021-05-25T13:24:59.555-0400 [INFO]  core: attempting to join possible raft leader node: leader_addr=https://x104:8200
May 25 13:24:59 x105 vault[11797]: 2021-05-25T13:24:59.563-0400 [WARN]  core: join attempt failed: error="failed to send answer to raft leader node: Error making API request.
May 25 13:24:59 x105 vault[11797]: URL: PUT https://x104:8200/v1/sys/storage/raft/bootstrap/answer
May 25 13:24:59 x105 vault[11797]: Code: 500. Errors:
May 25 13:24:59 x105 vault[11797]: * Preventing server addition that would require removal of too many servers and cause cluster instability"

My leader config is

listener "tcp" {
  address = "(IP of x104):8200"
  tls_cert_file = "/etc/ssl/certs/fullchain.pem"
  tls_key_file  = "/etc/pki/tls/private/privkey.key"
}

storage "raft" {
  path = "/opt/raft"
  node_id = "raft_node1"
}

api_addr = "https://x104:8200"
cluster_addr = "https://x104:8201"
ui = true
disable_mlock = true

And the config of the server attempting to join is

storage "raft" {
  path = "/opt/raft"
  node_id = "raft_node2"

  retry_join {
    leader_api_addr = "https://x104:8200"
    leader_ca_cert_file = "/etc/ssl/certs/fullchain.pem"
    leader_client_cert_file = "/etc/ssl/certs/fullchain.pem"
    leader_client_key_file = "/etc/pki/tls/private/privkey.key"
  }
}

listener "tcp" {
  address     = "0.0.0.0:8200"
  tls_cert_file = "/etc/ssl/certs/fullchain.pem"
  tls_key_file  = "/etc/pki/tls/private/privkey.key"
}

cluster_addr = "https://x104:8201"
disable_mlock = true
#ui = true
api_addr = "https://x105:8200"

The one thing I’m not sure about is the listener address in the leader config. Also, the unseal now hangs and fails with a “context deadline exceeded” error, but the error log is the same. Thanks!

mikegreen · May 25, 2021, 8:43pm

The cluster_addr line in the joining server’s config (cluster_addr = "https://x104:8201") is telling the second raft node that its address within the cluster is actually the first node’s hostname. I think this should be x105.
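
For later readers, here is a sketch of the joining server's config with that one change applied. It assumes the second server's hostname is x105 (as its api_addr suggests) and that it uses the same cluster port, 8201, as the leader. Everything else is copied from the config posted above and has not been verified separately.

```hcl
storage "raft" {
  path = "/opt/raft"
  node_id = "raft_node2"

  retry_join {
    # These point at the existing leader (x104), which is correct.
    leader_api_addr = "https://x104:8200"
    leader_ca_cert_file = "/etc/ssl/certs/fullchain.pem"
    leader_client_cert_file = "/etc/ssl/certs/fullchain.pem"
    leader_client_key_file = "/etc/pki/tls/private/privkey.key"
  }
}

listener "tcp" {
  address     = "0.0.0.0:8200"
  tls_cert_file = "/etc/ssl/certs/fullchain.pem"
  tls_key_file  = "/etc/pki/tls/private/privkey.key"
}

# cluster_addr is this node's own address, not the leader's.
# Assumption: this node is reachable as x105 on port 8201.
cluster_addr = "https://x105:8201"
api_addr = "https://x105:8200"
disable_mlock = true
```

Both api_addr and cluster_addr describe the node the file lives on, so each server in the cluster needs its own values here.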

HybridNeos · May 25, 2021, 9:49pm

That was the only change needed, thank you so much! I wish the documentation had examples of both a leader and a follower config file, but hopefully someone else will find this page if they run into the same error.

mikegreen · May 26, 2021, 12:11am

Good to hear.

Which page documents only the leader config, with no follower config? I can make that change.

HybridNeos · May 26, 2021, 1:21am

I was referring to the “Vault HA Cluster with Integrated Storage” page on HashiCorp Learn.

In the Retry Join section there is a partial config file where the rest is snipped. Thinking back on it now, I could have read the “Server Configuration” page in the Vault documentation to see what cluster_addr represents, but having the full config file would surely have made things clearer earlier. The lack of documentation isn’t as egregious as I thought it was.

hb9cwp · July 27, 2021, 6:16pm

I just ran into the very same error using Vault 1.8.0-rc2 on OpenBSD-amd64 6.9, and the same change to my configuration fixed it. Thank you.

My mistake stemmed from not realizing that a Vault HA cluster address is not an additional (virtual) router address, as it is in VRRP or CARP clusters, for example.

Is that a configuration error that ‘vault operator diagnose -config /etc/vault/vault.hcl’ could detect early?
