Vault raft: Preventing server addition that would require removal of too many servers and cause cluster instability

Hello,

I am trying to set up a Vault HA raft cluster with three servers. I have initialized and unsealed the first server, and it shows up as the leader with “vault operator raft list-peers”. When I then try to join another server, it seems to work: I start to unseal, and then I get the error message in the title. Any idea what the error message means?

I am attempting to use TLS and a certificate and can provide the config files if it would help. Thanks

Are you following a guide?

Config + logs are always helpful.

The best example I could find is the “Vault 1.4 Integrated Storage Overview” video on YouTube, since it mentions TLS and doesn’t involve auto-unsealing.

The pertinent error log is

May 25 13:24:57 x105 vault[11797]: 2021-05-25T13:24:57.554-0400 [ERROR] core: failed to retry join raft cluster: retry=2s
May 25 13:24:59 x105 vault[11797]: 2021-05-25T13:24:59.555-0400 [INFO]  core: security barrier not initialized
May 25 13:24:59 x105 vault[11797]: 2021-05-25T13:24:59.555-0400 [INFO]  core: attempting to join possible raft leader node: leader_addr=https://x104:8200
May 25 13:24:59 x105 vault[11797]: 2021-05-25T13:24:59.563-0400 [WARN]  core: join attempt failed: error="failed to send answer to raft leader node: Error making API request.
May 25 13:24:59 x105 vault[11797]: URL: PUT https://x104:8200/v1/sys/storage/raft/bootstrap/answer
May 25 13:24:59 x105 vault[11797]: Code: 500. Errors:
May 25 13:24:59 x105 vault[11797]: * Preventing server addition that would require removal of too many servers and cause cluster instability"

My leader config is

listener "tcp" {
  address = "(IP of x104):8200"
  tls_cert_file = "/etc/ssl/certs/fullchain.pem"
  tls_key_file  = "/etc/pki/tls/private/privkey.key"
}

storage "raft" {
  path = "/opt/raft"
  node_id = "raft_node1"
}

api_addr = "https://x104:8200"
cluster_addr = "https://x104:8201"
ui = true
disable_mlock = true

And the server attempting to join config is

storage "raft" {
  path = "/opt/raft"
  node_id = "raft_node2"

  retry_join {
    leader_api_addr = "https://x104:8200"
    leader_ca_cert_file = "/etc/ssl/certs/fullchain.pem"
    leader_client_cert_file = "/etc/ssl/certs/fullchain.pem"
    leader_client_key_file = "/etc/pki/tls/private/privkey.key"
  }
}

listener "tcp" {
  address     = "0.0.0.0:8200"
  tls_cert_file = "/etc/ssl/certs/fullchain.pem"
  tls_key_file  = "/etc/pki/tls/private/privkey.key"
}

cluster_addr = "https://x104:8201"
disable_mlock = true
#ui = true
api_addr = "https://x105:8200"

The one thing I’m not sure about is the listener address in the leader config. Also, the unseal now hangs and fails with a “context deadline exceeded” error, but the error log is the same. Thanks!

The cluster_addr = "https://x104:8201" line in the joining server’s config is telling the second raft node that its own address within the cluster is the first node’s hostname. I think this should be x105.
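For anyone skimming, here is a minimal sketch of the two address settings on the joining server after that change. It assumes the second server is reachable as x105 (as its api_addr already suggests) and keeps port 8201 from the original config; everything else in the file stays as posted.

```hcl
# Joining server (x105): both values describe THIS node, not the leader.
api_addr     = "https://x105:8200"
cluster_addr = "https://x105:8201" # was https://x104:8201, i.e. the leader's address

# The leader is referenced only inside the retry_join block
# (leader_api_addr = "https://x104:8200"), which stays unchanged.
```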

That was the only change needed, thank you so much! I wish the documentation had examples of both a leader and a follower config file, but hopefully someone else will find this page if they run into the same error.

Good to hear.

Which page documents only the leader config and no follower config? I can make that change.

I was referring to the “Vault HA Cluster with Integrated Storage | Vault - HashiCorp Learn” page.

In the Retry Join section there is a partial config file where the rest is snipped. Thinking back on it now, I could have read the “Server Configuration | Vault by HashiCorp” page to see what cluster_addr represents, but having the full config file would surely have made things clearer earlier. The lack of documentation isn’t as egregious as I thought it was.

I just ran into the very same error using Vault 1.8.0-rc2 on OpenBSD-amd64 6.9, and the same change to my configuration fixed it. Thank you.

The mistake stemmed from my confusion: a Vault HA cluster address is not an additional (virtual) router address shared by the nodes, as it is, for example, in VRRP or CARP clusters.

Is that a configuration error that ‘vault operator diagnose -config /etc/vault/vault.hcl’ could detect early?