Vault HA Failover using HAProxy

We are running an HA cluster of HashiCorp Vault with Integrated Storage. The cluster itself detects when the active node is down and automatically promotes a standby node to take its place, but various sources suggest that a load balancer is still required to handle failover. HAProxy is one thoroughly documented option, but that documentation covers Consul storage, not Integrated Storage.

We have configured HAProxy to use the /v1/sys/health API endpoint to determine which node is active and to route requests to that node. Testing the connection between a Percona server and the Vault cluster through HAProxy shows that failover works well both when the active node is “stepped down” and when it is shut down completely. Here is our HAProxy configuration for anyone who is curious:

global

defaults
     mode tcp
     timeout connect 5000ms
     timeout client 50000ms 
     timeout server 50000ms

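frontend percona
     mode tcp
     bind <ip_address_of_ha_proxy_server:80>
     bind <ip_address_of_ha_proxy_server:443> ssl cert /path/to/cert
     # Note: "redirect scheme" is an HTTP-level rule, so it likely has no effect while this frontend is in tcp mode
     redirect scheme https code 301 if !{ ssl_fc }
     log global
     option tcplog
     # Route all traffic to the Vault backend defined below
     default_backend vault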

backend vault
     mode tcp
     timeout check 5000
     timeout server 30000
     timeout connect 5000
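     # Vault's health endpoint returns 200 only on the active node (standbys return 429),
     # so HAProxy marks just the active node as up
     option httpchk GET /v1/sys/health
     http-check expect status 200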
     server node1 <ip_address_of_vault_server_1> check ssl check-ssl verify none
     server node2 <ip_address_of_vault_server_2> check ssl check-ssl verify none
     server node3 <ip_address_of_vault_server_3> check ssl check-ssl verify none
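
For anyone who wants to sanity-check what the `http-check expect status 200` line ends up selecting, here is a minimal Python sketch of the same decision. The status codes are Vault's default responses for /v1/sys/health; the function names and node addresses are made up for illustration and are not part of our setup.

```python
# Default status codes returned by Vault's /v1/sys/health endpoint.
HEALTH_STATES = {
    200: "active",
    429: "standby",
    472: "dr-secondary",
    473: "performance-standby",
    501: "uninitialized",
    503: "sealed",
}


def classify(status_code):
    """Translate a /v1/sys/health status code into a node state."""
    return HEALTH_STATES.get(status_code, "unknown")


def find_active(statuses):
    """Return the address of the first node that answered 200, or None.

    `statuses` maps a node address to the HTTP status of its last
    health check (None when the node did not answer at all).
    """
    for address, code in statuses.items():
        if classify(code) == "active":
            return address
    return None


if __name__ == "__main__":
    # Hypothetical addresses and results, e.g. just after node1 was shut down.
    last_checks = {"10.0.0.1:8200": None, "10.0.0.2:8200": 200, "10.0.0.3:8200": 429}
    print(find_active(last_checks))
```

This mirrors the HAProxy behaviour above: a node that is down, sealed, or a standby is never chosen, so traffic only reaches whichever node currently reports 200.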

Question: Since there is very little documentation on using HAProxy with Integrated Storage to handle failover, is this approach considered best practice for Vault failover? If there is something simpler or more reliable, could anyone point us to the documentation?

The storage backend is just that - a backend. It’s not relevant to how clients find and connect to your Vault cluster.

A load balancer is generally recommended so that a Vault node can go down without causing a service interruption. If clients simply picked a Vault node at random and relied on the standby nodes forwarding requests to the active node, an outage of one standby node would interrupt the traffic of every client that happened to pick it.

I think you’re confusing “HA” and “DR”. A single cluster has HA built in: you do not need to handle failover within the cluster, and you don’t need to care which node is the active node. For the most part the proxy simply rotates through the nodes, and all it needs to check is that each node is responding. The cluster itself (regardless of the storage backend) takes care of forwarding requests to the active node and handling write operations.
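
As a rough illustration of "rotate through whatever is responding", here is a small Python sketch. It assumes that an unsealed node answers /v1/sys/health with 200 (active), 429 (standby) or 473 (performance standby), which are Vault's defaults; the function names and addresses are hypothetical. Vault's health endpoint also accepts a `standbyok=true` query parameter that makes standbys answer 200, which may be the easier route if the proxy can only match a single status.

```python
from itertools import cycle

# Default health codes that mean "unsealed and able to serve or forward requests".
RESPONDING = {200, 429, 473}


def responding_nodes(statuses):
    """Return the addresses whose last health check shows an unsealed node."""
    return [address for address, code in statuses.items() if code in RESPONDING]


def rotation(statuses):
    """Build an endless round-robin iterator over the responding nodes."""
    nodes = responding_nodes(statuses)
    if not nodes:
        raise RuntimeError("no Vault node is responding")
    return cycle(nodes)


if __name__ == "__main__":
    # Hypothetical check results: node3 is sealed and is skipped.
    checks = {"10.0.0.1:8200": 200, "10.0.0.2:8200": 429, "10.0.0.3:8200": 503}
    targets = rotation(checks)
    for _ in range(4):
        print(next(targets))
```

With this approach a standby that receives a request relies on the cluster to forward it to the active node, so the proxy never has to track which node is active.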

With “DR”, on the other hand, there are operations you need to monitor, and an actual failover is more than simply switching the connection. Vault “DR” is not active-active; it is active-standby, which means that after the primary fails you actually need to “promote” the secondary cluster to become active. It’s actually a bit more complicated than that, and best practice from HashiCorp is NOT to automate this process: monitor and alert, then perform the failover manually.
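
To make the "monitor and alert, don't automate" idea concrete, here is a minimal Python sketch of the decision a monitor could make on each polling round. It is only an assumption about how one might structure it: the threshold of three failed rounds and the function names are hypothetical, and promotion is deliberately left out so that a person makes that call.

```python
def primary_has_active(statuses):
    """True when at least one primary node reports 200 (active) on /v1/sys/health."""
    return any(code == 200 for code in statuses.values())


def check_primary(statuses, failures, threshold=3):
    """Process one polling round and return (failures, alert).

    `statuses` maps each primary node address to its last health status
    (None if unreachable). `failures` is the number of consecutive rounds
    without an active node before this one. `alert` becomes True once the
    count reaches `threshold`; the function never promotes anything.
    """
    if primary_has_active(statuses):
        return 0, False
    failures += 1
    return failures, failures >= threshold


if __name__ == "__main__":
    failures = 0
    # Hypothetical polling results: the primary loses its active node for three rounds.
    rounds = [{"p1": 200, "p2": 429}, {"p1": None, "p2": 429},
              {"p1": None, "p2": None}, {"p1": None, "p2": 503}]
    for statuses in rounds:
        failures, alert = check_primary(statuses, failures)
        if alert:
            print("ALERT: primary has no active node; a human should decide on DR promotion")
```

Requiring several failed rounds in a row avoids paging someone during the short window in which the primary cluster is electing a new active node on its own.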