Configuring Vault HA

Hi,
I’m trying to configure Vault HA.
I’ve got two hosts: vault-node-1 and vault-node-2. Both use Raft (integrated storage) as the storage backend, with the following config:

vault-node-1:

storage "raft" {
  path    = "/u01/app/vault/data.raft"
  node_id = "vault-node-1"

  retry_join {
    leader_api_addr         = "https://vault-node-2:8200"
    leader_ca_cert_file     = "/usr/local/certs/rootCA.crt"
    leader_client_cert_file = "/u01/app/vault/server.crt"
    leader_client_key_file  = "/u01/app/vault/server.key"
  }
}

cluster_addr = "https://vault-node-1:8201"
api_addr     = "https://vault-node-1:8200"

default_lease_ttl = 7200
max_lease_ttl     = 7200

listener "tcp" {
  address         = "0.0.0.0:8200"
  tls_cert_file   = "/u01/app/vault/server.crt"
  tls_key_file    = "/u01/app/vault/server.key"
  tls_min_version = "tls12"
}

vault-node-2:

storage "raft" {
  path    = "/u01/app/vault/data.raft"
  node_id = "vault-node-2"

  retry_join {
    leader_api_addr         = "https://vault-node-1:8200"
    leader_ca_cert_file     = "/usr/local/certs/rootCA.crt"
    leader_client_cert_file = "/u01/app/vault/server.crt"
    leader_client_key_file  = "/u01/app/vault/server.key"
  }
}

cluster_addr = "https://vault-node-2:8201"
api_addr     = "https://vault-node-2:8200"

default_lease_ttl = 7200
max_lease_ttl     = 7200

listener "tcp" {
  address         = "0.0.0.0:8200"
  tls_cert_file   = "/u01/app/vault/server.crt"
  tls_key_file    = "/u01/app/vault/server.key"
  tls_min_version = "tls12"
}

The cluster seems to be running ok:

$ vault operator raft list-peers
Node            Address              State       Voter
----            -------              -----       -----
vault-node-1    vault-node-1:8201    leader      true
vault-node-2    vault-node-2:8201    follower    true

However, when I stop the leader node (vault-node-1), connect to the standby node (vault-node-2) and try to read a secret, I get the following error:

Get "https://vault-node-1:8200/v1/sys/internal/ui/mounts/secret/kr/test": dial tcp 10.0.1.23:8200: connect: connection refused

It seems that the standby didn’t become a primary.

Is there a way to force the standby to become the primary (and can this be automated)?

If you are using raft storage you should run an odd number of instances, and you need at least 3 for the cluster to survive a failure - so in your case 3.

Thanks, I’ve added a third node and that helped.

However, if I had to stay with two nodes, would it be possible to use the surviving node (in case of primary failure)?

Not if you are using raft for storage. Raft is a consensus-based clustering system, which means a majority of instances (a quorum) has to be running for the cluster to be available. So that means:

1 instance - 1 needs to be running
2 instances - 2 need to be running
3 instances - 2 need to be running
4 instances - 3 need to be running
5 instances - 3 need to be running

As you can see, with only 1 or 2 instances every instance needs to be running for the cluster to work, meaning there is no HA. Running 2 instances is pointless, as it gains nothing over a single instance.
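The numbers in the list above follow from the majority rule: a cluster of n voters needs floor(n/2) + 1 of them running. A minimal sketch of that arithmetic (the function names are made up for illustration and are not part of Vault):

```python
def quorum(instances: int) -> int:
    """Smallest majority of a cluster with the given number of voters."""
    if instances < 1:
        raise ValueError("a cluster needs at least one instance")
    return instances // 2 + 1


def failure_tolerance(instances: int) -> int:
    """How many instances can be down while a majority is still running."""
    return instances - quorum(instances)


if __name__ == "__main__":
    for n in range(1, 6):
        print(f"{n} instances - {quorum(n)} need to be running, "
              f"{failure_tolerance(n)} may fail")
```

Going from 1 to 2 instances, or from 3 to 4, raises the quorum without raising the number of failures the cluster can tolerate, which is why odd cluster sizes are the usual choice.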

If you did have 2 instances and one failed, Vault would stop operating. You would need to fix/rebuild/replace the broken instance and get it running and communicating with the surviving instance before Vault would work again.

If you are using a different HA-capable storage backend (e.g. DynamoDB or PostgreSQL), and are therefore just using Vault as a stateless layer, things change. At that point Vault is using a leadership-based clustering system, which needs only a single instance to be working, so having 2 instances is viable.

Thank you for the explanation.

One more question: in a two-node scenario, if the leader fails, is there any way to “convert” the surviving node into a non-HA setup?

Or can I create a raft snapshot using recovery mode?

thanks

From https://developer.hashicorp.com/vault/docs/concepts/recovery-mode:

Recovery mode Vault automatically resizes the cluster to size 1.

This means that starting any node in recovery mode converts that node into a standalone cluster of 1 node. The effect of this remains after recovery mode is exited.

An advanced user comfortable with HashiCorp’s implementation of Raft could also use the https://developer.hashicorp.com/vault/docs/concepts/integrated-storage#manual-recovery-using-peers-json procedure to achieve the same result without using recovery mode.
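For the peers.json route, the file lists the nodes that should remain in the cluster. A rough sketch that generates a single-node file for the surviving node from this thread; the field names ("id", "address", "non_voter") are as I recall them from the linked documentation, and the helper name is made up, so check both against the docs for your Vault version before using the output:

```python
import json


def build_peers_json(node_id: str, cluster_address: str) -> str:
    """Return peers.json content describing a single voting node.

    node_id should match node_id in the storage "raft" stanza, and
    cluster_address is the host:port shown by `vault operator raft list-peers`.
    The field names are an assumption based on the linked documentation.
    """
    peers = [
        {
            "id": node_id,
            "address": cluster_address,
            "non_voter": False,
        }
    ]
    return json.dumps(peers, indent=2)


if __name__ == "__main__":
    # Values taken from the vault-node-2 config and list-peers output above.
    print(build_peers_json("vault-node-2", "vault-node-2:8201"))
```

According to the linked procedure, the file is placed in the node's raft data directory while Vault is stopped and is consumed on the next start; the exact location is described in the docs.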