We are using version 1.14 of the Vault Docker image (also tried 1.13 and 1.12) with Docker Compose to set up a Raft Vault cluster. I should mention that we are hosting our two Vaults on two separate hosts behind a Traefik reverse proxy. Our config.hcl looks like this:
We start both Vaults with exactly this configuration on both hosts, with the exception of the Vault instance number (1 or 2).
What works:
the vaults start and are reachable over the defined addresses
we can initialize the first vault via UI (by using shamir)
we reach the second vault via UI and after unsealing it is shown in the UI of the first vault (which is the leader)
What does not work:
after unsealing the second vault the UI of this vault is stuck in the unsealing step
the second vault is also only shown as a non-voter in the UI of the first vault
when we kill the first vault, we can’t access any vault, as no vault becomes the leader
Error messages from the logs (Vault 1):
storage.raft: failed to appendEntries to: peer="{Nonvoter node1 hashi-vault-1.domain.tld}" error="dial tcp: address hashi-vault-1.domain.tld: missing port in address"
storage.raft: failed to heartbeat to: peer=hashi-vault-1.domain.tld backoff time=1.28s error="dial tcp: address hashi-vault-1.domain.tld: missing port in address"
Error messages from the logs (Vault 2):
core: failed to retry join raft cluster: retry=2s err="waiting for unseal keys to be supplied"
core: failed to retry join raft cluster: retry=2s err="failed to send answer to raft leader node: error bootstrapping cluster: cluster already has state"
core: failed to get raft challenge: leader_addr=https://hashi-vault-1.domain.tld error="error during raft bootstrap init call: context deadline exceeded"
Vault 2 also has this info which could be interesting:
[INFO] core: security barrier not initialized
Any help is appreciated! Please don’t hesitate to ask for further information (logs, system configuration, etc.).
This is an immediate red flag… the entire purpose of the cluster listener is inter-node communication, so binding it to localhost cannot be correct.
Setting the node_id is a bit of a yellow flag. There is no good reason to set this. Leave it unset, and vault will generate and store a UUID instead, which insulates you from mistakes managing the node ID. (But don’t change it for an existing cluster. Do it when recreating a cluster.)
This is incorrect. Your listener is on port 8200 not 443. Also, you have tls_disable set so this needs to be http not https.
This is incorrect. Your cluster listener is on port 8201 not 443, and the /cluster URL-path is also incorrect and should be deleted. (This one stays https though.)
Probably due to the bad cluster_addr settings.
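A minimal sketch of what the corrected settings might look like on the first node, assuming the hostname from the logs above and Vault’s default ports. The bind addresses and the storage path are assumptions, not taken from the original config.hcl:

```hcl
# Sketch for node 1; use hashi-vault-2.domain.tld on the second host.

listener "tcp" {
  address         = "0.0.0.0:8200"  # API listener (assumed bind address)
  cluster_address = "0.0.0.0:8201"  # cluster listener; must be reachable by the other node, so not localhost
  tls_disable     = true            # affects the API listener only
}

storage "raft" {
  path = "/vault/data"  # assumed path
  # node_id left unset so Vault generates and stores a UUID
  # (only do this when creating a new cluster)
}

# API address: http because tls_disable is set, port 8200
api_addr = "http://hashi-vault-1.domain.tld:8200"

# Cluster address: always https, port 8201, no URL path
cluster_addr = "https://hashi-vault-1.domain.tld:8201"
```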
Expected. In a 2-node Raft cluster, both nodes must be up for the cluster to be functional. This is normal for any consensus/quorum system. 3 nodes are required to tolerate a node failure.
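The arithmetic behind this, as a small worked example, where n is the number of voting nodes (the notation is introduced here only for illustration):

```latex
\[
\text{quorum}(n) = \left\lfloor \frac{n}{2} \right\rfloor + 1,
\qquad
\text{tolerated failures}(n) = n - \text{quorum}(n)
\]
\[
n = 2:\ \text{quorum} = 2,\ \text{tolerated failures} = 0
\qquad
n = 3:\ \text{quorum} = 2,\ \text{tolerated failures} = 1
\]
```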
Thanks for the detailed answer! The first two points make sense, we will change that. Regarding the ports however we are not 100% sure. We did that because we’re using traefik as a reverse proxy and traefik doesn’t allow ports 8000 and 8001, so we did this as a fix. Do you think there is a better way? Thanks again for your answer, that’s really helpful!
8000 and 8001 are irrelevant to Vault. Vault’s ports are 8200 and 8201.
Although I am not familiar with Traefik, it seems implausible that a general purpose piece of software would forbid specific port numbers.
Vault’s internal cluster communication on port 8201 should go nowhere near any proxies - it is strictly from one Vault node directly to another Vault node.
When Vault is run behind a reverse proxy, it is appropriate to set api_addr to the address of the reverse proxy, which may involve a different port number - see High Availability | Vault | HashiCorp Developer
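For illustration, with a proxy in front, the two addresses might differ like this. This is only a sketch: it assumes the proxy terminates TLS on port 443 and forwards to Vault’s API listener on 8200, and that port 8201 is reachable directly on the Vault host, neither of which is confirmed in this thread:

```hcl
# Clients and redirects go through the reverse proxy
# (assumed: TLS terminated by the proxy on port 443)
api_addr = "https://hashi-vault-1.domain.tld"

# Node-to-node cluster traffic bypasses the proxy and goes straight to port 8201
# (assumed: 8201 is published directly on the host, not routed via the proxy)
cluster_addr = "https://hashi-vault-1.domain.tld:8201"
```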
And when we try to access hashi-vault-1.domain.tld/cluster (for all vaults) we get the following error, which is shown as a 404 error in the console:
What is supposed to happen when trying to access the address we set as the cluster address?
We still get the same errors. This one in particular makes us wonder whether it points to the problem we’re having: storage.raft: failed to appendEntries to: peer="{Nonvoter node1 hashi-vault-1.domain.tld}" error="dial tcp: address hashi-vault-1.domain.tld: missing port in address"
We also get the following warning:
[WARN] storage.raft: heartbeat timeout reached, not part of a stable configuration or a non-voter, not triggering a leader election
Thanks for your response. We are almost there, our cluster is now working but we have one question left:
Why do we need to use https for the cluster_addr? I thought that if we disable TLS, we wouldn’t need https. We don’t have any certs so how would that work?
Because Vault always uses HTTPS for its internal clustering traffic - no exceptions. tls_disable does not apply. It generates its own internal certificates which the user is not allowed to override, nor expected to ever interact with.