I’m setting up an HA Vault cluster in AWS on EC2 instances, using an ALB as the only entry point.
In the docs I read:
Note that only active nodes have active listeners. When a node becomes active it will start cluster listeners, and when it becomes standby it will stop them.
I interpret this as meaning that port 8201 does not listen for connections on the standby servers, but that’s not the case, as you can see in my screenshot.
Q1: Is the documentation outdated, or did I misinterpret something?
Since I can’t distinguish active from standby by the availability of port 8201, I ended up with the following configuration:
api_addr = "https://vault.uniqueos-stage.[redacted]"
cluster_addr = "https://vault.uniqueos-stage.[redacted]:8201"
The Load balancer has 2 Target Groups:
- port 443 -> port 8200 with health check reporting healthy on status 200 and 429
- port 8201 -> port 8201 with health check reporting healthy on status 200
This way the LB will route server-to-server communication only to the current active node.
Q2: Is this the way one is supposed to set it up?
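To make the two health check matchers above concrete, here is a minimal Python sketch of how the status codes of Vault's /v1/sys/health endpoint map to node states, and which nodes each target group would keep in rotation. The status codes are Vault's documented defaults; the names HEALTH_STATES, API_MATCHER and CLUSTER_MATCHER are made up for illustration, and the health check is assumed to hit /v1/sys/health on the API port.

```python
from typing import Iterable

# Default status codes returned by Vault's /v1/sys/health endpoint.
HEALTH_STATES = {
    200: "active",               # initialized, unsealed, active
    429: "standby",              # unsealed, standby
    472: "dr-secondary",         # disaster recovery secondary
    473: "performance-standby",  # performance standby
    501: "uninitialized",
    503: "sealed",
}

# Hypothetical names for the two matchers described in the post.
API_MATCHER = {200, 429}   # 443 -> 8200: active and standby both count as healthy
CLUSTER_MATCHER = {200}    # 8201 -> 8201: only the active node counts as healthy


def node_state(status_code: int) -> str:
    """Translate a sys/health status code into a readable node state."""
    return HEALTH_STATES.get(status_code, "unknown")


def passes_health_check(status_code: int, matcher: Iterable[int]) -> bool:
    """Mimic a load balancer success-code matcher."""
    return status_code in set(matcher)


if __name__ == "__main__":
    for code in (200, 429, 503):
        print(
            code,
            node_state(code),
            "api:", passes_health_check(code, API_MATCHER),
            "cluster:", passes_health_check(code, CLUSTER_MATCHER),
        )
```

With these matchers a standby node stays in the 8200 target group but drops out of the 8201 one, which is the behaviour the configuration above is aiming for.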
The setup I documented doesn’t work, because the ALB also terminates TLS for the server-to-server communication on port 8201, which upsets Vault.
So I think I came up with the right setup with request forwarding (no redirection) in AWS, which no longer uses the ALB but an NLB instead.
Bear in mind that this setup is not meant to be exposed to the internet: the balancer is internal to the VPC. We plan to access Vault over a VPN connection into the VPC. I suppose it’s not a great security concern to expose port 8201 to the public internet, but if you’re paranoid enough you may want to set up a dedicated internal NLB for it (which costs more, of course).
This is the setup of the NLB:
- port 443 -> port 8200 with TCP health check on host port. Vault terminates TLS. Traffic is forwarded to all instances as long as port 8200 is open
- port 8201 -> port 8201 with HTTPS health check on the https://[host]:8201/v1/sys/health. Only the active node responds with 200 (the standby responds with 429) making the NLB relay all server-to-server traffic to the active node.
Sorry, I used the wrong port in the health check for the port 8201 target group. The correct one is:
an HTTPS health check on https://[host]:8200/v1/sys/health
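If you want to see what the NLB health check will see before wiring it up, something like the following Python sketch could be run from inside the VPC. It is only an illustration, not part of the setup itself: the function names and the example IP addresses are hypothetical, and it assumes the instances' Vault certificates are trusted by the machine running it (otherwise pass a suitable ssl.SSLContext as context).

```python
import urllib.error
import urllib.request


def health_status(host, port=8200, timeout=3.0, context=None):
    """Return the HTTP status code of Vault's health endpoint on one node."""
    url = f"https://{host}:{port}/v1/sys/health"
    try:
        with urllib.request.urlopen(url, timeout=timeout, context=context) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # urllib raises on non-2xx answers such as 429 (standby);
        # the status code is exactly what we are after, so return it.
        return err.code


def active_nodes(hosts, probe=health_status):
    """Return the hosts whose health endpoint answers 200 (the active node)."""
    active = []
    for host in hosts:
        try:
            if probe(host) == 200:
                active.append(host)
        except OSError:
            # Unreachable node or TLS failure: a health check would fail too.
            continue
    return active


# Example call with hypothetical private IPs of the three Vault instances:
#   print(active_nodes(["10.0.1.10", "10.0.2.10", "10.0.3.10"]))
```

In a healthy cluster the returned list should contain exactly one host, which is the only target the 8201 target group keeps in rotation.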