Boundary Workers behind NLB

Hi, I’ve deployed Boundary (v0.12.0) workers behind an NLB, so they all share the same public address.

I’m getting the following error on the client’s stderr: `error reading handshake result: failed to read protobuf message: failed to get reader: received close frame: status = StatusInternalError and reason = "refusing to activate session"`.

The logs of the workers that receive a new connection show `no tofu token but not in correct session state`.

The workers are behind an NLB because they sit in a private subnet that is allowed to reach the targets. Is this setup possible?
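For context, a minimal sketch of the worker config in this setup (all names, ports, and addresses are illustrative, not from the original post) would look something like:

```hcl
# Hypothetical worker config -- every worker advertises the same NLB address.
worker {
  name              = "worker-1"                 # unique per worker instance
  public_addr       = "workers-nlb.example.com:9202"  # shared NLB DNS name
  initial_upstreams = ["controller.internal:9201"]
}

listener "tcp" {
  address = "0.0.0.0:9202"
  purpose = "proxy"
}
```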

You can’t put multiple Boundary workers behind a single LB address (or at least not that I’ve ever seen work – it fails in exactly the way you’re seeing), but if you’re using HCP Boundary you can expose only a single worker publicly (or use one of the existing HCP-managed workers for this purpose) and reverse-proxy the traffic to other workers through it via HCP Boundary 0.12’s multi-hop worker features.
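A rough sketch of the multi-hop arrangement described above, assuming the publicly exposed worker is reachable at a hypothetical address: the internal workers point their `initial_upstreams` at that worker instead of at a controller.

```hcl
# Hypothetical downstream worker config (HCP Boundary 0.12+ multi-hop).
# "ingress-worker.example.com" is an assumed name for the single
# publicly exposed worker that proxies traffic onward.
worker {
  name              = "private-worker-1"
  initial_upstreams = ["ingress-worker.example.com:9202"]
}
```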

Hey @omkensey, if you could spare a few more minutes on this… I have the exact same problem as @alejandro-belleza-gl in one of my clusters, though I have a fairly similar setup in another cluster that actually works.

The main difference between them: in the one that works, the workers connect to the (controllers’) cluster port directly, i.e. initial_upstreams lists all the controller Pods’ hostnames. In the other one (that doesn’t work) the workers connect to the cluster port via Ingress (NLB), so in that case initial_upstreams is a single external DNS name which the Ingress controller routes to the specific Pods.
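To make the contrast concrete, the two configs would differ roughly as follows (hostnames are illustrative assumptions, and note that in 0.10.x this stanza used the older `controllers` key rather than `initial_upstreams`):

```hcl
# Cluster A (works): workers dial each controller Pod directly.
worker {
  initial_upstreams = [
    "boundary-controller-0.boundary.svc.cluster.local:9201",
    "boundary-controller-1.boundary.svc.cluster.local:9201",
  ]
}

# Cluster B (fails): a single NLB/Ingress DNS name fronting all controllers.
worker {
  initial_upstreams = ["boundary-cluster.example.com:9201"]
}
```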

I did test the latter without the Ingress, but it still doesn’t work, which leads me to believe there may be differences in this behavior across versions: the first cluster is running 0.10.5 and the second is running 0.12.0.

I just want to understand what exactly is happening internally that prevents it from working.

Just as extra context: if I scale the number of replicas down to 1, everything works as expected. When the number of replicas is greater than 1 and I attempt to connect to the same target multiple times, it does eventually work, once the Ingress LB has rotated through all the possible workers and lands on the worker that was first authorized by a controller. (It’s kind of an odd behavior, to be honest, that each database connection uses a different worker than the one that was first authorized for that target, though I understand it may be necessary for spreading the load.)

Any help would be appreciated

@alejandro-belleza-gl could you clarify whether your controllers are directly accessible from the workers or if they are also behind some NAT device?

For reference, we have submitted a bug in the project repo.

1 Like

Access to the controllers’ cluster port is load-balanced with an NLB.

This issue in version 0.12.0 was fixed in this PR: Use tofu token from controller by irenarindos · Pull Request #3064 · hashicorp/boundary · GitHub

I’ve tested the build and it works without issues: N workers behind an NLB.

Thanks for your help @macmiranda! The details that you provided in the GitHub issue were critical to identifying the bug.

1 Like

Awesome! 0.12.1 will be out soon containing this fix.