Access to multiple AWS account VPCs

Hey folks,

I haven’t rolled out Boundary to any of my clients yet as I’m giving it a bit more time to bake, but I am really excited about the prospect of it. The problem of client teams accessing private cluster resources is something I run into all the time, and the fact that I can use a Hashi product which is made for Terraform usage from the get-go to solve it will be truly awesome!

On to my question: I want to confirm my understanding of how this tool would work in an AWS multi-account / multi-VPC environment. Here is what I understand I would need to do to accomplish this:

  1. I would run the Boundary controllers in an HA fashion in a “system” VPC.
  2. I would run Boundary workers in each child / environment AWS account’s VPC.
  3. I would need VPC peering between my system VPC and my environment VPCs.
  • Maybe this is not needed if I make the worker externally available to the Boundary controllers? Is that a security concern? Obviously, VPC peering is a PITA, so I’d prefer to avoid it.
  4. Each Boundary worker, hosted in a public subnet, would need network access to the internal cluster resources that I want to expose to my client teams. (Rough config sketch below.)
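
To make #4 concrete, here’s a minimal sketch of what I picture each environment worker’s config looking like. Every name and address here is a placeholder of mine, not something from the docs:

```hcl
# Hypothetical worker config for one environment VPC; all names and
# addresses below are placeholders.
listener "tcp" {
  address = "0.0.0.0:9202"
  purpose = "proxy" # client sessions are proxied through this listener
}

worker {
  name        = "dev-worker-1"
  description = "Boundary worker in the dev account VPC"

  # Controllers live in the "system" VPC and would be reached over
  # whatever route exists (peering, public addressing, etc.).
  controllers = ["boundary-controller.system.internal:9201"]

  # The address clients use to reach this worker from outside the VPC.
  public_addr = "boundary-worker.dev.example.com"
}
```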

Is that understanding correct? Has anyone successfully implemented something similar and can speak to #3 above or anything I’m missing?

Thanks!

Thanks for trying out Boundary, @Gowiem.

This is correct: the client needs a route to both the controllers and the workers, and the workers need a route to the target. The important piece is “needs a route”; as long as that foundational aspect is taken care of, things should work regardless of how that route is established (VPC peering or otherwise).
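
For example, if you do go the peering route, the “needs a route” piece boils down to something like this in Terraform. The account layout, resource names, and route tables here are purely illustrative:

```hcl
# Illustrative sketch only: peer the "system" VPC (controllers) with an
# environment VPC (worker + targets) and add routes in both directions.
resource "aws_vpc_peering_connection" "system_to_env" {
  vpc_id      = aws_vpc.system.id # hosts the Boundary controllers
  peer_vpc_id = aws_vpc.env.id    # hosts a Boundary worker
  auto_accept = true              # works for same-account peering
}

resource "aws_route" "system_to_env" {
  route_table_id            = aws_vpc.system.main_route_table_id
  destination_cidr_block    = aws_vpc.env.cidr_block
  vpc_peering_connection_id = aws_vpc_peering_connection.system_to_env.id
}

resource "aws_route" "env_to_system" {
  route_table_id            = aws_vpc.env.main_route_table_id
  destination_cidr_block    = aws_vpc.system.cidr_block
  vpc_peering_connection_id = aws_vpc_peering_connection.system_to_env.id
}
```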

A good starting place is our reference architecture: https://github.com/hashicorp/boundary-reference-architecture

Let me know if you have any other questions!

Gotcha @malnick, thanks for confirming!

I’ve checked the ref arch, and while I think it helps conceptualize the tool well, it’s problematic in that most Boundary users are not going to want access to only one VPC, or to run an instance of the entire Boundary system per VPC. I guess that is some feedback for ref arch v2!

I’m going to jump on this thread since @Gowiem did such a good job of phrasing my own questions on this.

@malnick The reference architecture has quite a few layers that make a lot of sense. But when looking at deploying this in, say, 8 regions, with multiple VPCs/accounts per region, it opens up a lot of questions. I’ll try to elucidate a few:

  1. Your design says that a front-end LB is required, but it doesn’t indicate what problems the LB should be designed to solve (aside from availability, which is implied). Elucidation of the key concerns would help us choose the appropriate LB mechanism.

    • Should users be routed to the same controller whenever possible?
    • Are there limitations that would make GLB-style balancing inappropriate?
    • Or phrased another way, what should the LB do that round-robin DNS does not provide? (other than random-at-best balancing :wink: )
  2. It seems to suggest that you could have a single set of controllers, with workers in each region/VPC… but I don’t see anything about how to route requests to the right workers. Is there a way I’m overlooking by which any given worker might only have access to a subset of resources?

    I could see this aligning with projects, perhaps identifying workers that can service resources in a specific project?

  3. Given that the controllers share a Postgres database, one assumes the controllers need to be in the same region for latency purposes… but that’s not actually spelled out anywhere. Tell us about the database utilization: is it low enough, or are the queries unique enough, that cross-region synchronization might not be a problem? (Rough sketch of my mental model below.)
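
For #3, my mental model is that “share” means every controller’s config points at the same database, roughly like this; the URL is a placeholder of mine:

```hcl
# Sketch of the controller stanza as I understand it; the URL is a
# placeholder. All controllers in a cluster point at this one database.
controller {
  name = "controller-usw2-1"

  database {
    url = "postgresql://boundary:secret@boundary-db.internal:5432/boundary"
  }
}
```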

Obviously, all of these questions are going to come with big caveats like “at this time” and “we can’t commit to it remaining this way”, but some visibility into what is known and planned can help us do our own designs better.

I’ll share with you a bit of how this matters for us, as we have:

  • multiple regions with completely independent implementations of our (non-boundary) stack per region

  • an assumption that the Postgres DB can only be safely used by controllers in the same region

    • This limits us to having a minimum of one set of controllers per region
  • tightly controlled VPCs where only workers in a given VPC/subnet could access nodes

    • No obvious worker routing means a new set of controllers per set of workers

Given that none of these implementations needs more than a single worker for load (+1 for redundancy), it appears the implementation would have to be a pair of nodes per implementation performing both controller and worker functionality (sketched below)… which is >50 nodes for the initial rollout (and >25 independent Terraform modules).
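
Each of those pairs would presumably run a single `boundary server` process carrying both stanzas in one config, something like this sketch (every name, address, and the DB URL is made up):

```hcl
# Sketch of one combined controller+worker node per implementation;
# all names, addresses, and the DB URL are made up.
controller {
  name = "usw2-vpc1-node1"

  database {
    url = "postgresql://boundary:secret@db.usw2-vpc1.internal:5432/boundary"
  }
}

worker {
  name        = "usw2-vpc1-node1"
  controllers = ["127.0.0.1"] # co-located controller
}

listener "tcp" {
  address = "0.0.0.0:9200"
  purpose = "api" # client API traffic
}

listener "tcp" {
  address = "0.0.0.0:9201"
  purpose = "cluster" # controller <-> worker coordination
}

listener "tcp" {
  address = "0.0.0.0:9202"
  purpose = "proxy" # client session traffic
}
```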

I’m really hoping that you can help us better understand the needs such that a better design with more common/shared resources might be possible.

Thanks all for the feedback here. The reference architecture is not meant to cover anything beyond the most basic use cases. It’s meant to help people conceptualize the basics of how Boundary can be used in a simple environment. These are example environments, and we purposefully don’t try to hit every possible configuration, because everyone’s architecture is going to be different.

That being said, a couple of points I’d like to make:

  1. The LB: this is optional - you can use an LB or not; it’s really up to you. As a reference architecture, we felt it was nice to include as a way to show how an HA setup could work. But it’s certainly not a requirement for using Boundary.

  2. Routing to controllers: controllers are stateless; users can be routed to any controller.

  3. Routing to workers: when a client requests a new session, a worker picks up the job from the controller, and the client is then automatically connected through that worker to the end target. It doesn’t matter which worker the client connects through, but if the client loses connectivity with the worker, it will lose its session.

  4. DB latency: yes, it’s prudent to run the DB on the same network as the controllers for performance reasons. However, what is not performant for one end user may be fine for another, so we leave it to the operator to determine the best architecture for them.
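
To make the LB point concrete: if you do use one, it only needs to front the controllers’ api listener; workers dial the cluster listener on their own. Roughly (the addresses are just examples):

```hcl
# Example controller listeners. An LB, if used, fronts the "api"
# listener; workers connect to the "cluster" listener directly.
listener "tcp" {
  address = "0.0.0.0:9200"
  purpose = "api" # users/CLI -- put the optional LB in front of this
}

listener "tcp" {
  address = "0.0.0.0:9201"
  purpose = "cluster" # workers connect here
}
```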


Hi @malnick! Thank you for the additional insight. Could you clarify this bit?

It doesn’t matter which worker the client connects through

In a hypothetical multi-VPC deployment, where a given worker may have access to only a subset of the registered targets, how would this work?

E.g., imagine a setup where we have 3 AWS accounts: “system”, “dev”, and “prod”. “system” hosts the Boundary control plane; “dev” and “prod” each have one VPC with one Postgres DB instance and one Boundary worker. “dev” and “prod” are completely isolated from each other, and only users and the Boundary control plane can access them via the “system” account VPC. Would I be able to access each DB independently even if the “dev” Boundary worker cannot access the “prod” Postgres? Or should every worker have access to ALL registered targets? (Sketch of what I mean below.)
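
For concreteness, I’d expect to register the two databases as separate targets, something like this sketch with the Boundary Terraform provider (all names and references are made up):

```hcl
# Hypothetical target registrations; the scopes and host sets referenced
# here are made up for illustration.
resource "boundary_target" "dev_postgres" {
  name         = "dev-postgres"
  type         = "tcp"
  default_port = 5432
  scope_id     = boundary_scope.dev_project.id
  host_set_ids = [boundary_host_set.dev_dbs.id]
}

resource "boundary_target" "prod_postgres" {
  name         = "prod-postgres"
  type         = "tcp"
  default_port = 5432
  scope_id     = boundary_scope.prod_project.id
  host_set_ids = [boundary_host_set.prod_dbs.id]
}
```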

I hope my description is clear enough, happy to clarify if needed :sweat_smile:

In a hypothetical multi-VPC deployment, where a given worker may have access to only a subset of the registered targets, how would this work?

Workers are not currently aware of which targets they have access to. In your example, it’s possible that a client would establish a session through a worker in one VPC while the target is in another. This would result in failure if there’s no route between those VPCs.

That being said, “target-aware workers” is a good feature request - I’m going to rope in our PM @PPacent to chime in on that.


@MatteoJoliveau target-aware workers are being actively discussed amongst our team right now. We don’t yet have a timeline for this feature’s delivery, but it is absolutely in our vision for Boundary. We recognize the need to handle situations like the one you cited, where users need to route sessions to given targets through specific workers.


Thank you both, this is wonderful news!
We’re going to keep an eye on the roadmap and come back to this at a later date.

Or perhaps worker selection on the client or the controller? It seems that today a worker is assigned by the controller and handed to the client.

I don’t think the workers need to be aware of which targets they can reach; rather, the controller and/or client need to know which worker set to use…

@jorhett I agree your terminology here is more precise, but we are referring to the same logical requirement: that the control plane routes traffic to some targets through specific workers, based on which workers have network access to the target.

For now, the only workaround is for all workers to have connectivity to all targets. However, we recognize this isn’t ideal for all use cases, and it’s something we would like to address in the future.
