I’m going to jump on this thread since @Gowiem did such a good job of phrasing my own questions on this.
@malnick The reference architecture has quite a few layers which make a lot of sense. But when looking at deploying this in say 8 regions, with multiple vpc/accounts per region it opens up a lot of questions. I’ll try to elucidate a few:
-
Your design says that a front-end LB is required, but it doesn’t indicate what problems the LB should be designed to solve (aside from availability, which is implied). Elucidation of the key concerns would help us choose the appropriate LB mechanism.
- Should users be routed to the same controller whenever possible?
- Are there limitations that would make GLB-style balancing inappropriate?
- Or phrased another way, what should the LB do that round-robin DNS does not provide? (other than random-at-best balancing
)
-
It seems to suggest that you could have a single set of controllers, with workers in each region/vpc… but I don’t see anything about how to route requests to the right workers. Is there a way that I’m overlooking where any given worker might only have access to a subset of resources?
I could see this aligning with projects, perhaps identifying workers that can service resources in a specific project?
-
Given that the controllers share a postgres database, one assumes that the controllers need to be in the same region for latency purposes… but that’s not actually spelled out anywhere. Tell us about the database utilization… is it low enough utilization or unique enough queries that cross-region synchronization may not be a problem?
Obviously all of these questions are going to have big caveats like at this time and we can’t commit to it remaining this way but some visibility into what is known and planned for can help us do our own designs better.
I’ll share with you a bit of how this matters for us. As we have
-
multiple regions with completely independent implementations of our (non-boundary) stack per region
-
an assumption that the postgres DB can only be safely used by controllers in the same region
- This limits us to having a minimum of one set of controllers per region
-
tightly controlled VPCs where only workers in a given VPC/subnet could access nodes
- No obvious worker routing, means a new set of controllers per set of workers
Given that none of these implementations need more than a single worker for load (+1 for redundancy) then it appears that the implementation would have to be a pair of nodes per implementation that perform both controller+worker functionality… which is >50 nodes for initial rollout (and >25 independent Terraform modules)
I’m really hoping that you can help us better understand the needs such that a better design with more common/shared resources might be possible.