Edge Load Balancing for Consul Connect

Hello,
I am new to the whole service mesh world and its patterns, so sorry if I write some nonsense.

We are planning to build a service mesh using Consul + Envoy, which looked like a complete solution. But after some reading and digging, it appears that this combination only covers internal service-to-service communication and discovery; for traffic coming from outside, such as the internet, Envoy is not supported yet, as per here and here.

Possible solutions to enable an edge load balancer with dynamic service discovery are:

I am writing here to get an opinion or recommendation from a HashiCorp representative on which approach to take, as they might be familiar with the roadmap.

Any other opinions are also welcome, but I do not want to be tied to a solution that only supports Kubernetes.

Thank you.

Hi,
I don’t represent HashiCorp but can say that I was able to configure Envoy to operate as an Edge proxy to a Consul service mesh. Consul provides enough configurability via ‘Escape Hatches’ including overriding the Envoy bootstrap template to allow it to work.

You will have to become familiar with the Envoy configuration format, which has a fairly steep learning curve, but it is probably worthwhile if you will be supporting a Consul Connect / Envoy mesh in production.

I plan to write up the details of what I did, and I’ll share once that’s available.

Thanks,
Matt

Great question @vasilij-icabbi.

I wanted to cross-post a response from a GitHub issue for folks looking here, as it’s understandably a common question! The original (plus some extra clarification) is here: https://github.com/hashicorp/consul/issues/5695#issuecomment-539185301


Hi all, thanks for all of your feedback, just wanted to jump in to say this is certainly on our radar.

There are a few options that exist today and I’ll share some rough plans too for the near future.

Our original approach here is that we really don’t want to get sucked into building “North-South” edge proxy or API Gateway features into Connect at this point. That’s a whole other product line, and while it overlaps a lot with service mesh (east-west) L7 traffic routing etc., there are also a lot of subtle and not-so-subtle differences in what makes a good solution (e.g. public TLS provisioning, user authentication integrations, different security and traffic logging features etc.).

So we aim to provide integrations with existing third-party “API Gateway” products that already have all of those features, rather than attempt to compete with them and get pulled into building features that don’t benefit the core service-to-service use case. It’s actually pretty easy to integrate just enough to get a client certificate and get traffic into Connect (L7 stuff is harder). So if you are a proxy developer, or already use a specific edge proxy and would consider building Connect support as a PR, please let us know and we’d be happy to help guide that.

In that vein we’ve worked closely with the folks at Datawire, who have shipped an integration with Ambassador: https://www.getambassador.io/user-guide/consul/. We know that’s Kubernetes-only, but we are also working with many more vendors to have close integrations with them.

That said, we’ve recognized that aligning with ecosystem partners takes time. In many cases users need to bridge the gap a little - not full external “edge” or “north/south” API gateways into Connect, but more like internal gateways for service traffic between newer Connect-enabled in-mesh applications and older legacy systems that are not yet in the mesh for one reason or another.

So to help with both of those cases we are planning two things in the near term (no dates I’m afraid just yet but concrete plans on roadmap):

  1. Support for fetching Connect certificates in consul-template. That means it will be possible to easily configure edge proxies like HAProxy and Nginx the same way you always have via consul-template, but have them able to get Connect certs and so send traffic directly to Connect-enabled backends without another hop through a sidecar. This will still leave users free to configure whatever they need for the “frontend” part of the edge proxy, of course.
  2. A basic Envoy-based ingress proxy that allows you to expose a subset of services in the mesh without requiring mTLS authentication. This is not going to be designed as a full API gateway feature - by default it will just expose those services to anything that can access the proxy with no AuthZ.
    • We’ll probably allow “escape hatch” style raw Envoy config to be specified that allows injection of custom filters/TLS config into the “public” listener of the ingress proxy, which means it will be possible to configure custom auth, TLS certs, IP filtering or other rules while benefitting from automatic config for Connect identity and L7 routing to backends. Design TBD - this will be advanced functionality for those familiar with Envoy config, but it at least provides a way to benefit from the Connect backends without building your own thing from scratch.

Any feedback on this approach is welcome!

@sl1pm4t would be very interested to see a rough version of what you did with escape hatches. If you have time to share the config that would be really useful feedback for us (see last message for why!). Even raw configs as a gist would be great if writing up notes is too time consuming right away. Thanks!

@banks I’ve forked @nic’s consul-demo-traffic-splitting repo and added an Edge proxy example.

The two key configuration pieces of the example are:

  • the custom envoy.yaml configuration to configure the public listeners on port 80 / 443, and a static list of HTTP hosts / path routes.
  • the services.hcl file that defines the edge-proxy service in Consul and all its upstream clusters.

It had been months since I first worked on this Edge proxy configuration and when I went back and looked at the config I realized I had not used any Consul escape hatches, but instead created a custom envoy.yaml to bootstrap Envoy. In the envoy.yaml, the envoy frontend listener is given an HTTP routing configuration that maps hostnames and/or paths to named upstream clusters. However the cluster definitions are not provided statically, and will be provided by Consul dynamically. This also allows Consul to configure envoy with the mTLS certs & keys for communication with the rest of the service mesh.
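To make the routing idea concrete, here is a minimal Python sketch of the kind of hostname/path-to-cluster mapping described above, emitted as JSON (Envoy accepts JSON as well as YAML for its configuration). This is not the envoy.yaml from the repo: the helper `route_config`, the names `edge_routes` and `edge`, and the cluster names `web` and `api` are all hypothetical. Only cluster names appear here; the cluster definitions themselves would still come from Consul.

```python
import json


def route_config(routes, domains=("*",)):
    """Build an Envoy-style route_config dict (illustrative only).

    `routes` is a list of (path_prefix, cluster_name) pairs. Envoy uses the
    first route that matches, so longer prefixes are sorted to the front.
    """
    ordered = sorted(routes, key=lambda r: len(r[0]), reverse=True)
    return {
        "name": "edge_routes",  # hypothetical route config name
        "virtual_hosts": [
            {
                "name": "edge",  # hypothetical virtual host name
                "domains": list(domains),
                "routes": [
                    {"match": {"prefix": prefix}, "route": {"cluster": cluster}}
                    for prefix, cluster in ordered
                ],
            }
        ],
    }


# Hypothetical mapping: "/" goes to the web service, "/api" to the API service.
config = route_config([("/", "web"), ("/api", "api")])
print(json.dumps(config, indent=2))
```

The cluster names passed in must match whatever Consul ends up naming the upstream clusters, which is the difficulty discussed next.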

The tricky part is in knowing in advance what Consul will name the upstream clusters when it generates the envoy cluster configuration.
In Consul versions before 1.6, it was fairly predictable, and an upstream service named web would be represented by an Envoy cluster configuration called web. However, in Consul 1.6+ the cluster name includes the Consul datacenter and cluster UUID,
e.g. web.default.dc1.internal.f9dd0678-47c1-dc0a-b2cd-6d424cf73955.consul

It is almost possible to achieve what I’ve done by configuring a custom envoy_bootstrap_json_tpl escape hatch. However, the cluster UUID is not one of the interpolated variables when Consul renders envoy_bootstrap_json_tpl, making it difficult to generate the correct upstream cluster names.
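As a rough illustration of the naming problem, the sketch below rebuilds the Consul 1.6+ cluster name from its parts. The pattern is inferred only from the single example name above; reading `default` as a namespace and `<uuid>.consul` as the trust domain is an assumption, and `upstream_cluster_name` is a hypothetical helper, not anything Consul provides.

```python
def upstream_cluster_name(service, datacenter, trust_domain, namespace="default"):
    """Guess the Envoy cluster name Consul 1.6+ generates for an upstream.

    Pattern inferred from one observed name:
    <service>.<namespace>.<datacenter>.internal.<trust_domain>
    """
    return f"{service}.{namespace}.{datacenter}.internal.{trust_domain}"


name = upstream_cluster_name(
    "web", "dc1", "f9dd0678-47c1-dc0a-b2cd-6d424cf73955.consul"
)
print(name)
```

It is worth comparing the result against the cluster names Envoy actually receives in your own setup before relying on it in a template.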

Running the Example

Steps for spinning up the example stack using docker-compose:

1 - clone the repo

git clone https://github.com/sl1pm4t/consul-demo-traffic-splitting.git

2 - use docker-compose to bring up all containers:

cd consul-demo-traffic-splitting
git checkout edge-proxy
docker-compose up -d

3 - while testing this example I almost always hit this Consul bug which means the edge proxy service doesn’t get registered with Consul at startup, and subsequently the consul agent does not supply cluster configuration to envoy. This will be seen as a 503 response in step 4 below.
As a workaround, trigger a consul configuration reload:

docker exec -it consul-demo-traffic-splitting_edge_consul_1 consul reload

4 - attempt to browse / curl the edge proxy (listening on port 80 & 443)

Web Service

$ curl -k https://localhost:443/
Hello World
###Upstream Data: localhost:9091###
  Service V1

API Service

$ curl -k https://localhost:443/api
Service V1

Thanks for this, that’s great detail.

The cluster’s UUID should never change and can be seen if you hit the /v1/connect/ca/roots endpoint (TrustDomain field). So that might help when generating a custom template.
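A minimal Python sketch of reading that field, assuming a Consul agent reachable at `http://127.0.0.1:8500` (an assumption; adjust for your setup). The function names are hypothetical, and parsing is kept separate from the HTTP call so it can be tried against a saved response.

```python
import json
import urllib.request


def parse_trust_domain(body):
    """Extract the TrustDomain field from a /v1/connect/ca/roots response body."""
    return json.loads(body)["TrustDomain"]


def fetch_trust_domain(agent="http://127.0.0.1:8500", token=None):
    """GET /v1/connect/ca/roots from the agent and return its TrustDomain.

    `agent` is an assumed address; `token` is only needed when ACLs are on.
    """
    req = urllib.request.Request(f"{agent}/v1/connect/ca/roots")
    if token:
        req.add_header("X-Consul-Token", token)
    with urllib.request.urlopen(req, timeout=10) as resp:
        return parse_trust_domain(resp.read())
```

Calling `fetch_trust_domain()` before rendering a template would give the value needed to fill in the cluster names.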

Overall what you’ve done here is very similar to what we plan to build as a “basic ingress” just without having to figure all this out for yourself.

The other downside of the approach taken here is that it will bypass any L7 routing in Consul 1.6.0 (Presumably you worked on this prior to that even being available so it’s understandable). If you want the Routing and splitting rules configured in Consul to be respected by the edge proxy then I’m not sure it’s possible at all right now since those are all injected dynamically so adding a static listener at bootstrap time like this won’t work.

It might be possible still to hook into the named routes from a custom listener like this and have domain/SNI based routing choose which upstream and route set to use but that would be pretty involved envoy config! I’ve not tried it so not 100% sure if it’s possible currently.

At any rate we plan to make this all easier so this is great context to see what you did here.

Question for anyone who gets here. Which of the following options would work for you?

  1. Expose all services through an “ingress” proxy on separate ports.
    • Pros: really simple to build and works for all protocols, the only option for TCP services
    • Cons: Need to expose N ports to access N services through the proxy
  2. Expose http(1/2/grpc) services through an “ingress” proxy on a single port, using Hostname to address the required service.
    • Pros: clean and natural, no need for an extra layer of routing edge->service to expose services.
    • Cons: Need to have external clients resolve Consul DNS so that we can provide the IP(s) of the ingress proxies for service-specific names. e.g. <service>.ingress.consul
  3. Expose http(1/2/grpc) services through an “ingress” proxy on a single port, using a path prefix to address the required service.
    • Pros: no need for external Consul DNS resolution - can just use proxy IPs or raw hostnames
    • Cons: need to have a whole new way to configure the mapping of path prefix to service (or just stick with /<servicename>) and likely change clients who didn’t know about the prefix before when they connected directly etc.

It occurs to me writing this up that we’ll always need 1 for non-HTTP and we could do both 2 and 3 pretty easily especially if we just use a convention rather than a whole new routing layer to map path prefix to service so it might be something we can leave up to users at runtime.

Option 2 is more work to build though so would be good to hear if people would need/use that option over the others if available.
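To compare options 2 and 3 side by side, here is a small Python sketch of how a proxy could pick the target service by convention alone. It assumes the `<service>.ingress.consul` hostname form and the `/<servicename>` path prefix mentioned above; both helper functions are hypothetical and are not part of Consul or Envoy.

```python
def service_from_host(host, suffix=".ingress.consul"):
    """Option 2: derive the service name from the Host header.

    Returns None when the host does not follow the assumed convention.
    """
    host = host.split(":", 1)[0].lower()  # drop any port
    if host.endswith(suffix) and len(host) > len(suffix):
        return host[: -len(suffix)]
    return None


def service_from_path(path):
    """Option 3: derive the service name from the first path segment.

    Returns (service, remaining_path), or None when there is no first segment.
    """
    parts = path.lstrip("/").split("/", 1)
    service = parts[0]
    if not service:
        return None
    rest = "/" + (parts[1] if len(parts) > 1 else "")
    return service, rest


print(service_from_host("web.ingress.consul:443"))
print(service_from_path("/api/v1/users"))
```

The path-based version also has to strip the prefix before forwarding, which is the client-visible change listed as a con of option 3.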

Thanks, that’s how I’m currently grabbing the TrustDomain before rendering the envoy.yaml from a template.

You might be interested to know, I’m also using this value when I need to generate an mTLS key outside of Consul that will be used by devices to communicate with the mesh. Specifically, I’m generating a TLS key on an F5 load balancer (where it’s not possible to run the Consul Agent directly), and getting it signed by the Vault CA so the F5 can communicate directly with backend services. As you know, the Trust Domain is used in the mTLS certificates URI SAN (SPIFFE ID) so it is accepted / understood by the rest of the mesh.
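For reference, a short Python sketch of how the trust domain ends up in the URI SAN. The `spiffe://<trust-domain>/ns/<namespace>/dc/<datacenter>/svc/<service>` layout is my understanding of how Connect encodes service identities, so check it against a cert issued by your own cluster. The helper name and the service name `f5-edge` are hypothetical.

```python
def service_spiffe_id(trust_domain, datacenter, service, namespace="default"):
    """Build the SPIFFE ID placed in a Connect certificate's URI SAN.

    The layout is an assumption based on Consul's documented identity format.
    """
    return f"spiffe://{trust_domain}/ns/{namespace}/dc/{datacenter}/svc/{service}"


# Hypothetical service name for the load balancer's identity.
uri_san = service_spiffe_id(
    "f9dd0678-47c1-dc0a-b2cd-6d424cf73955.consul", "dc1", "f5-edge"
)
print(uri_san)
```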

Interesting Matt,

While that will work, it removes a bunch of the value from Connect’s CA management: the F5 becomes responsible for rotating that cert etc. and won’t automatically have it managed by Consul if it is getting it directly from Vault. We also typically use very short-lived certs in Connect (72 hour lifetime), which is unlikely to be practical with that approach.

Have you considered having the F5 still get its cert from a Consul agent running on another host? Ideally it would always use the same agent, or a small pool of them, since certs are cached on the agent. That would allow it not only to get a cert but also to long-poll that agent to see when it needs to rotate, whether because the roots change or the cert expires.

Yes, I would have preferred this approach, but I couldn’t find a way to get a signed cert out of Consul (without going through the Envoy xDS API). Is there a Consul HTTP API endpoint that takes a CSR and returns a signed cert?

Yes there is!

https://www.consul.io/api/agent/connect.html#service-leaf-certificate

You need a valid ACL token that has service:write for the service you are trying to get a cert for. That endpoint plus the /roots endpoint at the same level should be all you need to get certs to participate in Connect. This is how our built-in proxy and “Native” integration SDK work, so it’s very much first-class (it actually predates xDS support and has all the same underlying mechanisms for caching, rotation etc. if you use blocking queries against the leaf and roots).
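A minimal Python sketch of calling the leaf endpoint with a blocking query, for anyone wiring this up by hand. The agent address, the helper names, and the `/v1/agent/connect/ca/leaf/<service>` path are my reading of the linked docs page, as are the `CertPEM` and `PrivateKeyPEM` response fields, so treat them as assumptions and check against your Consul version. As far as I can tell the agent generates the key pair itself rather than accepting a CSR.

```python
import json
import urllib.parse
import urllib.request

AGENT = "http://127.0.0.1:8500"  # assumed local agent address


def leaf_url(service, index=None, agent=AGENT):
    """Build the leaf certificate URL; `index` turns it into a blocking query."""
    url = f"{agent}/v1/agent/connect/ca/leaf/{urllib.parse.quote(service, safe='')}"
    if index is not None:
        url += f"?index={index}"
    return url


def fetch_leaf(service, token, index=None, agent=AGENT):
    """Fetch the leaf cert for `service` using an ACL token with service:write.

    Returns (cert_pem, key_pem, next_index). Passing next_index back in makes
    the next call block until the cert changes or the agent's wait time ends.
    """
    req = urllib.request.Request(
        leaf_url(service, index, agent), headers={"X-Consul-Token": token}
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
        next_index = resp.headers.get("X-Consul-Index")
    return body["CertPEM"], body["PrivateKeyPEM"], next_index
```

A rotation loop would call `fetch_leaf` repeatedly, feeding each returned index into the next call and reloading the proxy whenever the returned cert differs from the one in use.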