Edge Load Balancing for Consul Connect

Hello,
I am new to the whole service mesh world and its patterns, so apologies if I write some nonsense.

We are planning to build a service mesh using Consul + Envoy, which looked like a complete solution. But after some reading and digging, it appears that this combination only covers internal service-to-service communication and discovery; for traffic coming from outside (for example, from the internet), Envoy is not supported yet, as per here and here.

Possible solutions to enable edge load balancer with dynamic service discovery are:

I am writing here to get an opinion or recommendation from a HashiCorp representative on which approach to take, as they might be familiar with the roadmap.

Other opinions are welcome too, but I do not want to be tied to a solution that only supports Kubernetes.

Thank you.

Hi,
I don't represent HashiCorp, but I can say that I was able to configure Envoy to operate as an edge proxy for a Consul service mesh. Consul provides enough configurability via 'Escape Hatches', including overriding the Envoy bootstrap template, to make this work.

You will have to become familiar with the Envoy configuration format, which has a fairly steep learning curve, but it is probably worthwhile if you will be supporting a Consul Connect / Envoy mesh in production.

I plan to write up the details of what I did, and I'll share once that's available.

Thanks,
Matt

Great question @vasilij-icabbi.

I wanted to cross-post a response from a GitHub issue for folks looking here, as it's understandably a common question! The original (plus some extra clarification) is here: https://github.com/hashicorp/consul/issues/5695#issuecomment-539185301


Hi all, thanks for all of your feedback, just wanted to jump in to say this is certainly on our radar.

There are a few options that exist today and I'll share some rough plans too for the near future.

Our original approach here is that we really don't want to get sucked into building "North-South" edge proxy or API Gateway features into Connect at this point - that's a whole other product line, and while it has a lot of overlap with service mesh (east-west) L7 traffic routing etc., there are also a lot of subtle and not-so-subtle differences in what makes a good solution (e.g. public TLS provisioning, user authentication integrations, different security and traffic logging features etc.).

So we aim to provide integrations with existing third-party "API Gateway" products that already have all of those features rather than attempt to compete with them and get pulled into building features that don't benefit the core service-to-service use case. It's actually pretty easy to integrate just to get a client certificate and get traffic into Connect (L7 stuff is harder), so any proxy developers or people who already use a specific edge proxy and would consider building Connect support as a PR, please let us know and we'd be happy to help guide that.

In that vein we've worked closely with the folks at Datawire, who have shipped an integration with Ambassador: https://www.getambassador.io/user-guide/consul/. We know that's Kube-only, but we are also working with many more vendors to have close integrations with them.

That said, we've recognized that aligning with ecosystem partners takes time. In many cases users need to bridge the gap a little - not full external "edge" or "north/south" API gateways into Connect, but more like internal gateways for service traffic between newer Connect-enabled in-mesh applications and older legacy systems that are not yet in the mesh for one reason or another.

So to help with both of those cases we are planning two things in the near term (no dates just yet I'm afraid, but concrete plans on the roadmap):

  1. Support for fetching Connect certificates in consul-template. That means it will be possible to easily configure edge proxies like HAProxy and Nginx the same way you always have via consul-template, but have them able to get Connect certs and so send traffic directly to Connect-enabled backends without another hop through a sidecar. This will still leave users free to configure whatever they need for the "frontend" part of the edge proxy of course.
  2. A basic Envoy-based ingress proxy that allows you to expose a subset of services in the mesh without requiring mTLS authentication. This is not going to be designed as a full API gateway feature - by default it will just expose those services to anything that can access the proxy with no AuthZ.
    • We'll probably allow "escape hatch" style raw Envoy config to be specified that allows injection of custom filters/TLS config into the "public" listener of the ingress proxy, which means it will be possible to configure custom auth, TLS certs, IP filtering or other rules while benefitting from automatic config for Connect identity and L7 routing to backends. Design TBD - this will be advanced functionality for those familiar with Envoy config, but it at least provides a way to benefit from the Connect backends without building your own thing from scratch.

Any feedback on this approach is welcome!

@sl1pm4t would be very interested to see a rough version of what you did with escape hatches. If you have time to share the config that would be really useful feedback for us (see last message for why!). Even raw configs as a gist would be great if writing up notes is too time consuming right away. Thanks!

@banks I've forked @nic's consul-demo-traffic-splitting repo and added an Edge proxy example.

The two key configuration pieces of the example are:

  • the custom envoy.yaml configuration to configure the public listeners on port 80 / 443, and a static list of HTTP hosts / path routes.
  • the services.hcl file that defines the edge-proxy service in Consul and all its upstream clusters.

It had been months since I first worked on this Edge proxy configuration and when I went back and looked at the config I realized I had not used any Consul escape hatches, but instead created a custom envoy.yaml to bootstrap Envoy. In the envoy.yaml, the envoy frontend listener is given an HTTP routing configuration that maps hostnames and/or paths to named upstream clusters. However the cluster definitions are not provided statically, and will be provided by Consul dynamically. This also allows Consul to configure envoy with the mTLS certs & keys for communication with the rest of the service mesh.

The tricky part is knowing in advance what Consul will name the upstream clusters when it generates the Envoy cluster configuration.
In Consul versions before 1.6, it was fairly predictable: an upstream service named web would be represented by an Envoy cluster configuration called web. However, in Consul 1.6+ the cluster name includes the Consul cluster DC and UUID,
e.g. web.default.dc1.internal.f9dd0678-47c1-dc0a-b2cd-6d424cf73955.consul

It is almost possible to achieve what I've done by configuring a custom envoy_bootstrap_json_tpl escape hatch. However, the cluster UUID is not one of the interpolated variables when Consul renders envoy_bootstrap_json_tpl, which makes it difficult to generate the correct upstream cluster names.
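The naming pattern from the example above can be sketched in Python. This is only an illustration, assuming the name is always `<service>.<namespace>.<datacenter>.internal.<trust domain>` as in the single sample shown; `upstream_cluster_name` is a hypothetical helper, not part of Consul, and the format may differ between Consul versions.

```python
def upstream_cluster_name(service, trust_domain, namespace="default", datacenter="dc1"):
    """Compose the Envoy cluster name Consul 1.6+ appears to generate.

    The layout is inferred from the one sample name in this thread and is
    an assumption, not a documented contract.
    """
    return f"{service}.{namespace}.{datacenter}.internal.{trust_domain}"


# trust_domain here is the "<uuid>.consul" part of the sample name.
print(upstream_cluster_name("web", "f9dd0678-47c1-dc0a-b2cd-6d424cf73955.consul"))
```

A template renderer for envoy.yaml could call a helper like this for each route target, provided the UUID part is obtained from somewhere other than the bootstrap template variables.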

Running the Example

Steps for spinning up the example stack using docker-compose:

1 - clone the repo

git clone https://github.com/sl1pm4t/consul-demo-traffic-splitting.git

2 - use docker-compose to bring up all containers:

cd consul-demo-traffic-splitting
git checkout edge-proxy
docker-compose up -d

3 - while testing this example I almost always hit this Consul bug, which means the edge proxy service doesn't get registered with Consul at startup, and subsequently the Consul agent does not supply cluster configuration to Envoy. This shows up as a 503 response in step 4 below.
As a workaround, trigger a consul configuration reload:

docker exec -it consul-demo-traffic-splitting_edge_consul_1 consul reload

4 - attempt to browse / curl the edge proxy (listening on port 80 & 443)

Web Service

$ curl -k https://localhost:443/
Hello World
###Upstream Data: localhost:9091###
  Service V1

API Service

$ curl -k https://localhost:443/api
Service V1

Thanks for this, that's great detail.

The cluster's UUID should never change and can be seen if you hit the /v1/connect/ca/roots endpoint (TrustDomain field). So that might help when generating a custom template.
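As a rough sketch of reading that field, the Python below queries the roots endpoint and pulls out TrustDomain. The agent address `http://127.0.0.1:8500` and the helper names are assumptions for illustration, and the sample payload is trimmed to a few fields with placeholder values rather than copied from a real response.

```python
import json
import urllib.request


def trust_domain_from_roots(payload):
    """Extract the TrustDomain field from a /v1/connect/ca/roots response body."""
    return json.loads(payload)["TrustDomain"]


def fetch_trust_domain(consul_addr="http://127.0.0.1:8500"):
    """Query a Consul agent (the address is an assumption) for the trust domain."""
    with urllib.request.urlopen(f"{consul_addr}/v1/connect/ca/roots") as resp:
        return trust_domain_from_roots(resp.read())


# Illustrative payload only; a real response carries full root certificate data.
sample = '{"ActiveRootID": "example", "TrustDomain": "f9dd0678-47c1-dc0a-b2cd-6d424cf73955.consul", "Roots": []}'
print(trust_domain_from_roots(sample))
```

If ACLs are enabled, the request would presumably also need a token header; that is left out here to keep the sketch short.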

Overall, what you've done here is very similar to what we plan to build as a "basic ingress", just without having to figure all this out for yourself.

The other downside of the approach taken here is that it will bypass any L7 routing in Consul 1.6.0 (presumably you worked on this before that was even available, so it's understandable). If you want the routing and splitting rules configured in Consul to be respected by the edge proxy, then I'm not sure it's possible at all right now, since those are all injected dynamically, so adding a static listener at bootstrap time like this won't work.

It might still be possible to hook into the named routes from a custom listener like this and have domain/SNI-based routing choose which upstream and route set to use, but that would be pretty involved Envoy config! I've not tried it, so I'm not 100% sure if it's possible currently.

At any rate we plan to make this all easier so this is great context to see what you did here.

Question for anyone who gets here. Which of the following options would work for you?

  1. Expose all services through an "ingress" proxy on separate ports.
    • Pros: really simple to build and works for all protocols, the only option for TCP services
    • Cons: Need to expose N ports to access N services through the proxy
  2. Expose http(1/2/grpc) services through an "ingress" proxy on a single port, using Hostname to address the required service.
    • Pros: clean and natural, no need for an extra layer of routing edge->service to expose services.
    • Cons: Need to have external clients resolve Consul DNS so that we can provide the IP(s) of the ingress proxies for service-specific names. e.g. <service>.ingress.consul
  3. Expose http(1/2/grpc) services through an "ingress" proxy on a single port, using a path prefix to address the required service.
    • Pros: no need for external Consul DNS resolution - can just use proxy IPs or raw hostnames
    • Cons: need to have a whole new way to configure the mapping of path prefix to service (or just stick with /<servicename>) and likely change clients who didn't know about the prefix before when they connected directly etc.

It occurs to me, writing this up, that we'll always need 1 for non-HTTP, and we could do both 2 and 3 pretty easily, especially if we just use a convention rather than a whole new routing layer to map path prefix to service, so it might be something we can leave up to users at runtime.

Option 2 is more work to build though so would be good to hear if people would need/use that option over the others if available.
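To make the difference between options 2 and 3 concrete, here is a small Python sketch of the two conventions described above: `<service>.ingress.consul` in the Host header versus a `/<servicename>` path prefix. Both function names are hypothetical, and this only illustrates the lookup logic, not how an ingress proxy would implement it.

```python
def service_from_host(host, suffix=".ingress.consul"):
    """Option 2: map a Host header such as 'web.ingress.consul' to 'web'."""
    host = host.split(":", 1)[0].lower()  # drop an optional port
    if host.endswith(suffix) and len(host) > len(suffix):
        return host[: -len(suffix)]
    return None


def service_from_path(path):
    """Option 3: map '/<servicename>/rest' to (service, remaining path)."""
    parts = path.lstrip("/").split("/", 1)
    if not parts[0]:
        return None, path  # no prefix present, nothing to route on
    rest = "/" + parts[1] if len(parts) > 1 else "/"
    return parts[0], rest


print(service_from_host("web.ingress.consul:8080"))
print(service_from_path("/api/v1/users"))
```

Note how option 3 has to strip the prefix before forwarding, which is the client-visible change called out in its cons, while option 2 leaves the path untouched.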

Thanks, that's how I'm currently grabbing the TrustDomain before rendering the envoy.yaml from a template.

You might be interested to know that I'm also using this value when I need to generate an mTLS key outside of Consul that will be used by devices to communicate with the mesh. Specifically, I'm generating a TLS key on an F5 load balancer (where it's not possible to run the Consul agent directly) and getting it signed by the Vault CA so the F5 can communicate directly with backend services. As you know, the Trust Domain is used in the mTLS certificate's URI SAN (SPIFFE ID), so it is accepted / understood by the rest of the mesh.
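For illustration, the URI SAN for such a certificate could be assembled as below. The `spiffe://<trust domain>/ns/<namespace>/dc/<datacenter>/svc/<service>` layout is what Connect service identities look like as far as I know, but treat it as an assumption and compare against a certificate issued by your own cluster; `connect_spiffe_id` is a hypothetical helper.

```python
def connect_spiffe_id(trust_domain, service, namespace="default", datacenter="dc1"):
    """Build the SPIFFE ID to place in the certificate's URI SAN.

    The path layout is an assumption based on Connect service identities;
    verify it against a leaf certificate from your own mesh.
    """
    return f"spiffe://{trust_domain}/ns/{namespace}/dc/{datacenter}/svc/{service}"


print(connect_spiffe_id("f9dd0678-47c1-dc0a-b2cd-6d424cf73955.consul", "edge-proxy"))
```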

Interesting Matt,

While that will work, it removes a bunch of the value from Connect's CA management - the F5 becomes responsible for rotating that cert etc., and the cert won't automatically be managed by Consul if the F5 gets it directly from Vault. We also typically use very short-lived certs in Connect (72 hour lifetime), which is unlikely to be possible with that approach.

Have you considered having the F5 still get its cert from a Consul agent running on another host? It's ideal if it uses the same agent, or a small pool of them, since certs are cached, but that would allow it not only to get a cert but also to long poll that agent to see when it needs to rotate if roots change or the cert expires etc.

Yes, I would have preferred this approach, but I couldn't find a way to get a signed cert out of Consul (without going through the Envoy xDS API) - is there a Consul HTTP API endpoint that takes a CSR and returns a signed cert?

Yes there is!

https://www.consul.io/api/agent/connect.html#service-leaf-certificate

You need a valid ACL token that has service:write for the service you are trying to get a cert for, but that and the /roots endpoint at the same level should be all you need to get certs to participate in Connect. This is how our built-in proxy and "Native" integration SDK work, so it's very much first-class (it actually predates xDS support and has all the same underlying mechanisms for caching, rotation etc. if you use blocking queries against the leaf and roots).
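A minimal sketch of polling that leaf endpoint with a blocking query might look like the following. The agent address, the helper names, and the `on_change` callback are assumptions for illustration; error handling, timeouts, and backoff are deliberately left out, and the response field names (`CertPEM`, `PrivateKeyPEM`) should be checked against the API documentation linked above.

```python
import json
import urllib.parse
import urllib.request


def leaf_request(service, token, index=None, consul_addr="http://127.0.0.1:8500"):
    """Build a GET request for the agent's leaf certificate endpoint."""
    name = urllib.parse.quote(service, safe="")
    url = f"{consul_addr}/v1/agent/connect/ca/leaf/{name}"
    if index is not None:
        # Passing the last seen index turns this into a blocking (long poll) query.
        url += "?" + urllib.parse.urlencode({"index": index})
    return urllib.request.Request(url, headers={"X-Consul-Token": token})


def watch_leaf(service, token, on_change):
    """Loop forever, calling on_change(cert_pem, key_pem) when the leaf changes."""
    index = None
    while True:
        with urllib.request.urlopen(leaf_request(service, token, index)) as resp:
            leaf = json.load(resp)
            new_index = resp.headers.get("X-Consul-Index")
        if new_index != index:
            on_change(leaf["CertPEM"], leaf["PrivateKeyPEM"])
            index = new_index


# Only builds the request; nothing is sent until watch_leaf is called.
print(leaf_request("edge-proxy", "example-token", index=42).full_url)
```

A device that cannot run an agent itself could run a loop like this against a nearby agent and reload its TLS material in `on_change`, which keeps rotation under Consul's control.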

Hard to say, as all options have their use cases, but I would say options 3 and 1 are closer to what I would look for, with option 3 better suited to our current use cases. :thinking:

Hi Paul,

Thank you for these proposals.

My primary use case would be mainly option 3 to handle http traffic, then option 1 to expose specific tcp/udp services (not all by default).

With option 3 I would expect to be able to route traffic not only with a path prefix but with a whole url prefix. In that way option 2 and 3 would work the same way, as one could dynamically add the consul service hostname (<service>.ingress.consul) and the path prefix (/<servicename>) to the list of url prefixes associated to the service.

The url prefixes associated with the service may be stored in a service tag dedicated to the ingress proxy, like fabio does, or added to the service definition (which could ease integration with other products but requires more work).
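As a sketch of the tag-based variant, the Python below parses fabio-style `urlprefix-` tags into host and path pairs, so that a single list can carry both the hostname form and the path-prefix form. The `url_prefixes` helper is hypothetical and handles only the basic `urlprefix-host/path` shape, ignoring any options that follow a space.

```python
def url_prefixes(tags, marker="urlprefix-"):
    """Return (host, path) pairs from tags like 'urlprefix-web.ingress.consul/'."""
    routes = []
    for tag in tags:
        if not tag.startswith(marker):
            continue  # unrelated service tag
        prefix = tag[len(marker):].split(" ", 1)[0]  # ignore trailing options
        host, slash, path = prefix.partition("/")
        routes.append((host, slash + path))  # empty host means "any host"
    return routes


tags = ["v1", "urlprefix-/web", "urlprefix-web.ingress.consul/"]
print(url_prefixes(tags))
```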

We just released Consul 1.8 which now includes a built-in Ingress gateway powered by Envoy. See [consul] ANN: Consul 1.8.0 beta1 Released for details.

We'd like to hear any feedback you may have on this new feature, and whether it will satisfy your use case.

At least for our use case - edge proxy with service mesh capability - it seems to do the job. Been experimenting with various approaches but this worked right away. Thank you!

By the way, option 3 (path-based) doesn't seem to be in the mix. Any ideas on how we can achieve that? I tried the latest 1.8.0-beta2.

Hi @pvyaka01,

Ingress gateways do support path-based routing. I'm linking the issue you also filed on GitHub where one of our engineers, Chris, provided a configuration example.

Yes, thanks for your response, I can confirm that the provided method works. I came across these in the test cases in GitHub as well.

It'd be awesome for others if this example was also in your documentation for the service-router and config entry. The documentation only states to set the Name to the service being configured; it needs to be clearer that the service doesn't have to actually exist and can be "virtual", where L7 routing can be applied right at the ingress gateway.
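To illustrate the "virtual" service idea, here is a hedged Python sketch that builds a service-router config entry as JSON. This is not the example from the linked issue: the entry name `edge-routes` and the target services `api` and `web` are hypothetical, and the field names should be checked against the service-router documentation for your Consul version.

```python
import json


def virtual_router(name, routes):
    """Build a service-router config entry for a possibly non-existent service.

    `routes` is a list of (path_prefix, destination_service) pairs.
    """
    return {
        "Kind": "service-router",
        "Name": name,  # hypothetical virtual service name
        "Routes": [
            {
                "Match": {"HTTP": {"PathPrefix": prefix}},
                "Destination": {"Service": service},
            }
            for prefix, service in routes
        ],
    }


entry = virtual_router("edge-routes", [("/api", "api"), ("/", "web")])
print(json.dumps(entry, indent=2))
```

The printed JSON could presumably be saved to a file and applied with `consul config write`, with the ingress gateway listener then pointing at the virtual name; routes are listed most specific first on the assumption that the first match wins.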

@ericbrumfield, Thanks for the suggestion. I just opened https://github.com/hashicorp/consul/pull/8672 to improve the documentation for this.

Does this look sufficient to address the gaps you identified? Would you like to see any additional info included?
