Honoring Upstream Peer Service Router Configurations

sluebbert · September 19, 2024, 2:44pm

We currently operate two separate Nomad/Consul/Vault clusters. They are peered with each other through a Nomad-controlled mesh gateway, and we intend to add more data centers in the future.

For the sake of examples, I’ll call the two data centers “d-mwsv” and “d-mw3p”.
They contain the following services:

Service	Data Center
myclient	d-mwsv
myservice	d-mw3p

Apps in d-mwsv communicate with upstreams in d-mw3p by using the Nomad upstream block with the destination_peer field filled out.

So the Nomad job for “myclient” would contain an upstream block that looks like this:

upstreams {
    destination_name = "myservice"
    destination_peer = "d-mw3p"
    local_bind_port = 9001
}

Communication from “myclient” to “myservice” works just fine this way.

Our confusion starts when we need to get around the default 15s request timeout.

When the communication between services remains in a single data center, setting up a “service-router” like this works great:

{
    "Kind": "service-router",
    "Name": "some-other-service-in-d-mwsv",
    "Routes": [
        {
            "Match": {
                "HTTP": {
                    "PathPrefix": "/"
                }
            },
            "Destination": {
                "RequestTimeout": "1m0s"
            }
        }
    ]
}

If the service “myclient” in d-mwsv wants to talk to the service “some-other-service-in-d-mwsv” in the same data center, it automatically gets this 1 minute timeout assigned to its sidecar configuration. We can see it when dumping the configs:

"dynamic_route_configs": [
    {
     "route_config": {
      "@type": "type.googleapis.com/envoy.config.route.v3.RouteConfiguration",
      "name": "some-other-service-in-d-mwsv",
      "virtual_hosts": [
       {
        "name": "some-other-service-in-d-mwsv",
        "domains": [
         "*"
        ],
        "routes": [
         {
          "match": {
           "prefix": "/"
          },
          "route": {
           "cluster": "some-other-service-in-d-mwsv.default.d-mwsv.internal.84c86036-c135-66ea-4740-762440d9c83d.consul",
           "timeout": "60s"
          }
         },
         {
          "match": {
           "prefix": "/"
          },
          "route": {
           "cluster": "some-other-service-in-d-mwsv.default.d-mwsv.internal.84c86036-c135-66ea-4740-762440d9c83d.consul"
          }
         }
        ]
       }
      ]
     },
     "last_updated": "2024-09-19T13:44:19.721Z"
    }

While this works great for connections that stay within a single data center, it does not work for connections between peered data centers.

For example, the following service router config in d-mw3p does not get picked up by clients coming from d-mwsv:

{
    "Kind": "service-router",
    "Name": "myservice",
    "Routes": [
        {
            "Match": {
                "HTTP": {
                    "PathPrefix": "/"
                }
            },
            "Destination": {
                "RequestTimeout": "1m0s"
            }
        }
    ]
}

Clients in d-mwsv get this Envoy config:

{
     "route_config": {
      "@type": "type.googleapis.com/envoy.config.route.v3.RouteConfiguration",
      "name": "myservice?peer=d-mw3p",
      "virtual_hosts": [
       {
        "name": "myservice.default.d-mw3p",
        "domains": [
         "*"
        ],
        "routes": [
         {
          "match": {
           "prefix": "/"
          },
          "route": {
           "cluster": "myservice.default.d-mw3p.external.c80dc7ba-6260-7957-84b8-6d9fc46e7f8e.consul"
          }
         }
        ]
       }
      ]
     },
     "last_updated": "2024-09-17T20:55:36.125Z"
    }

^ Note the missing timeout.

We have identified a workaround: follow the guide here, but use a resolver redirect instead of a failover. This seems more complicated to manage, though, because it requires us to track the lifecycle of two extra Consul config entries and to manage the timeout for a service in every data center that may use it as an upstream.
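For reference, here is a rough sketch of what that workaround looks like as a pair of config entries written in d-mwsv. The name "myservice-d-mw3p" is a made-up placeholder for a local virtual service, and this assumes the virtual service is already treated as HTTP (for example via proxy-defaults), so it may need adjusting for other setups.

```json
{
    "Kind": "service-resolver",
    "Name": "myservice-d-mw3p",
    "Redirect": {
        "Service": "myservice",
        "Peer": "d-mw3p"
    }
}
```

```json
{
    "Kind": "service-router",
    "Name": "myservice-d-mw3p",
    "Routes": [
        {
            "Match": {
                "HTTP": {
                    "PathPrefix": "/"
                }
            },
            "Destination": {
                "RequestTimeout": "1m0s"
            }
        }
    ]
}
```

With this in place, the upstream block for "myclient" would point destination_name at the placeholder "myservice-d-mw3p" and drop destination_peer, since the redirect handles the hop to the peer.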

To state it differently:
Today we manage timeouts by managing a single service-router config for the destination service. All clients dynamically pick the timeout up.

If we use the workaround above, we still maintain the service-router for the destination service, and additionally need a service-resolver and a service-router in every other data center that wants to send requests to the peer.

Is there something we are missing to make the single service-router config work?

Thanks,
