We currently operate two separate Nomad/Consul/Vault clusters that are connected to each other through a Nomad-controlled mesh gateway to facilitate cluster peering, with the intent of adding more data centers in the future.
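(For context on that setup: routing the peering control plane through the mesh gateways is enabled with a mesh config entry along these lines. This is a sketch rather than our exact config.)

```json
{
  "Kind": "mesh",
  "Peering": {
    "PeerThroughMeshGateways": true
  }
}
```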
For the sake of examples, I’ll call the two data centers “d-mwsv” and “d-mw3p”.
They contain the following services:
| Service   | Data Center |
|-----------|-------------|
| myclient  | d-mwsv      |
| myservice | d-mw3p      |
Apps in d-mwsv communicate with upstreams in d-mw3p by using the Nomad upstreams block with the destination_peer field set.
So the Nomad job for “myclient” contains an upstreams block that looks like this:
upstreams {
  destination_name = "myservice"
  destination_peer = "d-mw3p"
  local_bind_port  = 9001
}
Communication from “myclient” to “myservice” works just fine this way.
The confusion for us comes in when we need to get around the default 15s request timeout (Envoy's default route timeout).
When the communication between services remains in a single data center, setting up a “service-router” like this works great:
{
  "Kind": "service-router",
  "Name": "some-other-service-in-d-mwsv",
  "Routes": [
    {
      "Match": {
        "HTTP": {
          "PathPrefix": "/"
        }
      },
      "Destination": {
        "RequestTimeout": "1m0s"
      }
    }
  ]
}
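(These are plain Consul config entries, applied with e.g. `consul config write router.json` — the file name is just illustrative.)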
If the service “myclient” in d-mwsv wants to talk to the service “some-other-service-in-d-mwsv” in the same data center, it automatically gets this one-minute timeout applied in its sidecar configuration. We can see it when dumping the Envoy config:
"dynamic_route_configs": [
{
"route_config": {
"@type": "type.googleapis.com/envoy.config.route.v3.RouteConfiguration",
"name": "some-other-service-in-d-mwsv",
"virtual_hosts": [
{
"name": "some-other-service-in-d-mwsv",
"domains": [
"*"
],
"routes": [
{
"match": {
"prefix": "/"
},
"route": {
"cluster": "some-other-service-in-d-mwsv.default.d-mwsv.internal.84c86036-c135-66ea-4740-762440d9c83d.consul",
"timeout": "60s"
}
},
{
"match": {
"prefix": "/"
},
"route": {
"cluster": "some-other-service-in-d-mwsv.default.d-mwsv.internal.84c86036-c135-66ea-4740-762440d9c83d.consul"
}
}
]
}
]
},
"last_updated": "2024-09-19T13:44:19.721Z"
}
While this works great for connections that stay within a single data center, it does not work when the connection crosses to a peered data center.
For example, the following service router config in d-mw3p does not get picked up by clients coming from d-mwsv:
{
  "Kind": "service-router",
  "Name": "myservice",
  "Routes": [
    {
      "Match": {
        "HTTP": {
          "PathPrefix": "/"
        }
      },
      "Destination": {
        "RequestTimeout": "1m0s"
      }
    }
  ]
}
Clients in d-mwsv instead get this Envoy config:
{
  "route_config": {
    "@type": "type.googleapis.com/envoy.config.route.v3.RouteConfiguration",
    "name": "myservice?peer=d-mw3p",
    "virtual_hosts": [
      {
        "name": "myservice.default.d-mw3p",
        "domains": [
          "*"
        ],
        "routes": [
          {
            "match": {
              "prefix": "/"
            },
            "route": {
              "cluster": "myservice.default.d-mw3p.external.c80dc7ba-6260-7957-84b8-6d9fc46e7f8e.consul"
            }
          }
        ]
      }
    ]
  },
  "last_updated": "2024-09-17T20:55:36.125Z"
}
^ Note the missing timeout.
A workaround has been identified: follow the guide here, but use a resolver redirect instead of a failover. This seems more complicated to manage, though, as it requires us to track the lifecycle of two extra Consul config entries and to manage the timeout for a service in every data center that may use it as an upstream.
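To make that concrete, this is roughly what we understand the workaround to look like for “myservice”, written in the consuming data center d-mwsv (a sketch based on our reading of the guide, not finalized config): a service-resolver that redirects the local name to the peer, plus a local service-router that carries the timeout.

```json
{
  "Kind": "service-resolver",
  "Name": "myservice",
  "Redirect": {
    "Service": "myservice",
    "Peer": "d-mw3p"
  }
}
```

```json
{
  "Kind": "service-router",
  "Name": "myservice",
  "Routes": [
    {
      "Match": {
        "HTTP": {
          "PathPrefix": "/"
        }
      },
      "Destination": {
        "RequestTimeout": "1m0s"
      }
    }
  ]
}
```

As we understand it, the client upstream then targets the local “myservice” name without destination_peer and lets the resolver do the redirect, and both entries have to be repeated in every data center that consumes the service.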
To state it differently:
Today we manage timeouts by maintaining a single service-router config entry for the destination service. All clients pick the timeout up dynamically.
If we use the workaround above, we end up doing the same thing, but with the addition of a service-resolver and a service-router in every additional data center that wants to send requests to the peered service.
Is there something we are missing to make the single service-router config work?
Thanks,