Consul Connect connect_timeout best practices (1.10, service-defaults upstream configs)

We’ve been preparing to start testing Consul 1.10, and it’s interesting to me to see that with the new UpstreamConfig configuration for services, that ConnectTimeoutMs has the following note:

NOTE: The connect timeout of a service should ideally be configured via the connect_timeout field of a service-resolver config entry for the upstream destination service. Configuring it in a proxy upstream config will not fully enable some L7 features. It is supported here for backwards compatibility with Consul versions prior to 1.6.0.

This looks like copy & paste from the original proxy.upstreams[].config field in a proxy service definition: Envoy Proxy Configuration | Service Mesh | Consul | HashiCorp Developer. I understand the limitations of there, but given service-defaults works through the discovery chain & everything else modern I was curious if this recommendation still applies here?

I’ve had a bit of a dig through the source and haven’t been able to see which L7 functionality would be lost by configuring the connect timeout on a service-defaults instead of a service-resolver. Is there something or should this note be removed?

The reason I ask is, it seems like service-defaults is a pretty reasonable spot to configure for upstreams these timeouts or at least the default for them. It seems correct that because you’re configuring the originating Envoy’s timeouts that you would want to do it on the originating service’s service-defaults. If you have a service-splitter or service-resolver in place you might have different concrete services which require different timeouts (hence why configurability exists at those config entry layers), but UpstreamConfig might be a good spot to configure defaults.

As a concrete example:

Kind = "service-defaults"
Name = "counting"

UpstreamConfig = {
  Defaults = {
    ConnectTimeoutMs = 1000
  }

  Overrides = [
    {
      Name = "dashboard"
      ConnectTimeoutMs = 2000
    }
  ]
}

vs

Kind           = "service-resolver"
Name           = "dashboard"
ConnectTimeout = "2s"

Given we’re controlling the timeout for counting -> counting Envoy (THIS ONE) -> dashboard Envoy -> dashboard, it does make sense to me that the configuration exists for counting.

The other reason I ask is, I’d love to implement RequestTimeout as an UpstreamConfig option, instead of having to create a service-router for every service to configure it (in our situation, we prefer every originating/egress Envoy to have no outbound request timeout – we enforce these on the ingress side). Before I do so & contribute upstream, I’d like to make sure this change is aligned.

I’d actually forgotten about @freddygv’s comment back to me here: Consul connect provide a way to configure envoy route timeout · Issue #6382 · hashicorp/consul · GitHub

Regarding the note about connect timeout, there are two points there. First is that yes we generally want to move away from per-instance configuration. However, if the discovery chain isn’t being configured, I can see the benefit of allowing this flag to be set elsewhere. Now that we have centralized upstream config in service-defaults it wouldn’t be exclusively in individual proxy registrations.

I think both points here make sense to me - it does seem to put forward that connect_timeout_ms on a servie-defaults is here to stay, so maybe the deprecation note in the docs should be removed?

I’m wondering if it makes sense before 1.10 drops to refactor the timeouts configuration on UpstreamConfig to:

Timeouts {
  Connect = "0s" // with actual support for duration values instead of just milliseconds (at parity with other configurations)
  Request = "0s"
}
1 Like