Consul Connect: how to improve performance

We’re in the process of encrypting traffic between services in our datacenter. We’re hosting several services on virtual machines that run on our own hypervisors, so we have full control of all servers. We have set up a Consul cluster on three VMs and are also running a Nomad cluster on those same three VMs.

We tried encrypting traffic between our loadbalancers and our login services by using Consul Connect with sidecar proxies. However, the dev team of the login services did some performance testing and found that response times had roughly doubled after switching traffic over to the sidecar proxies. Several endpoints and some static content of the login service were tested: the baseline response times, without encryption between loadbalancer and login service, were roughly between 150 and 250 ms, while the response times with Consul Connect were roughly between 250 and 500 ms. The only change was sending traffic through the sidecars instead of directly to the upstream services.

| request | avg response time without Connect (>10k requests) | avg response time with Connect (>10k requests) |
| --- | --- | --- |
| endpoint1 | 165 ms | 349 ms |
| endpoint2 | 174 ms | 349 ms |
| endpoint3 | 228 ms | 510 ms |
| /favicon.ico | 156 ms | 223 ms |

Our login services are .NET applications hosted on IIS on Windows Server 2019; because we’re running on Windows there, we use the built-in Consul Connect proxy on that side. Our loadbalancers run Nginx in Docker, deployed by Nomad, so we can use the Nomad and Consul integration there and are using Envoy proxies in Docker.
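For reference, the login service is registered with the local Consul agent on each Windows host with a sidecar service entry, roughly like the sketch below (the port and everything apart from the service name are simplified assumptions here):

# Consul agent service definition on a Windows login server (sketch; the local
# port is an assumption and the real registration has more fields).
service {
  name = "login-backend-service"
  port = 80   # local port IIS listens on (assumption)

  connect {
    # Registers the sidecar proxy placeholder that
    # "consul connect proxy -sidecar-for login-backend-service" attaches to.
    sidecar_service {}
  }
}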

Our Nomad job for the loadbalancer looks similar to this:

job "loadbalancer" {
  datacenters = ["dc"]
  type = "system"

  group "containers" {

    network {
      mode = "bridge"
      port "https" {
        static = 443
        to     = 443
      }
    }

    service {
      name = "loadbalancer"
      port = "https"
      connect {
        sidecar_service {
          proxy {
            upstreams {
              destination_name = "login-backend-service"
              local_bind_port = 8000
            }
          }
        }
      }
    }

    task "nginx" {
      template {
        data = <<EOF
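  # The keepalive pool below, together with proxy_http_version 1.1 and the empty
  # Connection header in the location block, lets nginx reuse its connections to
  # the local Connect sidecar bound on 127.0.0.1:8000.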
  upstream login-backend-upstream {
    server 127.0.0.1:8000;
    keepalive 50;
  }

  server {
    listen 443 ssl http2;
    server_name login.__domain__;

    location / {
      proxy_pass http://login-backend-upstream;
      proxy_http_version 1.1;
      proxy_set_header Connection "";
      proxy_set_header Host $host;
    }
  }
EOF
        destination = "local/conf.d/default.conf"
        change_mode = "signal"
        change_signal = "SIGHUP"
      }

      driver = "docker"

      config {
        image = "ourcontainerregistry.io/loadbalancer:__ContainerVersion__"
        auth = {
          username = "nomad"
          password = "password"
        }

        volumes = [
          "local/conf.d/:/etc/nginx/conf.d/"
        ]
      }      
    }
  }
}

Our Connect sidecar proxy on Windows is started by a Nomad job similar to this:

job "login-backend-sidecar" {
  datacenters = ["dc"]
  type = "system"

  constraint {
    attribute = "${attr.kernel.name}"
    value     = "windows"
  }
  constraint {
    attribute = "${attr.unique.hostname}"
    operator = "regexp"
    value = "loginserver.*"
  }

  group "sidecar" {
    task "start-sidecar" {
      driver = "raw_exec"
      kill_timeout = "10s"
      artifact {
        source = "https://releases.hashicorp.com/consul/{{ consul_version }}/consul_{{ consul_version }}_windows_amd64.zip"
        destination = "local/consul"
      }
      config {
        command = "local/consul/consul.exe"
        args = ["connect", "proxy", "-token", "{{ consul_token }}", "--sidecar-for", "login-backend-service"]
      }
    }
  }
}

I don’t think response times should roughly double just because we encrypt calls within our own datacenter; I’d expect the TLS overhead to be much smaller. However, I’m quite clueless about which knobs and dials are available to improve performance. The obvious first thing to verify is whether every single call from the loadbalancer to the upstream service opens a new TCP connection and therefore triggers a new TLS handshake. I believe we have configured our loadbalancer to use keepalive connections to the local sidecar proxy, but that doesn’t necessarily mean that connections between the local and the remote sidecar are kept open? Then again, I have no idea how to verify when handshakes take place, nor how to verify if and where connections stay open.
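Would something like watching the Envoy admin stats on the loadbalancer side even show this? A sketch of what I mean (untried; the admin address is an assumption, since consul connect envoy defaults to 127.0.0.1:19000 but Nomad may bind it elsewhere inside the task’s network namespace):

# Sketch: compare the upstream connection and TLS handshake counters of the
# loadbalancer's Envoy sidecar before and after a burst of requests.
# Run from inside the allocation's network namespace, e.g. via `nomad alloc exec`.
curl -s http://127.0.0.1:19000/stats \
  | grep -E 'login-backend-service.*(upstream_cx_total|upstream_cx_active|ssl\.handshake)'

My understanding is that if upstream_cx_total and ssl.handshake grow roughly one per request, connections (and handshakes) between the sidecars are not being reused, but I may be reading those stats wrong.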

Are there any guidelines on investigating what is going on and how to improve our performance? We wouldn’t mind intentions no longer being revocable immediately (because of long-lived keepalive connections) if we get better response times in return. I’m happy to provide more information if needed.
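One thing I wasn’t sure about, for example, is whether declaring the service protocol as HTTP would make the Envoy side pool its upstream connections, and whether that even matters with the built-in proxy on the destination side. A sketch of the kind of thing I mean (untested; assuming our Consul version supports config entries):

# Consul service-defaults config entry, applied with `consul config write`.
# Only meaningful if the login service speaks plain HTTP behind the proxy.
Kind     = "service-defaults"
Name     = "login-backend-service"
Protocol = "http"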

Hi @jrnijboer,

Are you seeing this high latency directly between Envoy instances, or only between Envoy and the built-in proxy?

I ask because, if it is only the latter, I wanted to call out that the built-in proxy is primarily meant for basic development or testing of Connect, and not for production use (see https://www.consul.io/docs/connect/proxies/built-in). We don’t perform any performance testing or tuning against it as we have done with Envoy, and as such it’s possible there is some issue that is causing the increased latency.

Would you mind also sharing which versions of Consul, Nomad, and Envoy you’re using in your environment?

Thanks.