Envoy proxy returning 503s during canary deployments

We are currently running a number of services on Nomad, and we are now
moving their service-to-service communication from an AWS load balancer
to Consul Connect. The problem is that we get a few 503s during
redeployments.

Our Nomad services always use canary (blue/green) deployments to keep
rollouts safe.

update {
  max_parallel = 3
  canary = 3
}
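
For context, a fuller sketch of the update stanza: the fields beyond
max_parallel and canary are just assumed Nomad defaults, shown mainly to
make explicit that canaries are promoted by our deployment process rather
than with auto_promote.

update {
  max_parallel = 3
  canary       = 3

  # Assumed Nomad defaults, shown for illustration only.
  min_healthy_time = "10s"
  healthy_deadline = "5m"
  auto_revert      = false

  # Canaries are promoted by our deployment process, not automatically.
  auto_promote = false
}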

The services talk to each other through Envoy sidecars. The following
example shows two services configured the same way ours are.

group "service-a" {
  count = 2

  service {
    name = "service-a"
    port = "service"

    check {
      type = "http"
      port = "service"
      path = "/health"
    }

    connect {
      sidecar_task {
        config {
          image = "envoyproxy/envoy:v1.25-latest"
        }
      }

      sidecar_service {
        proxy {
          upstreams {
            destination_name = "service-b"
            local_bind_port  = 8000
          }
        }
      }
    }
  }

  network {
    mode = "bridge"
    port "service" {}
  }

  shutdown_delay = "10s"

  task "service-a" {
    driver = "docker"

    config {
      image = "service-a"
      ports = ["service"]
    }

    env {
      LISTEN_PORT       = "${NOMAD_PORT_service}"
      SERVICE_B_ADDRESS = "${NOMAD_UPSTREAM_ADDR_service_b}"
    }
  }
}

group "service-b" {
  count = 2

  service {
    name = "service-b"
    port = "service"

    check {
      type = "http"
      port = "service"
      path = "/health"
    }

    connect {
      sidecar_task {
        config {
          image = "envoyproxy/envoy:v1.25-latest"
        }
      }

      sidecar_service {
        proxy {}
      }
    }
  }

  network {
    mode = "bridge"
    port "service" {}
  }

  shutdown_delay = "10s"

  task "service-b" {
    driver = "docker"

    config {
      image = "service-b"
      ports = ["service"]
    }

    env {
      LISTEN_PORT = "${NOMAD_PORT_service}"
    }
  }
}
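
To make the traffic flow explicit: because of the upstream defined on
service-a, calls to service-b go through service-a's local Envoy listener.
With the config above, service-a's env block effectively resolves to
something like:

env {
  LISTEN_PORT       = "${NOMAD_PORT_service}"
  # NOMAD_UPSTREAM_ADDR_service_b is the sidecar's local bind address
  # for the upstream, i.e. 127.0.0.1:8000 with local_bind_port = 8000.
  SERVICE_B_ADDRESS = "127.0.0.1:8000"
}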

These can talk to each other without any problems, and everything looks
good until we deploy a new version. Taking a redeployment of service-b as
an example:

1. Nomad starts two new instances of service-b and registers them in
   Consul.
2. Once the new instances are healthy, service-a starts using both the
   old and the new instances of service-b.
3. After some time, our deployment process promotes the deployment.
4. Nomad then deregisters the old instances from Consul, waits 10 seconds
   (the shutdown_delay), and kills them.
5. During those 10 seconds the old instances are no longer listed in
   Consul, but Envoy still routes requests to them, since their health
   checks still return 200.
6. When the old instances are finally killed, Envoy still routes a few
   requests to them before it stops using them, and those requests fail
   with 503.
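
To spell out the timing, this is roughly how the shutdown knobs map onto
that sequence; in the example above only shutdown_delay is set, so
kill_signal and kill_timeout below are Nomad's defaults, included purely
for illustration:

group "service-b" {
  # On promotion, Nomad deregisters the old allocations from Consul and
  # then waits shutdown_delay before signalling the tasks. During this
  # window the instances are gone from Consul but still running, and
  # Envoy keeps sending them traffic.
  shutdown_delay = "10s"

  task "service-b" {
    # After the delay the task receives kill_signal and has kill_timeout
    # to exit before being force-killed. Requests that Envoy still routes
    # here once the process has exited are the ones that fail with 503.
    kill_signal  = "SIGINT" # Nomad default, for illustration
    kill_timeout = "5s"     # Nomad default, for illustration
  }
}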

As far as we can tell, we are following the recommended way of setting
these services up together, but we still see these 503s. Is there
something we can configure in this setup to keep Envoy from routing
traffic to the old instances?