Consul Connect + Envoy - gRPC issues

Hello Nomad team and community.

I am having trouble configuring Consul Connect with the Envoy proxy on AWS, and I would appreciate some guidance on how to proceed or troubleshoot it. In short, the connect-proxy task is logging warnings like this to stderr:

[2021-04-23 02:03:39.373][1][warning][config] [bazel-out/k8-opt/bin/source/common/config/_virtual_includes/grpc_stream_lib/common/config/grpc_stream.h:87] gRPC config stream closed: 14, upstream connect error or disconnect/reset before headers. reset reason: connection termination

I am running a cluster on consul-1.9.5, nomad-1.0.4, and envoy-1.16.2.
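
For completeness, my understanding from the Nomad Connect integration docs is that the Consul client agents need both Connect and the gRPC port enabled for Envoy's xDS stream to work. A minimal sketch of the relevant agent stanzas (not my exact config, just the parts I believe matter here):

ports {
  grpc = 8502
}

connect {
  enabled = true
}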

Here is a test job that uses Consul Connect (I dropped the health check for the time being, otherwise the deployment gets stuck; I sketch after the job spec how I plan to add it back):

job "http-connect" {
 datacenters = ["us-east-1c"]

 group "echo" {
   network {
     mode = "bridge"
   }

   service {
     name = "http-connect"
     port = "8080"

     connect {
       sidecar_service {}
     }
   }

   task "server" {
     driver = "docker"

     config {
       image = "hashicorp/http-echo:latest"

       args = [
         "-listen", ":8080",
         "-text", "Hello and welcome to http-echo running on port 8080",
       ]
     }
   }
 }
}
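
For reference, here is how I plan to add the health check back later. My understanding is that in bridge mode the check has to be exposed through the Envoy sidecar, hence expose = true; the path, interval, and timeout are just placeholders:

    service {
      name = "http-connect"
      port = "8080"

      check {
        type     = "http"
        path     = "/"
        interval = "10s"
        timeout  = "2s"
        expose   = true
      }

      connect {
        sidecar_service {}
      }
    }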

This job can be successfully deployed with “nomad run” and the services get registered in Consul. I can also use “nomad alloc exec” to get inside the connect-proxy container.

Security groups in AWS are configured to accept traffic on ports 8300, 8301, 8302, 8400, 8500, 8502, 8600, and 21000-21255 (pretty much what is listed in “Required Ports | Consul by HashiCorp”, plus the extra 8400 port). All outbound traffic is allowed.

Nomad agents open 4646, 4647, and 4648 (both TCP and UDP) as well as the 20000-32000 dynamic port range.

From a connect-proxy instance, I can query Consul server:

$ curl 10.0.1.145:8500
<a href="/ui/">Moved Permanently</a>.

A request to Consul's port 8502 returns something and then the connection is reset (presumably expected, since 8502 is the gRPC/HTTP2 port and curl is speaking HTTP/1.1 here):

$ curl 10.0.1.145:8502 | wc -c
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100    42    0    42    0     0  10388      0 --:--:-- --:--:-- --:--:-- 14000
curl: (56) Recv failure: Connection reset by peer
21

Consul intentions allow traffic from all services to all services.

Nonetheless, curl gets stuck connecting to the gRPC unix socket inside the allocation, and I suspect this is related to the “gRPC config stream closed: 14” error mentioned above.

$ curl --unix-socket /alloc/tmp/consul_grpc.sock http:/v1/config -v
* Trying /alloc/tmp/consul_grpc.sock...
^C
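
As far as I understand, Nomad proxies that socket to the local Consul agent's gRPC port, so I am also planning to double-check on the Nomad client that the local agent really has gRPC enabled, along these lines (assuming its HTTP API is on 127.0.0.1:8500, that jq is available, and that the effective port shows up under DebugConfig.GRPCPort):

$ ss -ltn | grep 8502
$ curl -s 127.0.0.1:8500/v1/agent/self | jq .DebugConfig.GRPCPort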

At this point I am a bit lost, and I would appreciate any ideas about what might be missing or wrong in my setup.

I am likely missing some important detail in this post, but I am happy to share config files or anything else if it helps.

A bit more troubleshooting. I ran consul monitor -log-level debug on all instances. There are no warnings or errors on the Consul server instances, but the Consul agent on the Nomad client shows this:

2021-04-24T18:05:04.351Z [WARN] agent: Check socket connection failed: check=service:_nomad-task-1581dbec-92a2-1956-e313-39d8f7c2bc3d-group-echo-http-connect-8080-sidecar-proxy:1 error="dial tcp 10.0.1.238:25100: connect: connection refused"
2021-04-24T18:05:04.351Z [WARN] agent: Check is now critical: check=service:_nomad-task-1581dbec-92a2-1956-e313-39d8f7c2bc3d-group-echo-http-connect-8080-sidecar-proxy:1

This is expected, as the sidecar proxy is not actually listening on that port. I checked on the Nomad client instance where the application is deployed:

$ netstat -oan | grep 25100 | wc -l
0

Nonetheless, the Nomad server shows that port 25100 has been assigned to the sidecar proxy.
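
To cross-check from the Consul side what port the sidecar actually registered with, I can also query the local agent (again assuming the HTTP API is on 127.0.0.1:8500 and jq is available):

$ curl -s 127.0.0.1:8500/v1/agent/services | jq 'to_entries[] | select(.key | contains("sidecar")) | {id: .key, port: .value.Port}'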

Also, there are containers running on the instance:

$ docker ps --no-trunc
CONTAINER ID                                                       IMAGE                                                                                              COMMAND                                                                                                  CREATED              STATUS              PORTS     NAMES
367a1b94b3b8d7b6e6666dddf0b1bad8cd8491b69c653644227024ab89dff35f   hashicorp/http-echo:latest                                                                         "/http-echo -listen :8080 -text 'Hello and welcome to http-echo running on port 8080'"                   About a minute ago   Up About a minute             server-c86d38e8-c0ac-d001-142a-46408695cebf
53e56869810ff925a1d6063b49ae253dd2ee43d2774bd6e13674acdcada48d5b   envoyproxy/envoy:v1.11.2@sha256:a7769160c9c1a55bb8d07a3b71ce5d64f72b1f665f10d81aa1581bc3cf850d09   "/docker-entrypoint.sh -c /secrets/envoy_bootstrap.json -l info --concurrency 1 --disable-hot-restart"   About a minute ago   Up About a minute             connect-proxy-http-connect-c86d38e8-c0ac-d001-142a-46408695cebf
883e96e43504bf53af1d38a65445174af9fe90f076561388718bd6f7724035eb   gcr.io/google_containers/pause-amd64:3.1                                                           "/pause"                                                                                                 About a minute ago   Up About a minute             nomad_init_c86d38e8-c0ac-d001-142a-46408695cebf

However, the sidecar proxy is not exposing its port. I am curious which part of the configuration is responsible for that; could it be a silly mistake in my Nomad config?
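
In case it helps, I also plan to inspect the generated Envoy bootstrap to see where its xDS cluster points (the task name matches the sidecar container above, and I am assuming the agent cluster is still named local_agent in the bootstrap):

$ nomad alloc exec -task connect-proxy-http-connect c86d38e8 cat secrets/envoy_bootstrap.json | grep -A 5 local_agent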