How to debug Consul sidecar in Nomad? "curl: (56) Recv failure: Connection reset by peer"

I’m building a single-node cluster (for development), with Consul and Nomad set up.

I have two services, weba and webb, and webb can make a request to weba.
Each service responds correctly when queried directly on the Nomad allocation port.
But webb fails to query weba via the Consul sidecar (local_bind_port).
When I run curl localhost:8901 from the container shell, it returns:
curl: (56) Recv failure: Connection reset by peer.
When I examine the Envoy sidecar logs via the Nomad UI, I don’t see any errors.
The systemd logs for nomad and consul also show no errors.

So what’s going on? Where is the failure?

Maybe it’s caused by…

  • Docker network failure? But the curl response suggests Envoy is answering on the port.
  • Envoy sidecar misconfiguration? But there are no errors in its logs.
  • Intention misconfiguration? The UI shows “allow” for webb → weba, but perhaps something is missing. Would intention ACL denials be logged?
  • Consul routing/networking failure? But again, there are no errors.

This is very puzzling. Any ideas?

What actually happens with local_bind_port? In the app container, netstat -pln shows:

tcp        0      0 127.0.0.2:19001         0.0.0.0:*               LISTEN      -                   
tcp        0      0 0.0.0.0:31673           0.0.0.0:*               LISTEN      -                   
tcp        0      0 127.0.0.1:8901          0.0.0.0:*               LISTEN      -                   
tcp6       0      0 :::8202                 :::*                    LISTEN      1/node              

and the Envoy container:

tcp        0      0 0.0.0.0:27693           0.0.0.0:*               LISTEN      -                   
tcp        0      0 127.0.0.2:19001         0.0.0.0:*               LISTEN      -                   
tcp6       0      0 :::8201                 :::*                    LISTEN      -  

So is the port forwarded via iptables, or what is going on?
And why is it listening on IPv6? Envoy’s :::8201 seems to relate to the weba service port 8201.

I’m worried about hitting a Consul failure in production and not knowing where to look for the problem. Do you have any general advice for approaching this type of problem?
For example, using consul connect ... to make a direct connection? What do you do when you have to debug in production?

I suppose the best practice is for the service to implement health checks verifying that it can reach its dependencies? How do you usually implement this? (a third-party tool?)
Should the dependencies be part of the health check? E.g. a /health_dependencies endpoint that checks it can connect to the database on “localhost:12345”… but that could cause a cascading failure if the database goes down, which is harder to debug?

weba.nomad

job "webajob" {
  consul_token = "723c8e29-1d9c-ff6a-0112-03da18e2b21b"
  group "webagroup" {
    network {
      mode = "bridge"
      port "http" { to = 8201 }
    }
    service {
      name = "weba"
      port = "http"

      connect {
        sidecar_service {}
      }
    }
    task "webatask" {
      driver = "docker"
      config {
        image = "webserver:v3"
        ports = ["http"]
      }
      env {
        PORT = 8201
      }
    }
  }
}

webb.nomad

job "webbjob" {
  consul_token = "b8d1c486-6f42-649e-6f47-229bcf01b42d"
  group "webbgroup" {
    network {
      mode = "bridge"
      port "http" { to = 8202 }
    }
    service {
      name = "webb"
      port = "http"

      connect {
        sidecar_service {
          proxy {
            upstreams {
              destination_name = "weba"
              local_bind_port  = 8901
            }
          }
        }
      }
    }
    task "webbtask" {
      driver = "docker"
      config {
        image = "webserver:v3"
        ports = ["http"]
      }
      env {
        PORT       = 8202
        ENDPOINT_A = "http://localhost:8901"
      }
    }
  }
}


envoy log

[2023-07-31 18:07:33.220][1][info][admin] [source/server/admin/admin.cc:66] admin address: 127.0.0.2:19001
[2023-07-31 18:07:33.221][1][info][config] [source/server/configuration_impl.cc:131] loading tracing configuration
[2023-07-31 18:07:33.221][1][info][config] [source/server/configuration_impl.cc:91] loading 0 static secret(s)
[2023-07-31 18:07:33.221][1][info][config] [source/server/configuration_impl.cc:97] loading 1 cluster(s)
[2023-07-31 18:07:33.263][1][info][config] [source/server/configuration_impl.cc:101] loading 0 listener(s)
[2023-07-31 18:07:33.264][1][info][config] [source/server/configuration_impl.cc:113] loading stats configuration
[2023-07-31 18:07:33.264][1][info][runtime] [source/common/runtime/runtime_impl.cc:463] RTDS has finished initialization
[2023-07-31 18:07:33.264][1][info][upstream] [source/common/upstream/cluster_manager_impl.cc:221] cm init: initializing cds
[2023-07-31 18:07:33.264][1][warning][main] [source/server/server.cc:802] there is no configured limit to the number of allowed active connections. Set a limit via the runtime key overload.global_downstream_max_connections
[2023-07-31 18:07:33.264][1][info][main] [source/server/server.cc:923] starting main dispatch loop
[2023-07-31 18:07:33.268][1][info][upstream] [source/common/upstream/cds_api_helper.cc:32] cds: add 2 cluster(s), remove 0 cluster(s)
[2023-07-31 18:07:33.297][1][warning][misc] [source/common/protobuf/message_validator_impl.cc:21] Deprecated field: type envoy.extensions.transport_sockets.tls.v3.CertificateValidationContext Using deprecated option 'envoy.extensions.transport_sockets.tls.v3.CertificateValidationContext.match_subject_alt_names' from file common.proto. This configuration will be removed from Envoy soon. Please see https://www.envoyproxy.io/docs/envoy/latest/version_history/version_history for details. If continued use of this field is absolutely necessary, see https://www.envoyproxy.io/docs/envoy/latest/configuration/operations/runtime#using-runtime-overrides-for-deprecated-features for how to apply a temporary and highly discouraged override.
[2023-07-31 18:07:33.370][1][info][upstream] [source/common/upstream/cds_api_helper.cc:69] cds: added/updated 2 cluster(s), skipped 0 unmodified cluster(s)
[2023-07-31 18:07:33.370][1][info][upstream] [source/common/upstream/cluster_manager_impl.cc:199] cm init: initializing secondary clusters
[2023-07-31 18:07:33.375][1][info][upstream] [source/common/upstream/cluster_manager_impl.cc:225] cm init: all clusters initialized
[2023-07-31 18:07:33.375][1][info][main] [source/server/server.cc:904] all clusters initialized. initializing init manager
[2023-07-31 18:07:33.379][1][info][upstream] [source/extensions/listener_managers/listener_manager/lds_api.cc:79] lds: add/update listener 'public_listener:0.0.0.0:31673'
[2023-07-31 18:07:33.380][1][info][upstream] [source/extensions/listener_managers/listener_manager/lds_api.cc:79] lds: add/update listener 'weba:127.0.0.1:8901'
[2023-07-31 18:07:33.380][1][info][config] [source/extensions/listener_managers/listener_manager/listener_manager_impl.cc:858] all dependencies initialized. starting workers

Hi @ersch,

I suspect your service mesh connectivity is failing because of the named port used in job.group.service.port in the webajob.

Can you change the port to a hard-coded value (static 8201) and see if it works?

job "webajob" {
  consul_token = "723c8e29-1d9c-ff6a-0112-03da18e2b21b"
  group "webagroup" {
    ...
    service {
      name = "weba"
      port = "8201"     # hard-code the port here
    ...
    }

This is documented here: Consul Service Mesh | Nomad | HashiCorp Developer.

The port in the service block is the port the API service listens on. The Envoy proxy will automatically route traffic to that port inside the network namespace. Note that currently this cannot be a named port; it must be a hard-coded port value.
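After redeploying with the hard-coded port, the mesh path can be re-tested end to end. A minimal sketch, assuming the filenames from this thread; <webb-alloc-id> is a placeholder you would take from nomad job status webbjob:

```shell
# Re-deploy weba with the hard-coded service port
nomad job run weba.nomad

# Re-test the upstream from inside the webb task:
# this should now reach weba through the sidecar on the local_bind_port
nomad alloc exec -task webbtask <webb-alloc-id> curl -s localhost:8901
```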

For future issues, it will also help if you share your Consul and Nomad configuration.

I hope this helps.


Thanks, that solved it.

Any idea why it didn’t show up in the logs?

It will appear in weba’s Envoy logs, provided you increase the log level to DEBUG.

You can do this by exec’ing into the weba alloc, and running the following command:

curl 127.0.0.2:19001/logging?level=debug -X POST
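For completeness, a sketch of the full sequence (the admin address 127.0.0.2:19001 comes from the Envoy startup log above; remember to restore the default level when you’re done, since debug logging is very verbose):

```shell
# Inside the weba allocation (e.g. via `nomad alloc exec`):
# raise the Envoy log level through the admin interface
curl -X POST '127.0.0.2:19001/logging?level=debug'

# ... reproduce the failing request from webb, then read the sidecar logs ...

# restore the default level afterwards
curl -X POST '127.0.0.2:19001/logging?level=info'
```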

With the above change, when you initiate a connection from webb via the local_bind_port, you will see the following error in weba’s Envoy log.

# extract, not full log
[2023-08-01 11:09:47.270][15][debug][connection] [source/common/network/connection_impl.cc:941] [C83] connecting to 127.0.0.1:27818
[2023-08-01 11:09:47.270][15][debug][connection] [source/common/network/connection_impl.cc:960] [C83] connection in progress
[2023-08-01 11:09:47.270][15][debug][connection] [source/common/network/connection_impl.cc:699] [C83] delayed connect error: 111
[2023-08-01 11:09:47.270][15][debug][connection] [source/common/network/connection_impl.cc:250] [C83] closing socket: 0
[2023-08-01 11:09:47.270][15][debug][pool] [source/common/conn_pool/conn_pool_base.cc:484] [C83] client disconnected, failure reason: delayed connect error: 111
[2023-08-01 11:09:47.270][15][debug][pool] [source/common/conn_pool/conn_pool_base.cc:454] invoking idle callbacks - is_draining_for_deletion_=false
[2023-08-01 11:09:47.515][1][debug][main] [source/server/server.cc:265] flushing stats

If you look at the clusters populated by Envoy, you will find that 127.0.0.1:27818 is the address of your weba app, where 27818 is the dynamic port Nomad allocated for the http port you defined in the job spec. The connection fails because that dynamic port doesn’t exist inside the network namespace where Envoy runs.

/# curl -s 127.0.0.2:19001/clusters | grep local_app | grep hostname
local_app::127.0.0.1:27818::hostname::
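The same admin interface exposes a few other standard Envoy endpoints that are useful for this kind of debugging, queried here against the sidecar’s admin address from this thread:

```shell
# What Envoy is actually listening on (should include the local_bind_port listener)
curl -s 127.0.0.2:19001/listeners

# Upstream endpoints per cluster, including local_app's target address
curl -s 127.0.0.2:19001/clusters

# The full running configuration Envoy received over xDS
curl -s 127.0.0.2:19001/config_dump
```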

I hope this helps.


Thanks! I was able to run the commands. Much appreciated!