"Connection refused" to local Envoy sidecar

jfcantu · September 24, 2020, 2:50am

I’m about to lose my mind.

My downstream service, for whatever reason, cannot connect to Envoy to reach the upstream:

time="2020-09-23T22:46:47Z" level=info msg="Connecting to database at: bolt://127.0.0.1:7687"
time="2020-09-23T22:46:47Z" level=warning msg="Connection error: dial tcp 127.0.0.1:7687: connect: connection refused"

Here are my service definitions in Nomad:

Upstream:

    service {
      name = "${BASE}-bolt-internal"
      port = "bolt"
      connect {
        sidecar_service {}
      }
    }

Downstream:

    service {
      name = "${BASE}-api"
      port = "web"
      check {
        type     = "http"
        port     = "web"
        path     = "/"
        interval = "60s"
        timeout  = "10s"
      }
      connect {
        sidecar_service {
          proxy {
            upstreams {
              destination_name = "jnexus-database-group-bolt-internal"
              local_bind_port  = 7687
            }
          }
        }
      }
    }

Sidecar logs from the downstream don’t show anything abnormal. The listener starts and then the sidecar gets terminated because the main process dies.

[2020-09-24 02:27:32.674][1][info][main] [source/server/server.cc:500] all clusters initialized. initializing init manager
[2020-09-24 02:27:32.681][1][info][upstream] [source/server/lds_api.cc:60] lds: add/update listener 'public_listener:0.0.0.0:31919'
[2020-09-24 02:27:32.682][1][info][upstream] [source/server/lds_api.cc:60] lds: add/update listener 'jnexus-database-group-bolt-internal:127.0.0.1:7687'
[2020-09-24 02:27:32.682][1][info][config] [source/server/listener_manager_impl.cc:761] all dependencies initialized. starting workers
[2020-09-24 02:28:15.264][1][warning][main] [source/server/server.cc:468] caught SIGINT
[2020-09-24 02:28:15.264][1][info][main] [source/server/server.cc:567] shutting down server instance
[2020-09-24 02:28:15.264][1][info][main] [source/server/server.cc:521] main dispatch loop exited
[2020-09-24 02:28:15.265][1][info][main] [source/server/server.cc:560] exiting

martinkingtw · September 24, 2020, 3:32am

Are you sure that ${BASE} would resolve to jnexus-database-group?
Why not try a static name first, to see whether it will work or not?

jfcantu · September 24, 2020, 4:31am

Nope, same problem.

I should also mention that I have Consul ACLs enabled and intentions configured between the services.

martinkingtw · September 24, 2020, 6:06am

Did you give Nomad enough access to register the services with Consul?

You have to create a token for the Nomad server and client.

Example

agent_prefix "" {
  policy = "read"
}

node_prefix "" {
  policy = "read"
}

service_prefix "" {
  policy = "write"
}

acl = "write"

Read it here.

Alternatively, are you 100% sure that the shutdown is caused by not finding the service on port 7687? Some other problem might be causing the SIGINT.

jfcantu · September 24, 2020, 1:44pm

Yes. The upstream shows as registered in Consul and passes health checks, and Nomad shows that the sidecar is getting terminated because the task it’s attached to has failed.

jfcantu · September 26, 2020, 4:06am

Boy, was this an adventure to figure out.

TL;DR - Don’t mix Nomad’s and Docker’s bridge modes - they’re different.

I had followed this guide, which details how to configure Consul DNS to work inside Docker containers by creating a dummy network interface that can be passed to the Docker --dns option.

The --dns option works just fine with Docker’s bridge mode. It does not work with Nomad’s bridge mode - because Nomad’s bridge mode sets the Docker networking mode to none so the CNI plugin can be used instead. If you try and specify config.dns_servers for a Nomad task that’s operating in Nomad bridge mode, you’ll get an error from Docker saying that dns_servers is incompatible with your networking mode.

So of course, when I got this error, I simply set Docker’s config.network_mode to bridge for the task, the error disappeared, and I went on my merry way.

Unbeknownst to me, this had switched me away from using the CNI plugin, and I was now in Docker’s bridge mode in the default bridge network.

And, well - in that mode, 127.0.0.1 is not shared between containers… so when my service tried to reach the sidecar, it was trying to connect back to its own container, where Envoy isn’t running.
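
For anyone landing here later, a rough sketch of what the working layout might look like. The group and task names, the image, the container port, and the DNS address are placeholders, not values from my actual job. The group-level dns block is an assumption as well: I believe Nomad 0.12 and later accept it in place of config.dns_servers, but check the docs for your version.

```hcl
# Hypothetical job fragment; names, image, and addresses are placeholders.
group "api" {
  network {
    # Nomad's bridge mode: the task and its Envoy sidecar share one
    # network namespace, and therefore one 127.0.0.1.
    mode = "bridge"

    port "web" {
      to = 8080 # assumed container port
    }

    # Assumption: a group-level dns block (Nomad 0.12+) stands in for
    # config.dns_servers on the task.
    dns {
      servers = ["169.254.1.1"] # assumed address of the dummy interface
    }
  }

  service {
    name = "${BASE}-api"
    port = "web"

    connect {
      sidecar_service {
        proxy {
          upstreams {
            destination_name = "jnexus-database-group-bolt-internal"
            local_bind_port  = 7687
          }
        }
      }
    }
  }

  task "api" {
    driver = "docker"

    config {
      image = "example/api:latest" # placeholder image
      # No network_mode and no dns_servers here. Setting
      # network_mode = "bridge" would put the task on Docker's default
      # bridge, where 127.0.0.1:7687 no longer reaches the sidecar.
    }
  }
}
```

The point is that the only bridge setting lives in the group's network block, and the Docker config leaves networking alone.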
