"Connection refused" to local Envoy sidecar

I’m about to lose my mind.

My downstream service, for whatever reason, cannot connect to its Envoy sidecar to reach the upstream:

time="2020-09-23T22:46:47Z" level=info msg="Connecting to database at: bolt://127.0.0.1:7687"
time="2020-09-23T22:46:47Z" level=warning msg="Connection error: dial tcp 127.0.0.1:7687: connect: connection refused"

Here are my service definitions in Nomad:

Upstream:

    service {
      name = "${BASE}-bolt-internal"
      port = "bolt"
      connect {
        sidecar_service {}
      }
    }

Downstream:

    service {
      name = "${BASE}-api"
      port = "web"
      check {
        type     = "http"
        port     = "web"
        path     = "/"
        interval = "60s"
        timeout  = "10s"
      }
      connect {
        sidecar_service {
          proxy {
            upstreams {
              destination_name = "jnexus-database-group-bolt-internal"
              local_bind_port  = 7687
            }
          }
        }
      }
    }

Sidecar logs from the downstream don’t show anything abnormal. The listener starts and then the sidecar gets terminated because the main process dies.

[2020-09-24 02:27:32.674][1][info][main] [source/server/server.cc:500] all clusters initialized. initializing init manager
[2020-09-24 02:27:32.681][1][info][upstream] [source/server/lds_api.cc:60] lds: add/update listener 'public_listener:0.0.0.0:31919'
[2020-09-24 02:27:32.682][1][info][upstream] [source/server/lds_api.cc:60] lds: add/update listener 'jnexus-database-group-bolt-internal:127.0.0.1:7687'
[2020-09-24 02:27:32.682][1][info][config] [source/server/listener_manager_impl.cc:761] all dependencies initialized. starting workers
[2020-09-24 02:28:15.264][1][warning][main] [source/server/server.cc:468] caught SIGINT
[2020-09-24 02:28:15.264][1][info][main] [source/server/server.cc:567] shutting down server instance
[2020-09-24 02:28:15.264][1][info][main] [source/server/server.cc:521] main dispatch loop exited
[2020-09-24 02:28:15.265][1][info][main] [source/server/server.cc:560] exiting

Are you sure that ${BASE} would resolve to jnexus-database-group?
Why not try a static name first, to see whether it will work or not?

Nope, same problem.

I should also mention that I have Consul ACLs enabled and intentions configured between the services.

Did you give Nomad enough access to register the services in Consul?

You have to create a Consul token for the Nomad servers and clients.

Example

    agent_prefix "" {
      policy = "read"
    }

    node_prefix "" {
      policy = "read"
    }

    service_prefix "" {
      policy = "write"
    }

    acl = "write"

Read it here.

Alternatively, are you 100% sure that the shutdown is caused by the service not being reachable on port 7687? Some other problem might be causing the SIGINT.

Yes. The upstream shows as registered in Consul and passes health checks, and Nomad shows that the sidecar is getting terminated because the task it’s attached to has failed.

Boy, was this an adventure to figure out.

TL;DR - Don’t mix Nomad’s and Docker’s bridge modes - they’re different.

I had followed this guide, which details how to configure Consul DNS to work inside Docker containers by creating a dummy network interface that can be passed to the Docker --dns option.

The --dns option works just fine with Docker’s bridge mode. It does not work with Nomad’s bridge mode - because Nomad’s bridge mode sets the Docker networking mode to none so the CNI plugin can be used instead. If you try and specify config.dns_servers for a Nomad task that’s operating in Nomad bridge mode, you’ll get an error from Docker saying that dns_servers is incompatible with your networking mode.

So of course, when I got this error, I simply set Docker’s config.network_mode to bridge for the task, the error disappeared, and I went on my merry way.

Unbeknownst to me, this had switched me away from using the CNI plugin, and I was now in Docker’s bridge mode in the default bridge network.

And, well - in that mode, 127.0.0.1 is not shared between containers… so when my service tried to reach the sidecar, it was trying to connect back to its own container, where Envoy isn’t running.
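
For anyone landing here with the same symptom, this is roughly the shape of a group that stays in Nomad’s bridge mode. It is a sketch rather than my exact job file: the job, group, and task names, the datacenter, the image, and the container port 8080 are all placeholders. The point is that the bridge network is declared at the group level, and the Docker config block sets neither network_mode nor dns_servers.

    job "example" {
      datacenters = ["dc1"]              # placeholder

      group "api" {
        # Nomad's bridge mode: one network namespace shared by every task
        # in the group, including the injected Envoy sidecar.
        network {
          mode = "bridge"
          port "web" {
            to = 8080                    # placeholder container port
          }
        }

        service {
          name = "example-api"           # placeholder
          port = "web"
          connect {
            sidecar_service {
              proxy {
                upstreams {
                  destination_name = "jnexus-database-group-bolt-internal"
                  local_bind_port  = 7687
                }
              }
            }
          }
        }

        task "api" {
          driver = "docker"
          config {
            image = "example/api:latest" # placeholder
            # Deliberately no network_mode and no dns_servers here:
            # setting network_mode = "bridge" moves the container onto
            # Docker's default bridge network, away from the sidecar.
          }
        }
      }
    }

With this layout the task and the sidecar share a loopback interface, so 127.0.0.1:7687 inside the task reaches Envoy’s upstream listener.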