Problems with Consul Connect + Mesh Gateways

Hello,

I am trying to deploy Connect mesh gateways in our testing environment [https://learn.hashicorp.com/consul/developer-mesh/connect-gateways], with plans to move to production, but I have run into a problem that I cannot find a solution for online.
Setup:

  • Two WAN-connected datacenters [dc1, dc2]
  • ACL enabled + replication
  • TLS enabled
  • Connect enabled
  • Envoy mesh gateways deployed in dc1 and dc2.
  • Gateways can connect in both directions, and each reports the other as healthy in the Envoy checks.

We started local services for testing (socat) and registered them with the mesh gateway mode set to local (mesh-gateway: local). When we try to connect, no connection can be established. We inspected the debug logs of the four proxies, and on the local proxy for the socat client (socat-web in dc1) the upstream proxy is reported with health_flags::/failed_eds_health. All health checks in both Consul datacenters are passing, and everything else in the Envoy instances (gateway1, gateway2, etc.) is healthy; only the upstream is failing. We have been unable to resolve this, and we believe it is the cause of the problem, because every connection attempt produces “no healthy host for TCP connection pool” in the Envoy logs for the upstream proxy.
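For anyone reproducing this, the failing upstream can be picked out of the Envoy admin /clusters output by filtering for hosts whose health flags are not "healthy". The following is only a minimal Python sketch, not part of the setup described above: the function name unhealthy_hosts, the sample cluster name, and the addresses are made up for illustration, and it assumes the usual <cluster>::<address>::health_flags::<flags> line format.

```python
def unhealthy_hosts(clusters_text):
    """Return (cluster, address, flags) for each host not reported as healthy.

    Expects the plain-text output of Envoy's admin /clusters endpoint.
    """
    results = []
    for line in clusters_text.splitlines():
        # Only per-host health lines contain this marker.
        head, sep, flags = line.strip().partition("::health_flags::")
        if not sep or flags == "healthy":
            continue
        # Split on the first "::" only, so addresses such as [::1]:8080 survive.
        cluster, _, address = head.partition("::")
        # Flags are slash-separated, e.g. "/failed_active_hc/failed_eds_health".
        results.append((cluster, address, [f for f in flags.split("/") if f]))
    return results


# Hypothetical sample text; real cluster names and addresses will differ.
SAMPLE = """\
local_app::127.0.0.1:8181::health_flags::healthy
socat.default.dc2.internal.example.consul::10.0.0.5:8443::health_flags::/failed_eds_health
socat.default.dc2.internal.example.consul::10.0.0.5:8443::cx_active::0
"""

if __name__ == "__main__":
    for cluster, address, flags in unhealthy_hosts(SAMPLE):
        print(cluster, address, ",".join(flags))
```

In practice the text would come from the sidecar proxy's admin interface (the /clusters path on whatever admin address the proxy was started with) rather than from a hard-coded sample.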

I am uploading most of the configuration, with unimportant values removed.
The dc2 and secondary gateway configurations are not included, but they are nearly identical to dc1 apart from the IP addresses, etc.

consul-client.txt (775 Bytes)
consul-server.txt (1016 Bytes)
dc1-gateway.txt (273 Bytes)
socat.txt (297 Bytes)
socat-web.txt (377 Bytes)

Best regards, Kiril

Hi @ShadowSteps,

Can you share the Envoy proxy logs for the gateways and local proxies? Specifically, it would be helpful to see the messages that are output when you try to initiate a connection to the upstream service. That information may help with debugging.

Thanks.

Hi
I have the same issue as @ShadowSteps, with nearly the same settings.

Here are the logs, with both mesh-gateway and sidecar services run in debug mode:

mesh-gateway-primary.log.txt (51.6 KB)
mesh-gateway-secondary.log.txt (52.8 KB)
sidecar-socat-primary.log.txt (30.7 KB)
sidecar-web-secondary.log.txt (39.7 KB)

Apart from the “no healthy host for TCP connection pool” line, I do not know enough to interpret these logs…

If anybody can help, thanks in advance…

Additional info:

# consul --version
Consul v1.8.3
Revision a9322b9c7
Protocol 2 spoken by default, understands 2 to 3 (agent will automatically use protocol >2 when speaking to compatible agents)
# envoy --version

envoy  version: 1a0363c885c2dbb1e48b03847dbd706d1ba43eba/1.14.2/clean-getenvoy-fbeeb15-envoy/RELEASE/BoringSSL

Hey,

I was never able to resolve the problem in that setup, but later I got it working when I installed the mesh gateway on the same machine as the Consul server. That way Connect ran without problems, but whenever the mesh gateways and the servers were on different machines I had this problem.

Best regards, Kiril

Hi,
Thanks for your feedback.

For my part, I have the issue even when the mesh gateway and consul server are on the same machine.

I ran into the same issue, and after spending an unspecified amount of time troubleshooting, it turned out to be the ServiceResolver. You have to create a ServiceResolver for the upstream service (along with ProxyDefaults), and Envoy will then populate the IP address of the mesh gateway (MGW) in the /clusters API.
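As a rough illustration of what those two config entries could look like, here is a small Python sketch that renders them as JSON. The details are assumptions, not something confirmed in this thread: the service name "socat" and the gateway mode "local" are borrowed from the original post, the helper names are made up, and the resolver is the bare minimum (Kind and Name only), which may or may not be sufficient for a given setup.

```python
import json


def service_resolver(service_name):
    """Minimal service-resolver config entry for the upstream service."""
    return {"Kind": "service-resolver", "Name": service_name}


def proxy_defaults(gateway_mode="local"):
    """proxy-defaults entry; the name "global" is required for this kind."""
    return {
        "Kind": "proxy-defaults",
        "Name": "global",
        "MeshGateway": {"Mode": gateway_mode},
    }


def render(entry):
    """Serialize a config entry as JSON text."""
    return json.dumps(entry, indent=2)


if __name__ == "__main__":
    # "socat" is an assumed upstream service name taken from the first post.
    print(render(service_resolver("socat")))
    print(render(proxy_defaults("local")))
```

Each JSON document can be saved to its own file and applied with consul config write <file>. Afterwards, the upstream entry in the sidecar's /clusters output should show the mesh gateway address if the resolver was indeed what was missing.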