Consul service mesh connectivity issues

Hi All,

We experience issues trying to have 2 services (prometheus -> alertmanager) communicate over consul service mesh.

Our setup:

Using the official consul helm chart I have consul wan federation setup over 2 consul clusters running on 2 separate kubernetes clusters, each in a different datacenter. Wan federation works. I also enabled consul connect / service mesh. This however does not work. I have a prometheus pod (with a consul-connect-envoy-sidecar) in the one datacenter and an alertmanager pod in the other datacenter. I used annotations to enable and configure connect for those pods. In the consul ui I can see the prometheus pod has the backend pod as an upstream, in consul ui all healthchecks are green, i.e. for the client & backend services, for the sidecar proxies of those services and for the meshgateways.

In the logs I can see that the connection request from prometheus towards the alertmanager reaches the meshgateway in the datacenter on the alertmanager side and from there on reaches the consul-connect-envoy-sidecar inside the alertmanager pod.

meshgateway

[2019-11-08 17:19:48.962][29][debug][filter] [source/extensions/filters/listener/tls_inspector/tls_inspector.cc:72] tls inspector: new connection accepted
[2019-11-08 17:19:48.962][29][debug][filter] [source/extensions/filters/listener/tls_inspector/tls_inspector.cc:118] tls:onServerName(), requestedServerName: alertmanager.default.datacenter-B-gcp-europe-west6.internal.72300fbd-d88c-f97d-cd40-16502cb362ea.consul
[2019-11-08 17:19:48.962][29][debug][filter] [source/common/tcp_proxy/tcp_proxy.cc:200] [C70322] new tcp proxy session
[2019-11-08 17:19:48.962][29][debug][connection] [source/common/network/connection_impl.cc:101] [C70322] closing data_to_write=0 type=1
[2019-11-08 17:19:48.962][29][debug][connection] [source/common/network/connection_impl.cc:183] [C70322] closing socket: 1

consul-connect-envoy-sidecar

[2019-11-08 17:30:56.163][000001][debug][main] [source/server/server.cc:143] flushing stats
[2019-11-08 17:30:56.259][000012][debug][filter] [source/common/tcp_proxy/tcp_proxy.cc:200] [C263736] new tcp proxy session
[2019-11-08 17:30:56.259][000012][debug][filter] [source/common/tcp_proxy/tcp_proxy.cc:342] [C263736] Creating connection to cluster local_app
[2019-11-08 17:30:56.259][000012][debug][pool] [source/common/tcp/conn_pool.cc:80] creating a new connection
[2019-11-08 17:30:56.259][000012][debug][pool] [source/common/tcp/conn_pool.cc:371] [C263737] connecting
[2019-11-08 17:30:56.259][000012][debug][connection] [source/common/network/connection_impl.cc:634] [C263737] connecting to 127.0.0.1:9093
[2019-11-08 17:30:56.260][000012][debug][connection] [source/common/network/connection_impl.cc:643] [C263737] connection in progress
[2019-11-08 17:30:56.260][000012][debug][pool] [source/common/tcp/conn_pool.cc:106] queueing request due to no available connections
[2019-11-08 17:30:56.260][000012][debug][main] [source/server/connection_handler_impl.cc:257] [C263736] new connection
[2019-11-08 17:30:56.260][000012][debug][connection] [source/common/ssl/ssl_socket.cc:135] [C263736] handshake error: 2
[2019-11-08 17:30:56.261][000012][debug][connection] [source/common/network/connection_impl.cc:516] [C263737] connected
[2019-11-08 17:30:56.261][000012][debug][pool] [source/common/tcp/conn_pool.cc:292] [C263737] assigning connection
[2019-11-08 17:30:56.261][000012][debug][filter] [source/common/tcp_proxy/tcp_proxy.cc:540] TCP:onUpstreamEvent(), requestedServerName:
[2019-11-08 17:30:56.261][000012][debug][connection] [source/common/ssl/ssl_socket.cc:135] [C263736] handshake error: 5
[2019-11-08 17:30:56.261][000012][debug][connection] [source/common/network/connection_impl.cc:183] [C263736] closing socket: 0
[2019-11-08 17:30:56.261][000012][debug][connection] [source/common/network/connection_impl.cc:101] [C263737] closing data_to_write=0 type=0
[2019-11-08 17:30:56.261][000012][debug][connection] [source/common/network/connection_impl.cc:183] [C263737] closing socket: 1
[2019-11-08 17:30:56.261][000012][debug][pool] [source/common/tcp/conn_pool.cc:121] [C263737] client disconnected
[2019-11-08 17:30:56.261][000012][debug][main] [source/server/connection_handler_impl.cc:68] [C263736] adding to cleanup list
[2019-11-08 17:30:56.262][000012][debug][pool] [source/common/tcp/conn_pool.cc:245] [C263737] connection destroyed
[2019-11-08 17:31:01.163][000001][debug][main] [source/server/server.cc:143] flushing stats
[2019-11-08 17:31:06.164][000001][debug][main] [source/server/server.cc:143] flushing stats
[2019-11-08 17:31:06.262][000012][debug][filter] [source/common/tcp_proxy/tcp_proxy.cc:200] [C263738] new tcp proxy session
[2019-11-08 17:31:06.262][000012][debug][filter] [source/common/tcp_proxy/tcp_proxy.cc:342] [C263738] Creating connection to cluster local_app
[2019-11-08 17:31:06.262][000012][debug][pool] [source/common/tcp/conn_pool.cc:80] creating a new connection
[2019-11-08 17:31:06.262][000012][debug][pool] [source/common/tcp/conn_pool.cc:371] [C263739] connecting
[2019-11-08 17:31:06.262][000012][debug][connection] [source/common/network/connection_impl.cc:634] [C263739] connecting to 127.0.0.1:9093
[2019-11-08 17:31:06.262][000012][debug][connection] [source/common/network/connection_impl.cc:643] [C263739] connection in progress
[2019-11-08 17:31:06.262][000012][debug][pool] [source/common/tcp/conn_pool.cc:106] queueing request due to no available connections
[2019-11-08 17:31:06.262][000012][debug][main] [source/server/connection_handler_impl.cc:257] [C263738] new connection
[2019-11-08 17:31:06.262][000012][debug][connection] [source/common/ssl/ssl_socket.cc:135] [C263738] handshake error: 2
[2019-11-08 17:31:06.262][000012][debug][connection] [source/common/network/connection_impl.cc:516] [C263739] connected
[2019-11-08 17:31:06.262][000012][debug][pool] [source/common/tcp/conn_pool.cc:292] [C263739] assigning connection
[2019-11-08 17:31:06.262][000012][debug][filter] [source/common/tcp_proxy/tcp_proxy.cc:540] TCP:onUpstreamEvent(), requestedServerName:
[2019-11-08 17:31:06.262][000012][debug][connection] [source/common/ssl/ssl_socket.cc:135] [C263738] handshake error: 2
[2019-11-08 17:31:06.263][000012][debug][connection] [source/common/ssl/ssl_socket.cc:135] [C263738] handshake error: 5
[2019-11-08 17:31:06.263][000012][debug][connection] [source/common/network/connection_impl.cc:183] [C263738] closing socket: 0
[2019-11-08 17:31:06.263][000012][debug][connection] [source/common/network/connection_impl.cc:101] [C263739] closing data_to_write=0 type=0
[2019-11-08 17:31:06.263][000012][debug][connection] [source/common/network/connection_impl.cc:183] [C263739] closing socket: 1
[2019-11-08 17:31:06.263][000012][debug][pool] [source/common/tcp/conn_pool.cc:121] [C263739] client disconnected
[2019-11-08 17:31:06.263][000012][debug][main] [source/server/connection_handler_impl.cc:68] [C263738] adding to cleanup list
[2019-11-08 17:31:06.263][000012][debug][pool] [source/common/tcp/conn_pool.cc:245] [C263739] connection destroyed
[2019-11-08 17:31:11.165][000001][debug][main] [source/server/server.cc:143] flushing stats
[2019-11-08 17:31:16.165][000001][debug][main] [source/server/server.cc:143] flushing stats
[2019-11-08 17:31:16.264][000013][debug][filter] [source/common/tcp_proxy/tcp_proxy.cc:200] [C263740] new tcp proxy session
[2019-11-08 17:31:16.264][000013][debug][filter] [source/common/tcp_proxy/tcp_proxy.cc:342] [C263740] Creating connection to cluster local_app
[2019-11-08 17:31:16.264][000013][debug][pool] [source/common/tcp/conn_pool.cc:80] creating a new connection
[2019-11-08 17:31:16.264][000013][debug][pool] [source/common/tcp/conn_pool.cc:371] [C263741] connecting
[2019-11-08 17:31:16.264][000013][debug][connection] [source/common/network/connection_impl.cc:634] [C263741] connecting to 127.0.0.1:9093
[2019-11-08 17:31:16.265][000013][debug][connection] [source/common/network/connection_impl.cc:643] [C263741] connection in progress
[2019-11-08 17:31:16.265][000013][debug][pool] [source/common/tcp/conn_pool.cc:106] queueing request due to no available connections
[2019-11-08 17:31:16.265][000013][debug][main] [source/server/connection_handler_impl.cc:257] [C263740] new connection
[2019-11-08 17:31:16.265][000013][debug][connection] [source/common/ssl/ssl_socket.cc:135] [C263740] handshake error: 2
[2019-11-08 17:31:16.265][000013][debug][connection] [source/common/network/connection_impl.cc:516] [C263741] connected
[2019-11-08 17:31:16.265][000013][debug][pool] [source/common/tcp/conn_pool.cc:292] [C263741] assigning connection
[2019-11-08 17:31:16.265][000013][debug][filter] [source/common/tcp_proxy/tcp_proxy.cc:540] TCP:onUpstreamEvent(), requestedServerName:
[2019-11-08 17:31:16.265][000013][debug][connection] [source/common/ssl/ssl_socket.cc:135] [C263740] handshake error: 2
[2019-11-08 17:31:16.265][000013][debug][connection] [source/common/ssl/ssl_socket.cc:135] [C263740] handshake error: 5
[2019-11-08 17:31:16.265][000013][debug][connection] [source/common/network/connection_impl.cc:183] [C263740] closing socket: 0
[2019-11-08 17:31:16.266][000013][debug][connection] [source/common/network/connection_impl.cc:101] [C263741] closing data_to_write=0 type=0
[2019-11-08 17:31:16.266][000013][debug][connection] [source/common/network/connection_impl.cc:183] [C263741] closing socket: 1
[2019-11-08 17:31:16.266][000013][debug][pool] [source/common/tcp/conn_pool.cc:121] [C263741] client disconnected
[2019-11-08 17:31:16.266][000013][debug][main] [source/server/connection_handler_impl.cc:68] [C263740] adding to cleanup list
[2019-11-08 17:31:16.266][000013][debug][pool] [source/common/tcp/conn_pool.cc:245] [C263741] connection destroyed
[2019-11-08 17:31:21.167][000001][debug][main] [source/server/server.cc:143] flushing stats
[2019-11-08 17:31:26.168][000001][debug][main] [source/server/server.cc:143] flushing stats
[2019-11-08 17:31:26.267][000013][debug][filter] [source/common/tcp_proxy/tcp_proxy.cc:200] [C263742] new tcp proxy session
[2019-11-08 17:31:26.267][000013][debug][filter] [source/common/tcp_proxy/tcp_proxy.cc:342] [C263742] Creating connection to cluster local_app
[2019-11-08 17:31:26.267][000013][debug][pool] [source/common/tcp/conn_pool.cc:80] creating a new connection
[2019-11-08 17:31:26.267][000013][debug][pool] [source/common/tcp/conn_pool.cc:371] [C263743] connecting
[2019-11-08 17:31:26.267][000013][debug][connection] [source/common/network/connection_impl.cc:634] [C263743] connecting to 127.0.0.1:9093
[2019-11-08 17:31:26.268][000013][debug][connection] [source/common/network/connection_impl.cc:643] [C263743] connection in progress
[2019-11-08 17:31:26.268][000013][debug][pool] [source/common/tcp/conn_pool.cc:106] queueing request due to no available connections
[2019-11-08 17:31:26.268][000013][debug][main] [source/server/connection_handler_impl.cc:257] [C263742] new connection
[2019-11-08 17:31:26.268][000013][debug][connection] [source/common/ssl/ssl_socket.cc:135] [C263742] handshake error: 5
[2019-11-08 17:31:26.268][000013][debug][connection] [source/common/network/connection_impl.cc:183] [C263742] closing socket: 0
[2019-11-08 17:31:26.269][000013][debug][pool] [source/common/tcp/conn_pool.cc:213] canceling pending request
[2019-11-08 17:31:26.269][000013][debug][pool] [source/common/tcp/conn_pool.cc:221] canceling pending connection
[2019-11-08 17:31:26.269][000013][debug][connection] [source/common/network/connection_impl.cc:101] [C263743] closing data_to_write=0 type=1
[2019-11-08 17:31:26.269][000013][debug][connection] [source/common/network/connection_impl.cc:183] [C263743] closing socket: 1
[2019-11-08 17:31:26.269][000013][debug][pool] [source/common/tcp/conn_pool.cc:121] [C263743] client disconnected
[2019-11-08 17:31:26.269][000013][debug][main] [source/server/connection_handler_impl.cc:68] [C263742] adding to cleanup list
[2019-11-08 17:31:26.269][000013][debug][pool] [source/common/tcp/conn_pool.cc:245] [C263743] connection destroyed
[2019-11-08 17:31:31.169][000001][debug][main] [source/server/server.cc:143] flushing stats

There are no consul intentions blocking this.

What are we missing?

It looks like it’s getting a handshake error connecting to alertmanager running on 127.0.0.1:9093 which is weird because at this point the request should be over HTTP, not SSL.

If you kubectl exec into the alertmanager pod and run curl 127.0.0.1:9093/yourpath does it succeed?

Not behind a computer right now but yes that works.

A wget from the consul-connect-envoy-sidecar container in the alertmanager pod also works:

/ $ wget -qO- 127.0.0.1:9093/api/v1/alerts
{“status”:“success”,“data”:[{“labels”:{“alertname”:“CPUThrottlingHigh”,“container_name”:“prometheus-to-sd”,“namespace”:“kube-system”,“pod_name”:“prometheus-to-sd-6jbcd”,“prometheus”:“monitoring/rootlease-prometheus-opera-prometheus”,“severity”:“warning”},“annotations”:{“message”:“75% throttling of CPU in namespace kube-system for container prometheus-to-sd in pod prometheus-to-sd-6jbcd.”,“runbook_url”:“https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-cputhrottlinghigh”}

It looks like the handshake error is not directly (perhaps indirectly) related since they continue to happen after stopping the prometheus pod:

[2019-11-13 13:29:00.932][000014][debug][filter] [source/common/tcp_proxy/tcp_proxy.cc:200] [C72] new tcp proxy session
[2019-11-13 13:29:00.932][000014][debug][filter] [source/common/tcp_proxy/tcp_proxy.cc:342] [C72] Creating connection to cluster local_app
[2019-11-13 13:29:00.932][000014][debug][pool] [source/common/tcp/conn_pool.cc:80] creating a new connection
[2019-11-13 13:29:00.932][000014][debug][pool] [source/common/tcp/conn_pool.cc:371] [C73] connecting
[2019-11-13 13:29:00.932][000014][debug][connection] [source/common/network/connection_impl.cc:634] [C73] connecting to 127.0.0.1:9093
[2019-11-13 13:29:00.933][000014][debug][connection] [source/common/network/connection_impl.cc:643] [C73] connection in progress
[2019-11-13 13:29:00.933][000014][debug][pool] [source/common/tcp/conn_pool.cc:106] queueing request due to no available connections
[2019-11-13 13:29:00.933][000014][debug][main] [source/server/connection_handler_impl.cc:257] [C72] new connection
[2019-11-13 13:29:00.933][000014][debug][connection] [source/common/ssl/ssl_socket.cc:135] [C72] handshake error: 2
[2019-11-13 13:29:00.933][000014][debug][connection] [source/common/network/connection_impl.cc:516] [C73] connected
[2019-11-13 13:29:00.933][000014][debug][pool] [source/common/tcp/conn_pool.cc:292] [C73] assigning connection
[2019-11-13 13:29:00.933][000014][debug][filter] [source/common/tcp_proxy/tcp_proxy.cc:540] TCP:onUpstreamEvent(), requestedServerName:
[2019-11-13 13:29:00.933][000014][debug][connection] [source/common/ssl/ssl_socket.cc:135] [C72] handshake error: 2
[2019-11-13 13:29:00.935][000014][debug][connection] [source/common/ssl/ssl_socket.cc:135] [C72] handshake error: 5
[2019-11-13 13:29:00.935][000014][debug][connection] [source/common/network/connection_impl.cc:183] [C72] closing socket: 0
[2019-11-13 13:29:00.935][000014][debug][connection] [source/common/network/connection_impl.cc:101] [C73] closing data_to_write=0 type=0
[2019-11-13 13:29:00.935][000014][debug][connection] [source/common/network/connection_impl.cc:183] [C73] closing socket: 1
[2019-11-13 13:29:00.935][000014][debug][pool] [source/common/tcp/conn_pool.cc:121] [C73] client disconnected
[2019-11-13 13:29:00.935][000014][debug][main] [source/server/connection_handler_impl.cc:68] [C72] adding to cleanup list
[2019-11-13 13:29:00.935][000014][debug][pool] [source/common/tcp/conn_pool.cc:245] [C73] connection destroyed

It looks like we have those handshake for every pod were we enable connect. Any idea where this is coming from?

Could it be the result of some healthcheck?
In any case all checks for alertmanager are green in consul-ui. (See screenshot)

So, based on the log output, the last thing I know for sure is that our request reaches the meshgateway on the alertmanager side, I have no idea what happens after that nor whether a envoy proxy session is actually started, the logs don’t show anything about it:

[2019-11-13 13:55:03.703][37][debug][filter] [source/extensions/filters/listener/tls_inspector/tls_inspector.cc:72] tls inspector: new connection accepted
[2019-11-13 13:55:03.703][37][trace][filter] [source/extensions/filters/listener/tls_inspector/tls_inspector.cc:141] tls inspector: recv: 240
[2019-11-13 13:55:03.703][37][debug][filter] [source/extensions/filters/listener/tls_inspector/tls_inspector.cc:118] tls:onServerName(), requestedServerName: alertmanager.default.rootlease-gcp-europe-west6.internal.72300fbd-d88c-f97d-cd40-16502cb362ea.consul
[2019-11-13 13:55:03.703][37][trace][filter] [source/extensions/filters/listener/tls_inspector/tls_inspector.cc:162] tls inspector: done: true
[2019-11-13 13:55:03.703][37][debug][filter] [source/common/tcp_proxy/tcp_proxy.cc:200] [C13862] new tcp proxy session
[2019-11-13 13:55:03.703][37][trace][connection] [source/common/network/connection_impl.cc:282] [C13862] readDisable: enabled=true disable=true
[2019-11-13 13:55:03.703][37][trace][filter] [source/extensions/filters/network/sni_cluster/sni_cluster.cc:16] [C13862] sni_cluster: new connection with server name alertmanager.default.rootlease-gcp-europe-west6.internal.72300fbd-d88c-f97d-cd40-16502cb362ea.consul
[2019-11-13 13:55:03.703][37][debug][connection] [source/common/network/connection_impl.cc:101] [C13862] closing data_to_write=0 type=1
[2019-11-13 13:55:03.703][37][debug][connection] [source/common/network/connection_impl.cc:183] [C13862] closing socket: 1
[2019-11-13 13:55:03.703][37][debug][main] [source/server/connection_handler_impl.cc:257] [C13862] new connection
[2019-11-13 13:55:03.703][37][trace][main] [source/common/event/dispatcher_impl.cc:133] item added to deferred deletion list (size=1)

What would be the next step in the flow?