Single Consul Datacenter in Multiple Kubernetes Clusters Connection Failure

Hi Folks,
I’m hoping someone can give me a few pointers here!

I’ve been running Consul and Nomad in a homelab environment for the last year for service discovery and monitoring, and it’s all been working fine.

I’m in the process of introducing Kubernetes into the mix and have hit a wall with the static-server and static-client test deployment.

Due to my inexperience with the Connect service mesh, I can’t tell if it’s an issue with Consul in general or a specific issue with Connect / Envoy in my particular environment.

I’ve set up two fresh VMs on the same host to implement the Single Consul Datacenter in Multiple Kubernetes Clusters tutorial.

I have set them up as dc2 so as not to conflict with my existing Consul / Nomad setup, which is dc1. I have no intention of connecting them; dc2 is purely a test environment.

  • k3s-mesh-server is Cluster1 (with static-server and static-client)
  • k3s-mesh-client is Cluster2 (just static-client)

I’ve followed the tutorial practically verbatim, with a few tweaks to the Helm values to get the cloud auto-join to work.

Cluster1

global:
  datacenter: dc2
  tls:
    enabled: true
    enableAutoEncrypt: true
  acls:
    manageSystemACLs: true
  gossipEncryption:
    secretName: consul-gossip-encryption-key
    secretKey: key
connectInject:
  enabled: true
  replicas: 1
  logLevel: debug
controller:
  enabled: true
  replicas: 1
server:
  enabled: true
  replicas: 1
  storage: 2Gi
  # exposed because the k8s auto-joiner in the other cluster needs to connect to these ports
  exposeGossipAndRPCPorts: true
  ports:
    serflan:
      port: 9301
  annotations: |
    "consul.hashicorp.com/auto-join-port": "9301"
ui:
  service:
    type: NodePort
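For reference, the gossip encryption secret referenced above was created before installing the chart, along these lines (a sketch; it assumes the chart goes into the `consul` namespace and the `consul` binary is available locally):

```shell
# Generate a gossip encryption key and store it where the Helm values
# expect it (secret "consul-gossip-encryption-key", data key "key").
kubectl create namespace consul
kubectl -n consul create secret generic consul-gossip-encryption-key \
  --from-literal=key="$(consul keygen)"
```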

Cluster2

global:
  enabled: false
  datacenter: dc2
  acls:
    manageSystemACLs: true
    bootstrapToken:
      secretName: cluster1-consul-bootstrap-acl-token
      secretKey: token
  gossipEncryption:
    secretName: consul-gossip-encryption-key
    secretKey: key
  tls:
    enabled: true
    enableAutoEncrypt: true
    caCert:
      secretName: cluster1-consul-ca-cert
      secretKey: tls.crt
externalServers:
  enabled: true
  # This should be any node IP of the first k8s cluster 
  hosts: ["192.168.178.103"]
  # The node port of the UI's NodePort service
  httpsPort: 32033
  tlsServerName: server.dc2.consul
  # The address of the kube API server of this Kubernetes cluster
  k8sAuthMethodHost: https://k3s-mesh-client:6443
client:
  enabled: true
  exposeGossipPorts: true
  join: ["provider=k8s kubeconfig=/consul/userconfig/cluster1-kubeconfig/kubeconfig host_network=true namespace=\"consul\" label_selector=\"app=consul,component=server\""]
  extraConfig: |
              {
                "log_level": "TRACE"
              }
  extraVolumes:
    - type: secret
      name: cluster1-kubeconfig
      load: false
connectInject:
  enabled: true
  replicas: 1
  logLevel: debug
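For completeness, the secrets Cluster2 references above (CA certificate, bootstrap ACL token, gossip key) have to exist in Cluster2 before installing the chart. I copied them over from Cluster1 along these lines (a sketch; the kubeconfig context names `cluster1`/`cluster2` are placeholders for your own):

```shell
# Re-create a secret from Cluster1 in Cluster2; piping the full object
# through "apply" carries stale metadata, so extract the data explicitly.
# Shown for the CA certificate; the bootstrap token and gossip key follow
# the same pattern with their own secret and data-key names.
kubectl --context cluster1 -n consul get secret cluster1-consul-ca-cert \
  -o jsonpath='{.data.tls\.crt}' | base64 -d > ca.crt
kubectl --context cluster2 -n consul create secret generic cluster1-consul-ca-cert \
  --from-file=tls.crt=ca.crt
```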

The client connects to the server, with some complaints:

2022-04-28T11:19:47.136Z [TRACE] agent.tlsutil: IncomingHTTPSConfig: version=5
2022-04-28T11:19:47.138Z [DEBUG] agent.http: Request finished: method=GET url=/v1/status/leader from=127.0.0.1:48148 latency=332.384µs
2022-04-28T11:19:47.271Z [DEBUG] agent.client.memberlist.lan: memberlist: Stream connection from=192.168.178.103:1605
2022-04-28T11:19:47.496Z [DEBUG] agent.client.memberlist.lan: memberlist: Stream connection from=192.168.178.103:13497
2022-04-28T11:19:47.635Z [DEBUG] agent.client.memberlist.lan: memberlist: Failed ping: cluster1-consul-server-0 (timeout reached)
2022-04-28T11:19:48.135Z [WARN]  agent.client.memberlist.lan: memberlist: Was able to connect to cluster1-consul-server-0 but other probes failed, network may be misconfigured
2022-04-28T11:19:49.272Z [DEBUG] agent.client.memberlist.lan: memberlist: Stream connection from=192.168.178.103:62091
2022-04-28T11:19:49.635Z [DEBUG] agent.client.memberlist.lan: memberlist: Failed ping: cluster1-consul-server-0 (timeout reached)
2022-04-28T11:19:49.638Z [DEBUG] agent.client.memberlist.lan: memberlist: Initiating push/pull sync with: cluster1-consul-server-0 192.168.178.103:9301
2022-04-28T11:19:49.638Z [INFO]  agent.client.serf.lan: serf: EventMemberJoin: k3s-mesh-server 10.42.0.79

Despite this, the CLI lists them as connected:

Node                      Address               Status  Type    Build   Protocol  DC   Partition  Segment
cluster1-consul-server-0  192.168.178.103:9301  alive   server  1.12.0  2         dc2  default    <all>
k3s-mesh-client           192.168.178.40:8301   alive   client  1.12.0  2         dc2  default    <default>
k3s-mesh-server           10.42.0.79:8301       alive   client  1.12.0  2         dc2  default    <default>
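The “Was able to connect … but other probes failed” warning is apparently the classic symptom of Serf’s UDP probes being dropped while TCP gets through, so one thing worth probing from the Cluster2 node is the exposed serf port over both protocols (assuming the OpenBSD netcat is installed):

```shell
# TCP check against the exposed serflan port on the Cluster1 node:
nc -vz 192.168.178.103 9301
# UDP check; UDP has no handshake, so this only fails outright if
# something actively rejects the packet (ICMP port unreachable):
nc -vzu 192.168.178.103 9301
```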

So I ploughed on and deployed a static-server to Cluster1 and a static-client to each of Cluster1 and Cluster2. On both clusters these are in a namespace called testing.

All three show up correctly in the Consul UI.


Despite this, curl fails on both clients in the same manner:

 $ curl localhost:1234 -vv
*   Trying 127.0.0.1:1234...
* Connected to localhost (127.0.0.1) port 1234 (#0)
> GET / HTTP/1.1
> Host: localhost:1234
> User-Agent: curl/7.83.0-DEV
> Accept: */*
> 
* Recv failure: Connection reset by peer
* Closing connection 0
curl: (56) Recv failure: Connection reset by peer
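Since `manageSystemACLs: true` turns on ACLs with a default-deny policy, intentions are one thing I double-checked (a sketch; it assumes a `CONSUL_HTTP_TOKEN` with sufficient privileges is exported in the shell):

```shell
# Ask Consul whether Connect would authorize this source/destination pair:
consul intention check static-client static-server
# If the answer is "Denied", allow it explicitly:
consul intention create static-client static-server
```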

Envoy trace log excerpt:

[2022-04-28 11:50:34.650][13][debug][connection] [source/common/network/connection_impl.cc:912] [C1496] connecting to 127.0.0.1:0
[2022-04-28 11:50:34.650][13][debug][connection] [source/common/network/connection_impl.cc:931] [C1496] connection in progress
[2022-04-28 11:50:34.650][13][trace][pool] [source/common/conn_pool/conn_pool_base.cc:131] not creating a new connection, shouldCreateNewConnection returned false.
[2022-04-28 11:50:34.650][13][debug][conn_handler] [source/server/active_tcp_listener.cc:142] [C1495] new connection from 10.42.0.57:51836
[2022-04-28 11:50:34.650][13][trace][connection] [source/common/network/connection_impl.cc:563] [C1495] socket event: 2
[2022-04-28 11:50:34.650][13][trace][connection] [source/common/network/connection_impl.cc:674] [C1495] write ready
[2022-04-28 11:50:34.650][13][trace][connection] [source/extensions/transport_sockets/tls/ssl_handshaker.cc:52] [C1495] ssl error occurred while read: SYSCALL
[2022-04-28 11:50:34.650][13][debug][connection] [source/common/network/connection_impl.cc:250] [C1495] closing socket: 0
[2022-04-28 11:50:34.650][13][trace][connection] [source/common/network/connection_impl.cc:418] [C1495] raising connection event 0
[2022-04-28 11:50:34.650][13][trace][filter] [source/common/tcp_proxy/tcp_proxy.cc:587] [C1495] on downstream event 0, has upstream = true
[2022-04-28 11:50:34.650][13][debug][pool] [source/common/conn_pool/conn_pool_base.cc:600] cancelling pending stream
[2022-04-28 11:50:34.650][13][debug][connection] [source/common/network/connection_impl.cc:139] [C1496] closing data_to_write=0 type=1
[2022-04-28 11:50:34.650][13][debug][connection] [source/common/network/connection_impl.cc:250] [C1496] closing socket: 1
[2022-04-28 11:50:34.650][13][trace][connection] [source/common/network/connection_impl.cc:418] [C1496] raising connection event 1
[2022-04-28 11:50:34.650][13][debug][pool] [source/common/conn_pool/conn_pool_base.cc:439] [C1496] client disconnected, failure reason: 
[2022-04-28 11:50:34.650][13][trace][main] [source/common/event/dispatcher_impl.cc:249] item added to deferred deletion list (size=1)
[2022-04-28 11:50:34.650][13][debug][pool] [source/common/conn_pool/conn_pool_base.cc:410] invoking idle callbacks - is_draining_for_deletion_=false
[2022-04-28 11:50:34.650][13][trace][upstream] [source/common/upstream/cluster_manager_impl.cc:1807] Idle pool, erasing pool for host 0x54603f486380
[2022-04-28 11:50:34.650][13][trace][main] [source/common/event/dispatcher_impl.cc:249] item added to deferred deletion list (size=2)
[2022-04-28 11:50:34.650][13][debug][pool] [source/common/conn_pool/conn_pool_base.cc:410] invoking idle callbacks - is_draining_for_deletion_=false
[2022-04-28 11:50:34.650][13][trace][conn_handler] [source/server/active_stream_listener_base.cc:111] [C1495] connection on event 0
[2022-04-28 11:50:34.650][13][debug][conn_handler] [source/server/active_stream_listener_base.cc:120] [C1495] adding to cleanup list
[2022-04-28 11:50:34.650][13][trace][main] [source/common/event/dispatcher_impl.cc:249] item added to deferred deletion list (size=3)
[2022-04-28 11:50:34.650][13][trace][main] [source/common/event/dispatcher_impl.cc:249] item added to deferred deletion list (size=4)
[2022-04-28 11:50:34.650][13][trace][main] [source/common/event/dispatcher_impl.cc:125] clearing deferred deletion list (size=4)

The only parts that really stick out to me are:

[2022-04-28 11:50:34.650][13][trace][connection] [source/extensions/transport_sockets/tls/ssl_handshaker.cc:52] [C1495] ssl error occurred while read: SYSCALL

and

[2022-04-28 11:50:34.650][13][debug][pool] [source/common/conn_pool/conn_pool_base.cc:439] [C1496] client disconnected, failure reason: 

but I’m at a loss as to what that means exactly, or whether it’s the culprit.
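To dig further into Envoy’s side, its admin interface (bound to `localhost:19000` in consul-k8s sidecars) can be queried from inside the client pod. The deployment and container names here match my test workloads but are assumptions about yours:

```shell
# Dump upstream cluster stats from the static-client sidecar; climbing
# cx_connect_fail or ssl error counters point at a TLS or connectivity
# problem on the path to static-server.
kubectl -n testing exec deploy/static-client -c static-client -- \
  curl -s localhost:19000/clusters | grep -E 'static-server.+(cx_connect_fail|ssl)'
```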

While the communication between clusters could fail for several reasons, I’d expect the static-server and static-client in the same cluster to function as expected.

Can anyone shed some light on this, or suggest specific troubleshooting I can perform?

OK, I deleted the whole stack and redeployed it. In the Helm values I changed the domain to:

global:
  domain: jtec
  datacenter: dc2

to ensure that there wasn’t some leakage between the two Consul deployments on the same network.

Both clusters have stopped complaining that:

agent.client.memberlist.lan: memberlist: Was able to connect to XXXX but other probes failed, network may be misconfigured

The static-client adjacent to the static-server on Cluster1 now performs as expected:

/ $ curl localhost:1234 -vv
*   Trying 127.0.0.1:1234...
* Connected to localhost (127.0.0.1) port 1234 (#0)
> GET / HTTP/1.1
> Host: localhost:1234
> User-Agent: curl/7.83.0-DEV
> Accept: */*
> 
* Mark bundle as not supporting multiuse
< HTTP/1.1 200 OK
< X-App-Name: http-echo
< X-App-Version: 0.2.3
< Date: Thu, 28 Apr 2022 14:32:07 GMT
< Content-Length: 14
< Content-Type: text/plain; charset=utf-8
< 
"hello world"
* Connection #0 to host localhost left intact

The static-client on Cluster2 is still drawing a blank:

/ $ curl localhost:1234 -vv
*   Trying 127.0.0.1:1234...
* Connected to localhost (127.0.0.1) port 1234 (#0)
> GET / HTTP/1.1
> Host: localhost:1234
> User-Agent: curl/7.83.0-DEV
> Accept: */*
> 
* Recv failure: Connection reset by peer
* Closing connection 0
curl: (56) Recv failure: Connection reset by peer

Hey @jmacdoo

That tutorial requires a flat node and pod network. Is that the case in your setup?
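One quick way to verify that: grab a pod IP from Cluster1 and try to reach it directly from a pod in Cluster2. If pod IPs aren’t routable across clusters without NAT, the network isn’t flat in the sense the tutorial needs (the context names, label selector, and app port 8080 here are assumptions about your setup):

```shell
# Pod IP of static-server in Cluster1:
kubectl --context cluster1 -n testing get pod -l app=static-server \
  -o jsonpath='{.items[0].status.podIP}'
# From a pod in Cluster2, try that IP directly (replace <pod-ip>);
# a timeout here means the pod networks are not flat:
kubectl --context cluster2 -n testing exec deploy/static-client -c static-client -- \
  curl -sv --max-time 5 http://<pod-ip>:8080/
```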

hi @ishustava1,

Correct. All k8s hosts are on a single subnet, 192.168.178.0/24, without any form of segmentation.

Hmmm, have I fallen at the first hurdle and misunderstood the meaning of “a flat node and pod network”? :laughing: