Single Consul Datacenter in Multiple Kubernetes Clusters Connection Failure

Hi Folks,
I’m hoping someone can give me a few pointers here!

I’ve been running Consul and Nomad in a homelab environment for the last year for service discovery and monitoring, and it has all been working fine.

I’m in the process of introducing Kubernetes into the mix and have hit a wall with the static-server and static-client test deployment.

Due to my inexperience with the Connect Service Mesh, I can’t tell if it’s an issue with Consul in general or a specific issue with Connect / Envoy in my particular environment.

I’ve set up two fresh VMs on the same host to implement the Single Consul Datacenter in Multiple Kubernetes Clusters tutorial.

I have set them up as dc2 so as not to conflict with my existing Consul / Nomad setup, which is dc1. I have no intention of connecting them. dc2 is purely a test environment.

  • k3s-mesh-server is Cluster1 (with static-server and static-client)
  • k3s-mesh-client is Cluster2 (just static-client)

I’ve followed the tutorial practically verbatim, with a few tweaks to the Helm values to get the Cloud Auto-join to work:


  datacenter: dc2
    enabled: true
    enableAutoEncrypt: true
    manageSystemACLs: true
    secretName: consul-gossip-encryption-key
    secretKey: key
  enabled: true
  replicas: 1
  logLevel: debug
  enabled: true
  replicas: 1
  enabled: true
  replicas: 1
  storage: 2Gi
### adding this as the k8s joiner will require access to connect from a different cluster
  exposeGossipAndRPCPorts: true
      port: 9301
  annotations: |
    "": "9301"
    type: NodePort


  enabled: false
  datacenter: dc2
    manageSystemACLs: true
      secretName: cluster1-consul-bootstrap-acl-token
      secretKey: token
    secretName: consul-gossip-encryption-key
    secretKey: key
    enabled: true
    enableAutoEncrypt: true
      secretName: cluster1-consul-ca-cert
      secretKey: tls.crt
  enabled: true
  # This should be any node IP of the first k8s cluster 
  hosts: [""]
  # The node port of the UI's NodePort service
  httpsPort: 32033
  tlsServerName: server.dc2.consul
  # The address of the kube API server of this Kubernetes cluster
  k8sAuthMethodHost: https://k3s-mesh-client:6443
  enabled: true
  exposeGossipPorts: true
  join: ["provider=k8s kubeconfig=/consul/userconfig/cluster1-kubeconfig/kubeconfig host_network=true namespace=\"consul\" label_selector=\"app=consul,component=server\""]
  extraConfig: |
                "log_level": "TRACE"
    - type: secret
      name: cluster1-kubeconfig
      load: false
  enabled: true
  replicas: 1
  logLevel: debug

The client connects to the server, with some complaints:

2022-04-28T11:19:47.136Z [TRACE] agent.tlsutil: IncomingHTTPSConfig: version=5
2022-04-28T11:19:47.138Z [DEBUG] agent.http: Request finished: method=GET url=/v1/status/leader from= latency=332.384µs
2022-04-28T11:19:47.271Z [DEBUG] agent.client.memberlist.lan: memberlist: Stream connection from=
2022-04-28T11:19:47.496Z [DEBUG] agent.client.memberlist.lan: memberlist: Stream connection from=
2022-04-28T11:19:47.635Z [DEBUG] agent.client.memberlist.lan: memberlist: Failed ping: cluster1-consul-server-0 (timeout reached)
2022-04-28T11:19:48.135Z [WARN]  agent.client.memberlist.lan: memberlist: Was able to connect to cluster1-consul-server-0 but other probes failed, network may be misconfigured
2022-04-28T11:19:49.272Z [DEBUG] agent.client.memberlist.lan: memberlist: Stream connection from=
2022-04-28T11:19:49.635Z [DEBUG] agent.client.memberlist.lan: memberlist: Failed ping: cluster1-consul-server-0 (timeout reached)
2022-04-28T11:19:49.638Z [DEBUG] agent.client.memberlist.lan: memberlist: Initiating push/pull sync with: cluster1-consul-server-0
2022-04-28T11:19:49.638Z [INFO]  agent.client.serf.lan: serf: EventMemberJoin: k3s-mesh-server
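
A note on the warning above: as far as I understand memberlist, it probes peers over UDP first and falls back to TCP, and it logs "Was able to connect to ... but other probes failed" when the TCP fallback succeeds while the UDP probes get no reply. That usually points at UDP on the gossip port being blocked or not forwarded. The sketch below only confirms the TCP half from a client node; UDP has no handshake, so it can't be verified the same way. The function name and the host in the usage note are placeholders, not something from the setup above.

```python
import socket


def tcp_reachable(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection performs the full TCP handshake, then we close it.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unroutable addresses.
        return False
```

For example, `tcp_reachable("<cluster1-node-ip>", 9301)` run from a Cluster2 node would check the port set in the Helm values above. If that returns True and the warning persists, UDP on the same port is the next thing to look at.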

Despite this, the CLI lists them as connected:

Node                      Address               Status  Type    Build   Protocol  DC   Partition  Segment
cluster1-consul-server-0  alive   server  1.12.0  2         dc2  default    <all>
k3s-mesh-client    alive   client  1.12.0  2         dc2  default    <default>
k3s-mesh-server        alive   client  1.12.0  2         dc2  default    <default>

So I ploughed on and deployed one static-server to Cluster1 and a static-client to each of Cluster1 and Cluster2. On both clusters these are in a namespace called testing.

All three show up correctly in the Consul UI.


Despite this, curl fails on both clients in the same manner:

 $ curl localhost:1234 -vv
*   Trying
* Connected to localhost ( port 1234 (#0)
> GET / HTTP/1.1
> Host: localhost:1234
> User-Agent: curl/7.83.0-DEV
> Accept: */*
* Recv failure: Connection reset by peer
* Closing connection 0
curl: (56) Recv failure: Connection reset by peer

Envoy trace log excerpt:

[2022-04-28 11:50:34.650][13][debug][connection] [source/common/network/] [C1496] connecting to
[2022-04-28 11:50:34.650][13][debug][connection] [source/common/network/] [C1496] connection in progress
[2022-04-28 11:50:34.650][13][trace][pool] [source/common/conn_pool/] not creating a new connection, shouldCreateNewConnection returned false.
[2022-04-28 11:50:34.650][13][debug][conn_handler] [source/server/] [C1495] new connection from
[2022-04-28 11:50:34.650][13][trace][connection] [source/common/network/] [C1495] socket event: 2
[2022-04-28 11:50:34.650][13][trace][connection] [source/common/network/] [C1495] write ready
[2022-04-28 11:50:34.650][13][trace][connection] [source/extensions/transport_sockets/tls/] [C1495] ssl error occurred while read: SYSCALL
[2022-04-28 11:50:34.650][13][debug][connection] [source/common/network/] [C1495] closing socket: 0
[2022-04-28 11:50:34.650][13][trace][connection] [source/common/network/] [C1495] raising connection event 0
[2022-04-28 11:50:34.650][13][trace][filter] [source/common/tcp_proxy/] [C1495] on downstream event 0, has upstream = true
[2022-04-28 11:50:34.650][13][debug][pool] [source/common/conn_pool/] cancelling pending stream
[2022-04-28 11:50:34.650][13][debug][connection] [source/common/network/] [C1496] closing data_to_write=0 type=1
[2022-04-28 11:50:34.650][13][debug][connection] [source/common/network/] [C1496] closing socket: 1
[2022-04-28 11:50:34.650][13][trace][connection] [source/common/network/] [C1496] raising connection event 1
[2022-04-28 11:50:34.650][13][debug][pool] [source/common/conn_pool/] [C1496] client disconnected, failure reason: 
[2022-04-28 11:50:34.650][13][trace][main] [source/common/event/] item added to deferred deletion list (size=1)
[2022-04-28 11:50:34.650][13][debug][pool] [source/common/conn_pool/] invoking idle callbacks - is_draining_for_deletion_=false
[2022-04-28 11:50:34.650][13][trace][upstream] [source/common/upstream/] Idle pool, erasing pool for host 0x54603f486380
[2022-04-28 11:50:34.650][13][trace][main] [source/common/event/] item added to deferred deletion list (size=2)
[2022-04-28 11:50:34.650][13][debug][pool] [source/common/conn_pool/] invoking idle callbacks - is_draining_for_deletion_=false
[2022-04-28 11:50:34.650][13][trace][conn_handler] [source/server/] [C1495] connection on event 0
[2022-04-28 11:50:34.650][13][debug][conn_handler] [source/server/] [C1495] adding to cleanup list
[2022-04-28 11:50:34.650][13][trace][main] [source/common/event/] item added to deferred deletion list (size=3)
[2022-04-28 11:50:34.650][13][trace][main] [source/common/event/] item added to deferred deletion list (size=4)
[2022-04-28 11:50:34.650][13][trace][main] [source/common/event/] clearing deferred deletion list (size=4)

The only parts that really stick out to me are:

[2022-04-28 11:50:34.650][13][trace][connection] [source/extensions/transport_sockets/tls/] [C1495] ssl error occurred while read: SYSCALL


[2022-04-28 11:50:34.650][13][debug][pool] [source/common/conn_pool/] [C1496] client disconnected, failure reason: 

but I’m at a loss as to what that means exactly, or whether it’s the culprit.

While the communication between clusters could fail for several reasons, I’d expect the static-server and static-client in the same cluster to function as expected.

Can anyone shed some light on this, or suggest specific troubleshooting I can perform?

OK, I deleted the whole stack and redeployed it. In the Helm values I changed the domain to

  domain: jtec
  datacenter: dc2

to ensure that there wasn’t some leakage between the two Consul deployments on the same network.

Both clusters have stopped logging this warning:

agent.client.memberlist.lan: memberlist: Was able to connect to XXXX but other probes failed, network may be misconfigured

The static-client adjacent to static-server on Cluster1 now performs as expected:

/ $ curl localhost:1234 -vv
*   Trying
* Connected to localhost ( port 1234 (#0)
> GET / HTTP/1.1
> Host: localhost:1234
> User-Agent: curl/7.83.0-DEV
> Accept: */*
* Mark bundle as not supporting multiuse
< HTTP/1.1 200 OK
< X-App-Name: http-echo
< X-App-Version: 0.2.3
< Date: Thu, 28 Apr 2022 14:32:07 GMT
< Content-Length: 14
< Content-Type: text/plain; charset=utf-8
"hello world"
* Connection #0 to host localhost left intact

The static-client on Cluster2 is still drawing a blank:

/ $ curl localhost:1234 -vv
*   Trying
* Connected to localhost ( port 1234 (#0)
> GET / HTTP/1.1
> Host: localhost:1234
> User-Agent: curl/7.83.0-DEV
> Accept: */*
* Recv failure: Connection reset by peer
* Closing connection 0
curl: (56) Recv failure: Connection reset by peer

Hey @jmacdoo

That tutorial requires a flat node and pod network. Is that the case in your setup?

hi @ishustava1,

Correct. All k8s hosts are on a single subnet, without any form of segmentation.

Hmmm, have I fallen at the first hurdle and misunderstood the meaning of “a flat node and pod network”? :laughing:
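
For anyone reading along: a flat pod network generally means pod IPs, not just node IPs, are directly routable between the clusters. Nodes sharing a subnet covers only the node half. One quick sanity check is whether the two clusters' pod CIDRs overlap, since overlapping ranges cannot be routed to each other. The sketch below assumes both clusters were installed with the k3s default pod CIDR of 10.42.0.0/16, which this thread does not confirm; swap in the real values from each cluster.

```python
import ipaddress


def cidrs_overlap(a: str, b: str) -> bool:
    """Return True if the two CIDR ranges share any addresses."""
    return ipaddress.ip_network(a).overlaps(ipaddress.ip_network(b))


# Assumption: both clusters use the k3s default pod CIDR (not confirmed above).
cluster1_pod_cidr = "10.42.0.0/16"
cluster2_pod_cidr = "10.42.0.0/16"

if cidrs_overlap(cluster1_pod_cidr, cluster2_pod_cidr):
    print("Pod CIDRs overlap: pod IPs cannot be routed between the clusters")
else:
    print("Pod CIDRs are distinct: routing between them is at least possible")
```

Distinct ranges are necessary but not sufficient: each cluster's nodes would still need routes to the other cluster's pod range.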