Hi Folks,
I’m hoping someone can give me a few pointers here!
I’ve been running Consul and Nomad in a homelab environment for the last year for service discovery and monitoring, and it’s all been working fine.
I’m in the process of introducing Kubernetes into the mix and have hit a wall with the static-server and static-client test deployment.
Due to my inexperience with the Connect service mesh I can’t tell if it’s an issue with Consul in general or a specific issue with Connect / Envoy in my particular environment.
I’ve set up two fresh VMs on the same host to implement the Single Consul Datacenter in Multiple Kubernetes Clusters tutorial.
I have set them up as dc2 so as not to conflict with my existing Consul / Nomad setup, which is dc1. I have no intention of connecting them; dc2 is purely a test environment.
- k3s-mesh-server is Cluster1 (with static-server and static-client)
- k3s-mesh-client is Cluster2 (just static-client)
I’ve followed the tutorial practically verbatim, with a few tweaks to the Helm values to get Cloud Auto-join to work.
Cluster1
global:
  datacenter: dc2
  tls:
    enabled: true
    enableAutoEncrypt: true
  acls:
    manageSystemACLs: true
  gossipEncryption:
    secretName: consul-gossip-encryption-key
    secretKey: key
connectInject:
  enabled: true
  replicas: 1
  logLevel: debug
controller:
  enabled: true
  replicas: 1
server:
  enabled: true
  replicas: 1
  storage: 2Gi
  ### adding this as the k8s joiner will require access to connect from a different cluster
  exposeGossipAndRPCPorts: true
  ports:
    serflan:
      port: 9301
  annotations: |
    "consul.hashicorp.com/auto-join-port": "9301"
ui:
  service:
    type: NodePort
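For context, this is roughly how Cluster1 was stood up. The release name cluster1 (which is where the cluster1-consul-* names come from), the consul namespace, and the values filename are my own choices rather than anything mandated by the tutorial, so treat this as a sketch:

```shell
# Gossip encryption key secret that both values files reference
# (namespace "consul" and the filename below are assumptions from my setup):
kubectl create namespace consul
kubectl create secret generic consul-gossip-encryption-key \
  --namespace consul \
  --from-literal=key="$(consul keygen)"

# Install the official chart on Cluster1 with the values above; the release
# name "cluster1" is what produces cluster1-consul-server-0 and the
# cluster1-consul-* secrets referenced further down.
helm repo add hashicorp https://helm.releases.hashicorp.com
helm install cluster1 hashicorp/consul \
  --namespace consul \
  --values cluster1-values.yaml
```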
Cluster2
global:
  enabled: false
  datacenter: dc2
  acls:
    manageSystemACLs: true
    bootstrapToken:
      secretName: cluster1-consul-bootstrap-acl-token
      secretKey: token
  gossipEncryption:
    secretName: consul-gossip-encryption-key
    secretKey: key
  tls:
    enabled: true
    enableAutoEncrypt: true
    caCert:
      secretName: cluster1-consul-ca-cert
      secretKey: tls.crt
externalServers:
  enabled: true
  # This should be any node IP of the first k8s cluster
  hosts: ["192.168.178.103"]
  # The node port of the UI's NodePort service
  httpsPort: 32033
  tlsServerName: server.dc2.consul
  # The address of the kube API server of this Kubernetes cluster
  k8sAuthMethodHost: https://k3s-mesh-client:6443
client:
  enabled: true
  exposeGossipPorts: true
  join: ["provider=k8s kubeconfig=/consul/userconfig/cluster1-kubeconfig/kubeconfig host_network=true namespace=\"consul\" label_selector=\"app=consul,component=server\""]
  extraConfig: |
    {
      "log_level": "TRACE"
    }
  extraVolumes:
    - type: secret
      name: cluster1-kubeconfig
      load: false
connectInject:
  enabled: true
  replicas: 1
  logLevel: debug
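The Cluster1 secrets that these values reference (CA cert, bootstrap ACL token, gossip key, plus the kubeconfig used for auto-join) were copied across before installing. Roughly, with assumed kube context names and filenames:

```shell
# Assumed contexts "cluster1" / "cluster2" and namespace "consul" on both sides.
kubectl --context cluster2 create namespace consul

# Recreate the CA cert and bootstrap ACL token generated by the Cluster1
# install under the names the Cluster2 values file expects:
CA_CERT=$(kubectl --context cluster1 -n consul get secret cluster1-consul-ca-cert \
  -o jsonpath='{.data.tls\.crt}' | base64 -d)
kubectl --context cluster2 -n consul create secret generic cluster1-consul-ca-cert \
  --from-literal=tls.crt="$CA_CERT"

BOOT_TOKEN=$(kubectl --context cluster1 -n consul get secret cluster1-consul-bootstrap-acl-token \
  -o jsonpath='{.data.token}' | base64 -d)
kubectl --context cluster2 -n consul create secret generic cluster1-consul-bootstrap-acl-token \
  --from-literal=token="$BOOT_TOKEN"

# Same gossip key as Cluster1, plus the kubeconfig the clients use for
# cloud auto-join back to Cluster1 (the filename is an assumption):
GOSSIP_KEY=$(kubectl --context cluster1 -n consul get secret consul-gossip-encryption-key \
  -o jsonpath='{.data.key}' | base64 -d)
kubectl --context cluster2 -n consul create secret generic consul-gossip-encryption-key \
  --from-literal=key="$GOSSIP_KEY"
kubectl --context cluster2 -n consul create secret generic cluster1-kubeconfig \
  --from-file=kubeconfig=cluster1-kubeconfig.yaml

# Client-only release on Cluster2:
helm install cluster2 hashicorp/consul \
  --namespace consul \
  --values cluster2-values.yaml
```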
The client connects to the server, albeit with some complaints:
2022-04-28T11:19:47.136Z [TRACE] agent.tlsutil: IncomingHTTPSConfig: version=5
2022-04-28T11:19:47.138Z [DEBUG] agent.http: Request finished: method=GET url=/v1/status/leader from=127.0.0.1:48148 latency=332.384µs
2022-04-28T11:19:47.271Z [DEBUG] agent.client.memberlist.lan: memberlist: Stream connection from=192.168.178.103:1605
2022-04-28T11:19:47.496Z [DEBUG] agent.client.memberlist.lan: memberlist: Stream connection from=192.168.178.103:13497
2022-04-28T11:19:47.635Z [DEBUG] agent.client.memberlist.lan: memberlist: Failed ping: cluster1-consul-server-0 (timeout reached)
2022-04-28T11:19:48.135Z [WARN] agent.client.memberlist.lan: memberlist: Was able to connect to cluster1-consul-server-0 but other probes failed, network may be misconfigured
2022-04-28T11:19:49.272Z [DEBUG] agent.client.memberlist.lan: memberlist: Stream connection from=192.168.178.103:62091
2022-04-28T11:19:49.635Z [DEBUG] agent.client.memberlist.lan: memberlist: Failed ping: cluster1-consul-server-0 (timeout reached)
2022-04-28T11:19:49.638Z [DEBUG] agent.client.memberlist.lan: memberlist: Initiating push/pull sync with: cluster1-consul-server-0 192.168.178.103:9301
2022-04-28T11:19:49.638Z [INFO] agent.client.serf.lan: serf: EventMemberJoin: k3s-mesh-server 10.42.0.79
Despite this, the CLI (`consul members`) lists them all as alive:
Node                      Address               Status  Type    Build   Protocol  DC   Partition  Segment
cluster1-consul-server-0  192.168.178.103:9301  alive   server  1.12.0  2         dc2  default    <all>
k3s-mesh-client           192.168.178.40:8301   alive   client  1.12.0  2         dc2  default    <default>
k3s-mesh-server           10.42.0.79:8301       alive   client  1.12.0  2         dc2  default    <default>
So I ploughed on and deployed one static-server to Cluster1 and a static-client to each of Cluster1 and Cluster2. On both clusters these live in a namespace called testing.
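For reference, these are essentially the manifests from the Connect tutorial, trimmed down here for brevity (the real files also create the matching Services and ServiceAccounts, which the ACL setup needs). The upstream annotation on static-client is what is supposed to expose static-server on localhost:1234:

```shell
# Trimmed sketch of what was applied in the "testing" namespace
# (Services and ServiceAccounts for both apps omitted for brevity):
kubectl apply -n testing -f - <<'EOF'
apiVersion: apps/v1
kind: Deployment
metadata:
  name: static-server
spec:
  replicas: 1
  selector:
    matchLabels:
      app: static-server
  template:
    metadata:
      labels:
        app: static-server
      annotations:
        consul.hashicorp.com/connect-inject: "true"
    spec:
      serviceAccountName: static-server
      containers:
        - name: static-server
          image: hashicorp/http-echo:latest
          args: ["-text=hello world", "-listen=:8080"]
          ports:
            - containerPort: 8080
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: static-client
spec:
  replicas: 1
  selector:
    matchLabels:
      app: static-client
  template:
    metadata:
      labels:
        app: static-client
      annotations:
        consul.hashicorp.com/connect-inject: "true"
        # This is what wires static-server up to localhost:1234 in the client pod:
        consul.hashicorp.com/connect-service-upstreams: "static-server:1234"
    spec:
      serviceAccountName: static-client
      containers:
        - name: static-client
          image: curlimages/curl:latest
          command: ["/bin/sh", "-c", "while true; do sleep 30; done"]
EOF
```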
All three show up correctly in the Consul UI.
Despite this, curl fails on both clients in the same manner:
$ curl localhost:1234 -vv
* Trying 127.0.0.1:1234...
* Connected to localhost (127.0.0.1) port 1234 (#0)
> GET / HTTP/1.1
> Host: localhost:1234
> User-Agent: curl/7.83.0-DEV
> Accept: */*
>
* Recv failure: Connection reset by peer
* Closing connection 0
curl: (56) Recv failure: Connection reset by peer
Envoy trace log excerpt:
[2022-04-28 11:50:34.650][13][debug][connection] [source/common/network/connection_impl.cc:912] [C1496] connecting to 127.0.0.1:0
[2022-04-28 11:50:34.650][13][debug][connection] [source/common/network/connection_impl.cc:931] [C1496] connection in progress
[2022-04-28 11:50:34.650][13][trace][pool] [source/common/conn_pool/conn_pool_base.cc:131] not creating a new connection, shouldCreateNewConnection returned false.
[2022-04-28 11:50:34.650][13][debug][conn_handler] [source/server/active_tcp_listener.cc:142] [C1495] new connection from 10.42.0.57:51836
[2022-04-28 11:50:34.650][13][trace][connection] [source/common/network/connection_impl.cc:563] [C1495] socket event: 2
[2022-04-28 11:50:34.650][13][trace][connection] [source/common/network/connection_impl.cc:674] [C1495] write ready
[2022-04-28 11:50:34.650][13][trace][connection] [source/extensions/transport_sockets/tls/ssl_handshaker.cc:52] [C1495] ssl error occurred while read: SYSCALL
[2022-04-28 11:50:34.650][13][debug][connection] [source/common/network/connection_impl.cc:250] [C1495] closing socket: 0
[2022-04-28 11:50:34.650][13][trace][connection] [source/common/network/connection_impl.cc:418] [C1495] raising connection event 0
[2022-04-28 11:50:34.650][13][trace][filter] [source/common/tcp_proxy/tcp_proxy.cc:587] [C1495] on downstream event 0, has upstream = true
[2022-04-28 11:50:34.650][13][debug][pool] [source/common/conn_pool/conn_pool_base.cc:600] cancelling pending stream
[2022-04-28 11:50:34.650][13][debug][connection] [source/common/network/connection_impl.cc:139] [C1496] closing data_to_write=0 type=1
[2022-04-28 11:50:34.650][13][debug][connection] [source/common/network/connection_impl.cc:250] [C1496] closing socket: 1
[2022-04-28 11:50:34.650][13][trace][connection] [source/common/network/connection_impl.cc:418] [C1496] raising connection event 1
[2022-04-28 11:50:34.650][13][debug][pool] [source/common/conn_pool/conn_pool_base.cc:439] [C1496] client disconnected, failure reason:
[2022-04-28 11:50:34.650][13][trace][main] [source/common/event/dispatcher_impl.cc:249] item added to deferred deletion list (size=1)
[2022-04-28 11:50:34.650][13][debug][pool] [source/common/conn_pool/conn_pool_base.cc:410] invoking idle callbacks - is_draining_for_deletion_=false
[2022-04-28 11:50:34.650][13][trace][upstream] [source/common/upstream/cluster_manager_impl.cc:1807] Idle pool, erasing pool for host 0x54603f486380
[2022-04-28 11:50:34.650][13][trace][main] [source/common/event/dispatcher_impl.cc:249] item added to deferred deletion list (size=2)
[2022-04-28 11:50:34.650][13][debug][pool] [source/common/conn_pool/conn_pool_base.cc:410] invoking idle callbacks - is_draining_for_deletion_=false
[2022-04-28 11:50:34.650][13][trace][conn_handler] [source/server/active_stream_listener_base.cc:111] [C1495] connection on event 0
[2022-04-28 11:50:34.650][13][debug][conn_handler] [source/server/active_stream_listener_base.cc:120] [C1495] adding to cleanup list
[2022-04-28 11:50:34.650][13][trace][main] [source/common/event/dispatcher_impl.cc:249] item added to deferred deletion list (size=3)
[2022-04-28 11:50:34.650][13][trace][main] [source/common/event/dispatcher_impl.cc:249] item added to deferred deletion list (size=4)
[2022-04-28 11:50:34.650][13][trace][main] [source/common/event/dispatcher_impl.cc:125] clearing deferred deletion list (size=4)
The only parts that really stick out to me are:
[2022-04-28 11:50:34.650][13][trace][connection] [source/extensions/transport_sockets/tls/ssl_handshaker.cc:52] [C1495] ssl error occurred while read: SYSCALL
and
[2022-04-28 11:50:34.650][13][debug][pool] [source/common/conn_pool/conn_pool_base.cc:439] [C1496] client disconnected, failure reason:
but I’m at a loss as to what that means exactly, or whether it’s the culprit.
While the communication between clusters could fail for several reasons, I’d at least expect the static-server and static-client running in the same cluster to talk to each other.
Can anyone shed some light on this, or suggest specific troubleshooting I can perform?