Hi Folks,
I’m hoping someone can give me a few pointers here!
I’ve been running Consul and Nomad in a homelab environment for the last year for service discovery and monitoring, and it’s all been working fine.
I’m in the process of introducing Kubernetes into the mix and have hit a wall with the static-server and static-client test deployment.
Due to my inexperience with the Connect service mesh I can’t tell if it’s an issue with Consul in general or a specific issue with Connect / Envoy in my particular environment.
I’ve set up two fresh VMs on the same host to implement the Single Consul Datacenter in Multiple Kubernetes Clusters tutorial.
I have set them up as dc2 so as not to conflict with my existing Consul / Nomad setup, which is dc1. I have no intention of connecting them; dc2 is purely a test environment.
- k3s-mesh-server is Cluster1 (with static-server and static-client)
- k3s-mesh-client is Cluster2 (just static-client)
I’ve followed the tutorial practically verbatim, with a few tweaks to the Helm values to get Cloud Auto-join to work.
Cluster1
global:
  datacenter: dc2
  tls:
    enabled: true
    enableAutoEncrypt: true
  acls:
    manageSystemACLs: true
  gossipEncryption:
    secretName: consul-gossip-encryption-key
    secretKey: key
connectInject:
  enabled: true
  replicas: 1
  logLevel: debug
controller:
  enabled: true
  replicas: 1
server:
  enabled: true
  replicas: 1
  storage: 2Gi
  ### adding this as the k8s joiner will require access to connect from a different cluster
  exposeGossipAndRPCPorts: true
  ports:
    serflan:
      port: 9301
  annotations: |
    "consul.hashicorp.com/auto-join-port": "9301"
ui:
  service:
    type: NodePort
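For context, this is roughly how Cluster1 was stood up. The release name cluster1 (which is where the cluster1-consul-* names come from), the consul namespace, and the values filename are my own choices rather than anything mandated by the tutorial, so treat this as a sketch:

```shell
# Gossip encryption key secret that both values files reference
# (namespace "consul" and the filename below are assumptions from my setup):
kubectl create namespace consul
kubectl create secret generic consul-gossip-encryption-key \
  --namespace consul \
  --from-literal=key="$(consul keygen)"

# Install the official chart on Cluster1 with the values above; the release
# name "cluster1" is what produces cluster1-consul-server-0 and the
# cluster1-consul-* secrets referenced further down.
helm repo add hashicorp https://helm.releases.hashicorp.com
helm install cluster1 hashicorp/consul \
  --namespace consul \
  --values cluster1-values.yaml
```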
Cluster2
global:
  enabled: false
  datacenter: dc2
  acls:
    manageSystemACLs: true
    bootstrapToken:
      secretName: cluster1-consul-bootstrap-acl-token
      secretKey: token
  gossipEncryption:
    secretName: consul-gossip-encryption-key
    secretKey: key
  tls:
    enabled: true
    enableAutoEncrypt: true
    caCert:
      secretName: cluster1-consul-ca-cert
      secretKey: tls.crt
externalServers:
  enabled: true
  # This should be any node IP of the first k8s cluster
  hosts: ["192.168.178.103"]
  # The node port of the UI's NodePort service
  httpsPort: 32033
  tlsServerName: server.dc2.consul
  # The address of the kube API server of this Kubernetes cluster
  k8sAuthMethodHost: https://k3s-mesh-client:6443
client:
  enabled: true
  exposeGossipPorts: true
  join: ["provider=k8s kubeconfig=/consul/userconfig/cluster1-kubeconfig/kubeconfig host_network=true namespace=\"consul\" label_selector=\"app=consul,component=server\""]
  extraConfig: |
    {
      "log_level": "TRACE"
    }
  extraVolumes:
    - type: secret
      name: cluster1-kubeconfig
      load: false
connectInject:
  enabled: true
  replicas: 1
  logLevel: debug
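The Cluster1 secrets that these values reference (CA cert, bootstrap ACL token, gossip key, plus the kubeconfig used for auto-join) were copied across before installing. Roughly, with assumed kube context names and filenames:

```shell
# Assumed contexts "cluster1" / "cluster2" and namespace "consul" on both sides.
kubectl --context cluster2 create namespace consul

# Recreate the CA cert and bootstrap ACL token generated by the Cluster1
# install under the names the Cluster2 values file expects:
CA_CERT=$(kubectl --context cluster1 -n consul get secret cluster1-consul-ca-cert \
  -o jsonpath='{.data.tls\.crt}' | base64 -d)
kubectl --context cluster2 -n consul create secret generic cluster1-consul-ca-cert \
  --from-literal=tls.crt="$CA_CERT"

BOOT_TOKEN=$(kubectl --context cluster1 -n consul get secret cluster1-consul-bootstrap-acl-token \
  -o jsonpath='{.data.token}' | base64 -d)
kubectl --context cluster2 -n consul create secret generic cluster1-consul-bootstrap-acl-token \
  --from-literal=token="$BOOT_TOKEN"

# Same gossip key as Cluster1, plus the kubeconfig the clients use for
# cloud auto-join back to Cluster1 (the filename is an assumption):
GOSSIP_KEY=$(kubectl --context cluster1 -n consul get secret consul-gossip-encryption-key \
  -o jsonpath='{.data.key}' | base64 -d)
kubectl --context cluster2 -n consul create secret generic consul-gossip-encryption-key \
  --from-literal=key="$GOSSIP_KEY"
kubectl --context cluster2 -n consul create secret generic cluster1-kubeconfig \
  --from-file=kubeconfig=cluster1-kubeconfig.yaml

# Client-only release on Cluster2:
helm install cluster2 hashicorp/consul \
  --namespace consul \
  --values cluster2-values.yaml
```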
The client connects to the server, albeit with some complaints:
2022-04-28T11:19:47.136Z [TRACE] agent.tlsutil: IncomingHTTPSConfig: version=5
2022-04-28T11:19:47.138Z [DEBUG] agent.http: Request finished: method=GET url=/v1/status/leader from=127.0.0.1:48148 latency=332.384µs
2022-04-28T11:19:47.271Z [DEBUG] agent.client.memberlist.lan: memberlist: Stream connection from=192.168.178.103:1605
2022-04-28T11:19:47.496Z [DEBUG] agent.client.memberlist.lan: memberlist: Stream connection from=192.168.178.103:13497
2022-04-28T11:19:47.635Z [DEBUG] agent.client.memberlist.lan: memberlist: Failed ping: cluster1-consul-server-0 (timeout reached)
2022-04-28T11:19:48.135Z [WARN] agent.client.memberlist.lan: memberlist: Was able to connect to cluster1-consul-server-0 but other probes failed, network may be misconfigured
2022-04-28T11:19:49.272Z [DEBUG] agent.client.memberlist.lan: memberlist: Stream connection from=192.168.178.103:62091
2022-04-28T11:19:49.635Z [DEBUG] agent.client.memberlist.lan: memberlist: Failed ping: cluster1-consul-server-0 (timeout reached)
2022-04-28T11:19:49.638Z [DEBUG] agent.client.memberlist.lan: memberlist: Initiating push/pull sync with: cluster1-consul-server-0 192.168.178.103:9301
2022-04-28T11:19:49.638Z [INFO] agent.client.serf.lan: serf: EventMemberJoin: k3s-mesh-server 10.42.0.79
Despite this, the CLI (`consul members`) lists them all as alive:
Node                      Address               Status  Type    Build   Protocol  DC   Partition  Segment
cluster1-consul-server-0  192.168.178.103:9301  alive   server  1.12.0  2         dc2  default    <all>
k3s-mesh-client           192.168.178.40:8301   alive   client  1.12.0  2         dc2  default    <default>
k3s-mesh-server           10.42.0.79:8301       alive   client  1.12.0  2         dc2  default    <default>
So I ploughed on and deployed one static-server to Cluster1 and a static-client to each of Cluster1 and Cluster2. On both clusters these live in a namespace called testing.
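For reference, these are essentially the manifests from the Connect tutorial, trimmed down here for brevity (the real files also create the matching Services and ServiceAccounts, which the ACL setup needs). The upstream annotation on static-client is what is supposed to expose static-server on localhost:1234:

```shell
# Trimmed sketch of what was applied in the "testing" namespace
# (Services and ServiceAccounts for both apps omitted for brevity):
kubectl apply -n testing -f - <<'EOF'
apiVersion: apps/v1
kind: Deployment
metadata:
  name: static-server
spec:
  replicas: 1
  selector:
    matchLabels:
      app: static-server
  template:
    metadata:
      labels:
        app: static-server
      annotations:
        consul.hashicorp.com/connect-inject: "true"
    spec:
      serviceAccountName: static-server
      containers:
        - name: static-server
          image: hashicorp/http-echo:latest
          args: ["-text=hello world", "-listen=:8080"]
          ports:
            - containerPort: 8080
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: static-client
spec:
  replicas: 1
  selector:
    matchLabels:
      app: static-client
  template:
    metadata:
      labels:
        app: static-client
      annotations:
        consul.hashicorp.com/connect-inject: "true"
        # This is what wires static-server up to localhost:1234 in the client pod:
        consul.hashicorp.com/connect-service-upstreams: "static-server:1234"
    spec:
      serviceAccountName: static-client
      containers:
        - name: static-client
          image: curlimages/curl:latest
          command: ["/bin/sh", "-c", "while true; do sleep 30; done"]
EOF
```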
All three show up correctly in the Consul UI.
Despite this, curl fails on both clients in the same manner:
$ curl localhost:1234 -vv
* Trying 127.0.0.1:1234...
* Connected to localhost (127.0.0.1) port 1234 (#0)
> GET / HTTP/1.1
> Host: localhost:1234
> User-Agent: curl/7.83.0-DEV
> Accept: */*
>
* Recv failure: Connection reset by peer
* Closing connection 0
curl: (56) Recv failure: Connection reset by peer
Envoy trace log excerpt:
[2022-04-28 11:50:34.650][13][debug][connection] [source/common/network/connection_impl.cc:912] [C1496] connecting to 127.0.0.1:0
[2022-04-28 11:50:34.650][13][debug][connection] [source/common/network/connection_impl.cc:931] [C1496] connection in progress
[2022-04-28 11:50:34.650][13][trace][pool] [source/common/conn_pool/conn_pool_base.cc:131] not creating a new connection, shouldCreateNewConnection returned false.
[2022-04-28 11:50:34.650][13][debug][conn_handler] [source/server/active_tcp_listener.cc:142] [C1495] new connection from 10.42.0.57:51836
[2022-04-28 11:50:34.650][13][trace][connection] [source/common/network/connection_impl.cc:563] [C1495] socket event: 2
[2022-04-28 11:50:34.650][13][trace][connection] [source/common/network/connection_impl.cc:674] [C1495] write ready
[2022-04-28 11:50:34.650][13][trace][connection] [source/extensions/transport_sockets/tls/ssl_handshaker.cc:52] [C1495] ssl error occurred while read: SYSCALL
[2022-04-28 11:50:34.650][13][debug][connection] [source/common/network/connection_impl.cc:250] [C1495] closing socket: 0
[2022-04-28 11:50:34.650][13][trace][connection] [source/common/network/connection_impl.cc:418] [C1495] raising connection event 0
[2022-04-28 11:50:34.650][13][trace][filter] [source/common/tcp_proxy/tcp_proxy.cc:587] [C1495] on downstream event 0, has upstream = true
[2022-04-28 11:50:34.650][13][debug][pool] [source/common/conn_pool/conn_pool_base.cc:600] cancelling pending stream
[2022-04-28 11:50:34.650][13][debug][connection] [source/common/network/connection_impl.cc:139] [C1496] closing data_to_write=0 type=1
[2022-04-28 11:50:34.650][13][debug][connection] [source/common/network/connection_impl.cc:250] [C1496] closing socket: 1
[2022-04-28 11:50:34.650][13][trace][connection] [source/common/network/connection_impl.cc:418] [C1496] raising connection event 1
[2022-04-28 11:50:34.650][13][debug][pool] [source/common/conn_pool/conn_pool_base.cc:439] [C1496] client disconnected, failure reason:
[2022-04-28 11:50:34.650][13][trace][main] [source/common/event/dispatcher_impl.cc:249] item added to deferred deletion list (size=1)
[2022-04-28 11:50:34.650][13][debug][pool] [source/common/conn_pool/conn_pool_base.cc:410] invoking idle callbacks - is_draining_for_deletion_=false
[2022-04-28 11:50:34.650][13][trace][upstream] [source/common/upstream/cluster_manager_impl.cc:1807] Idle pool, erasing pool for host 0x54603f486380
[2022-04-28 11:50:34.650][13][trace][main] [source/common/event/dispatcher_impl.cc:249] item added to deferred deletion list (size=2)
[2022-04-28 11:50:34.650][13][debug][pool] [source/common/conn_pool/conn_pool_base.cc:410] invoking idle callbacks - is_draining_for_deletion_=false
[2022-04-28 11:50:34.650][13][trace][conn_handler] [source/server/active_stream_listener_base.cc:111] [C1495] connection on event 0
[2022-04-28 11:50:34.650][13][debug][conn_handler] [source/server/active_stream_listener_base.cc:120] [C1495] adding to cleanup list
[2022-04-28 11:50:34.650][13][trace][main] [source/common/event/dispatcher_impl.cc:249] item added to deferred deletion list (size=3)
[2022-04-28 11:50:34.650][13][trace][main] [source/common/event/dispatcher_impl.cc:249] item added to deferred deletion list (size=4)
[2022-04-28 11:50:34.650][13][trace][main] [source/common/event/dispatcher_impl.cc:125] clearing deferred deletion list (size=4)
The only parts that really stick out to me are:
[2022-04-28 11:50:34.650][13][trace][connection] [source/extensions/transport_sockets/tls/ssl_handshaker.cc:52] [C1495] ssl error occurred while read: SYSCALL
and
[2022-04-28 11:50:34.650][13][debug][pool] [source/common/conn_pool/conn_pool_base.cc:439] [C1496] client disconnected, failure reason:
but I’m at a loss as to what that means exactly, or whether it’s the culprit.
While the communication between clusters could fail for several reasons, I’d at least expect the static-server and static-client running in the same cluster to talk to each other.
Can anyone shed some light on this, or suggest specific troubleshooting I can perform?