Connection failure in federation between VMs (primary) and Kubernetes

Hi @ishustava1,

Thanks for the reply!

I will open an issue on GitHub. For now I found a workaround: I changed my nodes’ names and recreated the certificates to match them (e.g. ceph-1.hirsingue.infra.mydomain.fr became ceph-1).
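
For reference, recreating a server certificate for the new short node name looks roughly like this, assuming the certs come from the built-in consul tls CLI (adjust if you use your own CA or different paths):

# Sketch: regenerate a dc1 server cert whose SANs match the new short node name
# (run where the consul-agent-ca.pem / -key.pem files live)
consul tls cert create -server -dc dc1 -node ceph-1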

I found the following topic: “Mesh Gateway federation woes!”.
So I have:

  • changed dc2’s mesh gateway service to a NodePort so that it is exposed outside the cluster (see the reachability check after this list).
  meshGateway:
    enabled: true
    replicas: 1
    service:
      nodePort: 30555
      enabled: true
      type: NodePort
    wanAddress:
      enabled: true
      source: "Static"
      static: "192.168.11.20"
      port: 30555
  • created a ProxyDefaults config. Unless I’m mistaken, this only has an impact on services later on and doesn’t influence the federation, right?
apiVersion: consul.hashicorp.com/v1alpha1
kind: ProxyDefaults
metadata:
  name: global
spec:
  meshGateway:
    mode: local
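
As a sanity check for the mesh gateway NodePort mentioned in the first point, something like this can be run; the service name and namespace are assumptions based on the consul-consul-* naming in the logs:

# Confirm the mesh gateway is exposed as a NodePort on 30555
kubectl get svc consul-consul-mesh-gateway -n consul

# From a dc1 server VM, verify the node address/port is reachable
nc -vz 192.168.11.20 30555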

The second DC, however, still encounters an error: it doesn’t get responses to its requests to dc1.

dc2 consul server logs:

2022-03-23T16:33:31.488Z [INFO]  agent.server.serf.wan: serf: EventMemberJoin: ceph-2.dc1 192.168.11.11
2022-03-23T16:33:31.488Z [INFO]  agent.server: Handled event for server in area: event=member-join server=ceph-2.dc1 area=wan
2022-03-23T16:34:10.054Z [INFO]  agent.server.memberlist.wan: memberlist: Suspect ceph-3.dc1 has failed, no acks received
2022-03-23T16:34:31.494Z [WARN]  agent.server.memberlist.wan: memberlist: Refuting a suspect message (from: consul-consul-server-0.dc2)
2022-03-23T16:34:50.055Z [INFO]  agent.server.memberlist.wan: memberlist: Suspect ceph-2.dc1 has failed, no acks received
2022-03-23T16:35:20.056Z [INFO]  agent.server.memberlist.wan: memberlist: Marking ceph-2.dc1 as failed, suspect timeout reached (0 peer confirmations)
2022-03-23T16:35:20.056Z [INFO]  agent.server.serf.wan: serf: EventMemberFailed: ceph-2.dc1 192.168.11.11

I guess dc1 is trying to contact dc2’s server on its Kubernetes-internal “private” IP (10.42.0.77), which is obviously not reachable from the VMs. When I list the WAN members:

root@ceph-2:~ # consul members -wan
Node                        Address             Status  Type    Build   Protocol  DC   Partition  Segment
ceph-1.dc1                  192.168.11.10:8302  alive   server  1.11.2  2         dc1  default    <all>
ceph-2.dc1                  192.168.11.11:8302  alive   server  1.11.2  2         dc1  default    <all>
ceph-3.dc1                  192.168.11.12:8302  alive   server  1.11.2  2         dc1  default    <all>
consul-consul-server-0.dc2  10.42.0.77:8302     alive   server  1.11.2  2         dc2  default    <all>

On the dc1 server side, I have the following logs:

2022-03-23T17:12:38.483+0100 [ERROR] agent.server.memberlist.wan: memberlist: Failed to send gossip to 10.42.0.77:8302: read tcp 192.168.11.10:42026->192.168.11.10:30555: read: connection reset by peer
2022-03-23T17:12:38.787+0100 [ERROR] agent.server.rpc: RPC failed to server in DC: server=10.42.0.77:8300 datacenter=dc2 method=Internal.ServiceDump error="rpc error getting client: failed to get conn: read tcp 192.168.11.10:54879->192.168.11.10:30555: read: connection reset by peer"
2022-03-23T17:12:38.817+0100 [ERROR] agent.server.rpc: RPC failed to server in DC: server=10.42.0.77:8300 datacenter=dc2 method=Internal.ServiceDump error="rpc error getting client: failed to get conn: read tcp 192.168.11.10:34443->192.168.11.10:30555: read: connection reset by peer"
2022-03-23T17:12:40.978+0100 [ERROR] agent.server.rpc: RPC failed to server in DC: server=10.42.0.77:8300 datacenter=dc2 method=Internal.ServiceDump error="rpc error getting client: failed to get conn: read tcp 192.168.11.10:35321->192.168.11.10:30555: read: connection reset by peer"

I have no idea why dc1 does not use the IP 192.168.11.20 that I configured as the mesh gateway’s static WAN address.
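
If it helps to diagnose, the address that dc2’s mesh gateway actually registers can be inspected through the catalog API (sketch only; it assumes the HTTP API is still served on port 8500 — use https on 8501 if TLS-only is enabled — and the default mesh-gateway service name):

# Forward the dc2 server's HTTP API to the local machine
kubectl -n consul port-forward consul-consul-server-0 8500:8500 &

# The mesh gateway's WAN tagged address should show 192.168.11.20:30555
curl -s http://127.0.0.1:8500/v1/catalog/service/mesh-gateway | \
  jq '.[0].ServiceTaggedAddresses'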

Thanks for your help :slight_smile: