Connection failure in federation between VMs (primary) and Kubernetes

Hi @ishustava1

You’re welcome. In fact, it allowed you to find the origin of the problem right away : )

It was indeed a leader election problem on the dc2 server, thank you. I changed the number of server replicas and the federation went one step further.

dc2 Helm chart values.yml:

  server:
    replicas: 1
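
For reference, this is roughly how I applied the change (the release name and namespace are from my setup, and I assume the hashicorp chart repo is already added):

# re-apply the dc2 values after changing server.replicas (release/namespace names are mine)
helm upgrade consul hashicorp/consul -n consul -f values.yml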

Unfortunately, I’m encountering another error :confused: . If you have time to give your opinion, that would be great!

I recreated and redistributed the dc1 server certificates, adding the node name as explained in the WAN federation documentation. Example:

consul tls cert create -server -dc dc1 -node ceph-2.hirsingue.infra.mydomain.fr
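
and the same for the two other servers, with the node names as they appear in the consul members output below:

consul tls cert create -server -dc dc1 -node ceph-1.hirsingue.infra.mydomain.fr
consul tls cert create -server -dc dc1 -node ceph-3.hirsingue.infra.mydomain.fr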

After this step, dc2 is able to join dc1 and is shown as alive. After a few seconds, however, dc2 is marked as failed.

I think the problem comes from here (dc2 server):

2022-03-22T08:42:47.170Z [INFO]  agent: (WAN) joined: number_of_nodes=1
2022-03-22T08:42:47.170Z [INFO]  agent: Join cluster completed. Synced with initial agents: cluster=WAN num_agents=1
2022-03-22T08:42:47.170Z [INFO]  agent.server: Handled event for server in area: event=member-join server=ceph-2.hirsingue.infra.mydomain.fr.dc1 area=wan
2022-03-22T08:42:47.657Z [ERROR] agent.server.memberlist.wan: memberlist: Failed to send gossip to 192.168.11.11:8302: node name does not encode a datacenter: ceph-2.hirsingue.infra.mydomain.fr.dc1

Can a node name in FQDN format be a problem?

I found this topic, but it doesn’t seem to involve a gossip error: Consul 1.8 - Error during WAN federation
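
If the FQDN format turns out to be the problem, I suppose one way to test would be to start a dc1 server with a short node name (purely hypothetical; the certificates and existing registrations would have to follow):

# hypothetical test only: force a short node name instead of the machine's FQDN
consul agent -server -node=ceph-2 -config-dir=/etc/consul.d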

I have the following errors too (dc2 server):

2022-03-22T08:43:26.147Z [WARN]  agent.fsm: EnsureRegistration failed: error="failed inserting node: Error while renaming Node ID: "cf6436f3-4ee4-9c5a-59eb-c48adbd89ddd": Node name compute-1.hirsingue.infra.mydomain.fr is reserved by node 126b0449-1a92-166b-8ce8-2378bc37b543 with name compute-1.hirsingue.infra.mydomain.fr (10.42.0.123)"
2022-03-22T08:43:26.147Z [ERROR] agent.server: failed to reconcile member: member="{compute-1.hirsingue.infra.mydomain.fr 10.42.0.160 8301 map[build:1.11.2:37c7d06b dc:dc2 id:cf6436f3-4ee4-9c5a-59eb-c48adbd89ddd role:node segment: vsn:2 vsn_max:3 vsn_min:2] alive 1 5 2 2 5 4}" partition=default error="failed inserting node: Error while renaming Node ID: "cf6436f3-4ee4-9c5a-59eb-c48adbd89ddd": Node name compute-1.hirsingue.infra.mydomain.fr is reserved by node 126b0449-1a92-166b-8ce8-2378bc37b543 with name compute-1.hirsingue.infra.mydomain.fr (10.42.0.123)"

Strange fact: the k3s cluster is indeed on compute-1, that part is right, but I can’t find the corresponding record in Consul:

root@ceph-2:~ # consul members -detailed
Node                            Address             Status  Tags
ceph-1.hirsingue.infra.mydomain.fr  192.168.11.10:8301  alive   acls=1,ap=default,build=1.11.2:37c7d06b,dc=dc1,expect=3,ft_fs=1,ft_si=1,id=23de9416-f021-b420-a9ef-6dd73313c54b,port=8300,raft_vsn=3,role=consul,segment=<all>,use_tls=1,vsn=2,vsn_max=3,vsn_min=2,wan_join_port=8302
ceph-2.hirsingue.infra.mydomain.fr  192.168.11.11:8301  alive   acls=1,ap=default,build=1.11.2:37c7d06b,dc=dc1,expect=3,ft_fs=1,ft_si=1,id=d893d4cf-d43f-13e2-38b4-ee593d86d829,port=8300,raft_vsn=3,role=consul,segment=<all>,use_tls=1,vsn=2,vsn_max=3,vsn_min=2,wan_join_port=8302
ceph-3.hirsingue.infra.mydomain.fr  192.168.11.12:8301  alive   acls=1,ap=default,build=1.11.2:37c7d06b,dc=dc1,expect=3,ft_fs=1,ft_si=1,id=0781d8ae-72a5-2a4b-9892-2aba24ea38f7,port=8300,raft_vsn=3,role=consul,segment=<all>,use_tls=1,vsn=2,vsn_max=3,vsn_min=2,wan_join_port=8302
root@ceph-2:~ # consul members -detailed -wan
Node                                Address             Status  Tags
ceph-2.hirsingue.infra.mydomain.fr.dc1  192.168.11.11:8302  alive   acls=1,ap=default,build=1.11.2:37c7d06b,dc=dc1,expect=3,ft_fs=1,ft_si=1,id=d893d4cf-d43f-13e2-38b4-ee593d86d829,port=8300,raft_vsn=3,role=consul,segment=<all>,use_tls=1,vsn=2,vsn_max=3,vsn_min=2
consul-consul-server-0.dc2          10.42.0.164:8302    failed  acls=1,ap=default,bootstrap=1,build=1.11.2:37c7d06b,dc=dc2,ft_fs=1,ft_si=1,id=c6226cd1-b686-5e17-cf23-22bbc5d42e06,port=8300,raft_vsn=3,role=consul,segment=<all>,use_tls=1,vsn=2,vsn_max=3,vsn_min=2
root@ceph-2:~ # consul operator raft list-peers
Node                            ID                                    Address             State     Voter  RaftProtocol
ceph-2.hirsingue.infra.mydomain.fr  d893d4cf-d43f-13e2-38b4-ee593d86d829  192.168.11.11:8300  follower  true   3
ceph-3.hirsingue.infra.mydomain.fr  0781d8ae-72a5-2a4b-9892-2aba24ea38f7  192.168.11.12:8300  follower  true   3
ceph-1.hirsingue.infra.mydomain.fr  23de9416-f021-b420-a9ef-6dd73313c54b  192.168.11.10:8300  leader    true   3

The problem seems to come from dc1: when I recreate the resources in dc2, it is always the same ID that has reserved the compute-1 name (126b0449-1a92-166b-8ce8-2378bc37b543).

I searched the logs of the consul-server and consul-client pods in dc2, but found no trace of that ID…
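
For completeness, I guess the dc2 catalog could also be queried directly to see which node holds that ID (a sketch, assuming the HTTP API on a dc1 server can still forward to dc2 despite the gossip errors):

# list the dc2 nodes (with their IDs) through the catalog API; a token is probably needed since ACLs are enabled
curl -s 'http://127.0.0.1:8500/v1/catalog/nodes?dc=dc2'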

On the dc1 side, the only relevant information in the logs is the following (ceph-2):

2022-03-22T09:42:47.169+0100 [INFO]  agent.server.serf.wan: serf: EventMemberJoin: consul-consul-server-0.dc2 10.42.0.164
2022-03-22T09:42:47.170+0100 [INFO]  agent.server: Handled event for server in area: event=member-join server=consul-consul-server-0.dc2 area=wan
2022-03-22T09:42:47.257+0100 [ERROR] agent.server.memberlist.wan: memberlist: Failed to send gossip to 10.42.0.164:8302: read tcp 192.168.11.11:56940->192.168.11.10:8555: read: connection reset by peer
[...]
2022-03-22T09:43:15.757+0100 [ERROR] agent.server.memberlist.wan: memberlist: Failed to send compound ping and suspect message to 10.42.0.164:8302: read tcp 192.168.11.11:56990->192.168.11.10:8555: read: connection reset by peer
2022-03-22T09:43:40.756+0100 [INFO]  agent.server.memberlist.wan: memberlist: Marking consul-consul-server-0.dc2 as failed, suspect timeout reached (0 peer confirmations)
2022-03-22T09:43:40.756+0100 [INFO]  agent.server.serf.wan: serf: EventMemberFailed: consul-consul-server-0.dc2 10.42.0.164
2022-03-22T09:43:40.757+0100 [INFO]  agent.server.memberlist.wan: memberlist: Suspect consul-consul-server-0.dc2 has failed, no acks received
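
The WAN traffic towards the dc2 server seems to be routed through 192.168.11.10:8555 (I assume this is the mesh gateway side of the setup), so a basic reachability check of that port from ceph-2 might be worth doing; something like:

# quick reachability check of what I believe is the gateway port towards dc2
nc -vz 192.168.11.10 8555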

I also notice that consul members -wan gives a different result depending on the node it is run on, but I guess this is a consequence of the gossip errors above.
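
For what it’s worth, this is roughly how I compared the outputs, assuming SSH access to the three dc1 servers:

for h in ceph-1 ceph-2 ceph-3; do ssh root@${h}.hirsingue.infra.mydomain.fr consul members -wan; done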

Thanks for your help : )

Additional file: