Connection failure in federation between VMs (primary) and Kubernetes

Hello guys,

I have a federation issue between a k8s cluster and VMs as follows.

I'm trying to set up a federated Consul datacenter (named dc2) on k3s with Helm. The primary Consul datacenter (named dc1) runs on VMs.

To achieve this, I followed this guide: Kubernetes and VM cluster federation

So, on top of a running 3-node Consul cluster (dc1), I have:

  • created a built-in CA (consul tls ca create) and copied the CA certificate to each dc1 server.
  • created a certificate/key per dc1 server and deployed them (consul tls cert create -server -dc dc1)
  • updated the configuration of the dc1 servers to use TLS:
    "verify_incoming": true,
    "verify_incoming_rpc": true,
    "verify_outgoing": true,
    "verify_server_hostname": true,
  • updated the configuration of the dc1 servers to enable federation through mesh gateways:
   "connect": {
        "ca_provider": "consul",
        "enable_mesh_gateway_wan_federation": true
    },
    "primary_datacenter": "dc1"
  • created the federation secret on the k3s cluster:
kubectl create secret -n consul generic consul-federation \
        --from-literal=caCert="$(cat consul-agent-ca.pem)" \
        --from-literal=caKey="$(cat consul-agent-ca-key.pem)" \
        --from-literal=replicationToken=<my_replication_token> \
        --from-literal=gossipEncryptionKey=<my_gossip_encryption_key>
  • registered and started the dc1 mesh gateway:
consul connect envoy -gateway=mesh -register \
                     -service "gateway-dc1" \
                     -address "192.168.11.10:8555" \
                     -wan-address "192.168.11.10:8555" \
                     -token=<my_meshgateway_dc1_acl_token> \
                     -expose-servers
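
Putting the TLS and connect pieces together, each dc1 server config ends up looking roughly like this. This is just a sketch of my own layout: the file names are the defaults produced by consul tls ca/cert create (the -0 index differs per server), and the /etc/consul.d paths are assumptions, adapt as needed:

    {
      "verify_incoming": true,
      "verify_incoming_rpc": true,
      "verify_outgoing": true,
      "verify_server_hostname": true,
      "ca_file": "/etc/consul.d/consul-agent-ca.pem",
      "cert_file": "/etc/consul.d/dc1-server-consul-0.pem",
      "key_file": "/etc/consul.d/dc1-server-consul-0-key.pem",
      "connect": {
        "ca_provider": "consul",
        "enable_mesh_gateway_wan_federation": true
      },
      "primary_datacenter": "dc1"
    }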

The dc1 datacenter is composed of three VMs with IPs 192.168.11.10, 192.168.11.11, and 192.168.11.12. The dc1 mesh gateway is on 192.168.11.10. The k3s cluster is single-node and its IP is 192.168.11.20.

These 4 instances can ping each other and there are no firewall restrictions. For example, I can ping the dc1 mesh gateway from one of the Consul server containers in k3s.

However, the dc2 Consul server shows the following errors at startup (complete logs are in a file at the end to avoid flooding the post):

2022-03-16T22:44:50.751Z [INFO]  agent.server.gateway_locator: updated fallback list of primary mesh gateways: mesh_gateways=[192.168.11.10:8555]
2022-03-16T22:44:50.751Z [INFO]  agent: Refreshing mesh gateways completed
2022-03-16T22:44:50.751Z [INFO]  agent: Retry join is supported for the following discovery methods: cluster=WAN discovery_methods="aliyun aws azure digitalocean gce k8s linode mdns os packet scaleway softlayer tencentcloud triton vsphere"
2022-03-16T22:44:50.751Z [INFO]  agent: Joining cluster...: cluster=WAN
2022-03-16T22:44:50.751Z [INFO]  agent: (WAN) joining: wan_addresses=[*.dc1/192.0.2.2]
2022-03-16T22:44:50.751Z [WARN]  agent: (WAN) couldn't join: number_of_nodes=0 error="1 error occurred:
	* Failed to join 192.0.2.2:8302: Remote DC has no server currently reachable

"
2022-03-16T22:44:50.751Z [WARN]  agent: Join cluster failed, will retry: cluster=WAN retry_interval=30s error=<nil>

I was confused by the TEST-NET-1 address 192.0.2.2 in the logs, but it seems to be normal (according to agent/retry_join.go:68).

I also have other pods that are pending because of missing ACLs:

2022-03-17T11:17:42.784Z [ERROR] Failure: calling /agent/self to get datacenter: err="Unexpected response code: 403 (ACL not found)"
2022-03-17T11:17:42.784Z [INFO]  Retrying in 1s

but I guess it’s normal as the first step is to connect to the primary dc and then sync the ACLs.

After checking all my config, I'm out of ideas; if you have a clue I'd be happy to hear about it :slight_smile:

Additional files, if they can help:

Hey @root

Thanks so much for this detailed description.

From looking at the logs of the consul servers in dc2, it seems like the main issue is that there’s no leader in the dc2 cluster. There first has to be a leader before federation can be established. Specifically, these logs look problematic:

agent.server.memberlist.lan: memberlist: Failed to resolve consul-consul-server-1.consul-consul-server.consul.svc:8301: lookup consul-consul-server-1.consul-consul-server.consul.svc on 10.43.0.10:53: no such host
2022-03-16T22:45:21.453Z [WARN]  agent.server.memberlist.lan: memberlist: Failed to resolve consul-consul-server-2.consul-consul-server.consul.svc:8301: lookup consul-consul-server-2.consul-consul-server.consul.svc on 10.43.0.10:53: no such host

Does this problem persist in the logs, or does it eventually recover and you see that a leader has been elected?

Otherwise, your configuration looks good to me.

Hi @ishustava1

You’re welcome. In fact, it allowed you to find the origin of the problem right away : )

It was indeed a problem in the election of a leader server in dc2, thank you. I changed the number of replicas for the server and the federation went one step further.

dc2 Helm chart values.yml:

  server:
    replicas: 1
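
A quick way to confirm that dc2 now has a leader is to check the raft peers from the server pod (once ACLs are bootstrapped you may need to pass a -token):

    kubectl exec -n consul consul-consul-server-0 -- consul operator raft list-peers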

Unfortunately, I'm encountering another error :confused: ; if you have time to give your opinion, that would be great!

I recreated/redistributed the certificates of the dc1 servers, adding the name of the node as explained in the WAN federation documentation. Example:

consul tls cert create -server -dc dc1 -node ceph-2.hirsingue.infra.mydomain.fr
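
To double-check which names actually ended up in the regenerated certificates, the SANs can be inspected with openssl (adjust the file name to the cert you just generated; the -N index increments on every consul tls cert create run):

    openssl x509 -in dc1-server-consul-0.pem -noout -text | grep -A1 "Subject Alternative Name"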

After this step, dc2 is able to join dc1 and is displayed as alive. After a few seconds, however, dc2 is marked as failed.

I think the problem comes from here (dc2 server):

2022-03-22T08:42:47.170Z [INFO]  agent: (WAN) joined: number_of_nodes=1
2022-03-22T08:42:47.170Z [INFO]  agent: Join cluster completed. Synced with initial agents: cluster=WAN num_agents=1
2022-03-22T08:42:47.170Z [INFO]  agent.server: Handled event for server in area: event=member-join server=ceph-2.hirsingue.infra.mydomain.fr.dc1 area=wan
2022-03-22T08:42:47.657Z [ERROR] agent.server.memberlist.wan: memberlist: Failed to send gossip to 192.168.11.11:8302: node name does not encode a datacenter: ceph-2.hirsingue.infra.mydomain.fr.dc1

Can a node name in FQDN format be a problem?

I found this related topic, but it doesn't seem to involve a gossip error: Consul 1.8 - Error during WAN federation

I have the following errors too (dc2 server):

2022-03-22T08:43:26.147Z [WARN]  agent.fsm: EnsureRegistration failed: error="failed inserting node: Error while renaming Node ID: "cf6436f3-4ee4-9c5a-59eb-c48adbd89ddd": Node name compute-1.hirsingue.infra.mydomain.fr is reserved by node 126b0449-1a92-166b-8ce8-2378bc37b543 with name compute-1.hirsingue.infra.mydomain.fr (10.42.0.123)"
2022-03-22T08:43:26.147Z [ERROR] agent.server: failed to reconcile member: member="{compute-1.hirsingue.infra.mydomain.fr 10.42.0.160 8301 map[build:1.11.2:37c7d06b dc:dc2 id:cf6436f3-4ee4-9c5a-59eb-c48adbd89ddd role:node segment: vsn:2 vsn_max:3 vsn_min:2] alive 1 5 2 2 5 4}" partition=default error="failed inserting node: Error while renaming Node ID: "cf6436f3-4ee4-9c5a-59eb-c48adbd89ddd": Node name compute-1.hirsingue.infra.mydomain.fr is reserved by node 126b0449-1a92-166b-8ce8-2378bc37b543 with name compute-1.hirsingue.infra.mydomain.fr (10.42.0.123)"

Strange fact: the k3s cluster is indeed on compute-1, but I can't find the conflicting record in Consul:

root@ceph-2:~ # consul members -detailed
Node                            Address             Status  Tags
ceph-1.hirsingue.infra.mydomain.fr  192.168.11.10:8301  alive   acls=1,ap=default,build=1.11.2:37c7d06b,dc=dc1,expect=3,ft_fs=1,ft_si=1,id=23de9416-f021-b420-a9ef-6dd73313c54b,port=8300,raft_vsn=3,role=consul,segment=<all>,use_tls=1,vsn=2,vsn_max=3,vsn_min=2,wan_join_port=8302
ceph-2.hirsingue.infra.mydomain.fr  192.168.11.11:8301  alive   acls=1,ap=default,build=1.11.2:37c7d06b,dc=dc1,expect=3,ft_fs=1,ft_si=1,id=d893d4cf-d43f-13e2-38b4-ee593d86d829,port=8300,raft_vsn=3,role=consul,segment=<all>,use_tls=1,vsn=2,vsn_max=3,vsn_min=2,wan_join_port=8302
ceph-3.hirsingue.infra.mydomain.fr  192.168.11.12:8301  alive   acls=1,ap=default,build=1.11.2:37c7d06b,dc=dc1,expect=3,ft_fs=1,ft_si=1,id=0781d8ae-72a5-2a4b-9892-2aba24ea38f7,port=8300,raft_vsn=3,role=consul,segment=<all>,use_tls=1,vsn=2,vsn_max=3,vsn_min=2,wan_join_port=8302
root@ceph-2:~ # consul members -detailed -wan
Node                                Address             Status  Tags
ceph-2.hirsingue.infra.mydomain.fr.dc1  192.168.11.11:8302  alive   acls=1,ap=default,build=1.11.2:37c7d06b,dc=dc1,expect=3,ft_fs=1,ft_si=1,id=d893d4cf-d43f-13e2-38b4-ee593d86d829,port=8300,raft_vsn=3,role=consul,segment=<all>,use_tls=1,vsn=2,vsn_max=3,vsn_min=2
consul-consul-server-0.dc2          10.42.0.164:8302    failed  acls=1,ap=default,bootstrap=1,build=1.11.2:37c7d06b,dc=dc2,ft_fs=1,ft_si=1,id=c6226cd1-b686-5e17-cf23-22bbc5d42e06,port=8300,raft_vsn=3,role=consul,segment=<all>,use_tls=1,vsn=2,vsn_max=3,vsn_min=2
root@ceph-2:~ # consul operator raft list-peers
Node                            ID                                    Address             State     Voter  RaftProtocol
ceph-2.hirsingue.infra.mydomain.fr  d893d4cf-d43f-13e2-38b4-ee593d86d829  192.168.11.11:8300  follower  true   3
ceph-3.hirsingue.infra.mydomain.fr  0781d8ae-72a5-2a4b-9892-2aba24ea38f7  192.168.11.12:8300  follower  true   3
ceph-1.hirsingue.infra.mydomain.fr  23de9416-f021-b420-a9ef-6dd73313c54b  192.168.11.10:8300  leader    true   3

The problem seems to come from dc1: when I recreate the resources in dc2, it is the same ID (126b0449-1a92-166b-8ce8-2378bc37b543) that has reserved the compute-1 name.

I searched the logs of the consul-server and consul-client pods in dc2, but found no such ID…
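
If it is only a stale registration, I suppose it could be cleared with the catalog deregister endpoint (untested on my side; it needs a token with node:write in dc2):

    curl -sS -X PUT -H "X-Consul-Token: $CONSUL_HTTP_TOKEN" \
      -d '{"Datacenter": "dc2", "Node": "compute-1.hirsingue.infra.mydomain.fr"}' \
      http://127.0.0.1:8500/v1/catalog/deregister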

On the dc1 side, the only relevant entries in the logs are the following (ceph-2):

2022-03-22T09:42:47.169+0100 [INFO]  agent.server.serf.wan: serf: EventMemberJoin: consul-consul-server-0.dc2 10.42.0.164
2022-03-22T09:42:47.170+0100 [INFO]  agent.server: Handled event for server in area: event=member-join server=consul-consul-server-0.dc2 area=wan
2022-03-22T09:42:47.257+0100 [ERROR] agent.server.memberlist.wan: memberlist: Failed to send gossip to 10.42.0.164:8302: read tcp 192.168.11.11:56940->192.168.11.10:8555: read: connection reset by peer
[...]
2022-03-22T09:43:15.757+0100 [ERROR] agent.server.memberlist.wan: memberlist: Failed to send compound ping and suspect message to 10.42.0.164:8302: read tcp 192.168.11.11:56990->192.168.11.10:8555: read: connection reset by peer
2022-03-22T09:43:40.756+0100 [INFO]  agent.server.memberlist.wan: memberlist: Marking consul-consul-server-0.dc2 as failed, suspect timeout reached (0 peer confirmations)
2022-03-22T09:43:40.756+0100 [INFO]  agent.server.serf.wan: serf: EventMemberFailed: consul-consul-server-0.dc2 10.42.0.164
2022-03-22T09:43:40.757+0100 [INFO]  agent.server.memberlist.wan: memberlist: Suspect consul-consul-server-0.dc2 has failed, no acks received

I also notice that consul members -wan gives a different result depending on which node I run it on, but I guess this is a consequence of the gossip errors above.

Thanks for your help : )

Additional file:

Hey @root

I’m glad we resolved the first issue.

It looks like you’re right, and consul WAN federation does not work with FQDNs as node names:

This looks like a potential bug in Consul, so I’d recommend reporting it in github.com/hashicorp/consul as an issue.

Hi @ishustava1,

Thanks for the reply!

I will open an issue on GitHub. As a workaround for now, I changed my nodes' names and recreated the certificates to match them (e.g. ceph-1.hirsingue.infra.mydomain.fr becomes ceph-1).
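
Concretely, that means regenerating one certificate per dc1 server with the short node names, e.g.:

    consul tls cert create -server -dc dc1 -node ceph-1
    consul tls cert create -server -dc dc1 -node ceph-2
    consul tls cert create -server -dc dc1 -node ceph-3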

I found the following topic: Mesh Gateway federation woes!.
So I have:

  • changed the dc2 mesh gateway service to a NodePort so that it is exposed outside the cluster (see the quick reachability check after this list):
  meshGateway:
    enabled: true
    replicas: 1
    service:
      nodePort: 30555
      enabled: true
      type: NodePort
    wanAddress:
      enabled: true
      source: "Static"
      static: "192.168.11.20"
      port: 30555
  • created a ProxyDefaults config. Unless I'm mistaken, this only affects services later on and doesn't influence the federation, right?
apiVersion: consul.hashicorp.com/v1alpha1
kind: ProxyDefaults
metadata:
  name: global
spec:
  meshGateway:
    mode: local
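
As mentioned above, a quick way to verify that the mesh gateway NodePort is reachable from the dc1 VMs is a plain TCP check (here with netcat; any equivalent works):

    # run from ceph-1 / ceph-2 / ceph-3
    nc -vz 192.168.11.20 30555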

The second DC, however, still runs into an error: it doesn't get responses to its requests to dc1.

dc2 consul server logs:

2022-03-23T16:33:31.488Z [INFO]  agent.server.serf.wan: serf: EventMemberJoin: ceph-2.dc1 192.168.11.11
2022-03-23T16:33:31.488Z [INFO]  agent.server: Handled event for server in area: event=member-join server=ceph-2.dc1 area=wan
2022-03-23T16:34:10.054Z [INFO]  agent.server.memberlist.wan: memberlist: Suspect ceph-3.dc1 has failed, no acks received
2022-03-23T16:34:31.494Z [WARN]  agent.server.memberlist.wan: memberlist: Refuting a suspect message (from: consul-consul-server-0.dc2)
2022-03-23T16:34:50.055Z [INFO]  agent.server.memberlist.wan: memberlist: Suspect ceph-2.dc1 has failed, no acks received
2022-03-23T16:35:20.056Z [INFO]  agent.server.memberlist.wan: memberlist: Marking ceph-2.dc1 as failed, suspect timeout reached (0 peer confirmations)
2022-03-23T16:35:20.056Z [INFO]  agent.server.serf.wan: serf: EventMemberFailed: ceph-2.dc1 192.168.11.11

I guess dc1 is trying to contact dc2's server on its k8s "private" IP (10.42.0.77), which is obviously not reachable. When I list the members:

root@ceph-2:~ # consul members -wan
Node                        Address             Status  Type    Build   Protocol  DC   Partition  Segment
ceph-1.dc1                  192.168.11.10:8302  alive   server  1.11.2  2         dc1  default    <all>
ceph-2.dc1                  192.168.11.11:8302  alive   server  1.11.2  2         dc1  default    <all>
ceph-3.dc1                  192.168.11.12:8302  alive   server  1.11.2  2         dc1  default    <all>
consul-consul-server-0.dc2  10.42.0.77:8302     alive   server  1.11.2  2         dc2  default    <all>

On the dc1 server side, I have the following logs:

2022-03-23T17:12:38.483+0100 [ERROR] agent.server.memberlist.wan: memberlist: Failed to send gossip to 10.42.0.77:8302: read tcp 192.168.11.10:42026->192.168.11.10:30555: read: connection reset by peer
2022-03-23T17:12:38.787+0100 [ERROR] agent.server.rpc: RPC failed to server in DC: server=10.42.0.77:8300 datacenter=dc2 method=Internal.ServiceDump error="rpc error getting client: failed to get conn: read tcp 192.168.11.10:54879->192.168.11.10:30555: read: connection reset by peer"
2022-03-23T17:12:38.817+0100 [ERROR] agent.server.rpc: RPC failed to server in DC: server=10.42.0.77:8300 datacenter=dc2 method=Internal.ServiceDump error="rpc error getting client: failed to get conn: read tcp 192.168.11.10:34443->192.168.11.10:30555: read: connection reset by peer"
2022-03-23T17:12:40.978+0100 [ERROR] agent.server.rpc: RPC failed to server in DC: server=10.42.0.77:8300 datacenter=dc2 method=Internal.ServiceDump error="rpc error getting client: failed to get conn: read tcp 192.168.11.10:35321->192.168.11.10:30555: read: connection reset by peer"

I have no idea why dc1 does not use the IP 192.168.11.20.

Thanks for your help :slight_smile:

I ran a tcpdump dst 10.42.0.77 (the k8s internal IP of the dc2 server pod) on ceph-1 to check whether the mesh gateway routes traffic to the unreachable IP. Nothing shows up in this capture, so maybe something else is going wrong.

It turns out that the dc2 mesh gateway is not OK:

Error registering service "mesh-gateway": Unexpected response code: 403 (ACL not found)

But I just verified by logging into the Consul UI that dc2 has its ACLs properly initialized. I can see the dc1 tokens there, for example, so to me this part looks fine.

I am looking in this direction.

It was indeed a problem with the mesh gateway not being able to connect to the dc2 server.

I didn’t know that but:

  • deleting the Helm release resources does not delete all the secrets
  • reinstalling the Helm chart does not update the secrets if they are already present (more inconvenient; is this a known issue?)

In the end, the consul-consul-mesh-gateway-acl-token secret was never updated between fresh Consul bootstraps in dc2, and therefore its token did not allow the gateway to connect to the server.
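
So between reinstalls it is worth listing the leftovers and deleting the stale ones by hand (the names below assume the default consul namespace and the release name used here):

    kubectl get secrets -n consul | grep consul-consul
    kubectl delete secret -n consul consul-consul-mesh-gateway-acl-token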


I have one last question. I was following these steps to check that everything is working properly (Doc: Verifying Federation) and I noticed that the Consul CLI returns nothing about dc2:

kubectl exec -it -n consul consul-consul-server-0 -- sh
/ $ consul members
/ $ consul catalog services -datacenter dc2
No services match the given query - try expanding your search.
/ $ consul catalog services -datacenter dc1
consul
gateway-dc1
root@ceph-1:~ # consul members -wan
Node                        Address             Status  Type    Build   Protocol  DC   Partition  Segment
ceph-1.dc1                  192.168.11.10:8302  alive   server  1.11.2  2         dc1  default    <all>
ceph-2.dc1                  192.168.11.11:8302  alive   server  1.11.2  2         dc1  default    <all>
ceph-3.dc1                  192.168.11.12:8302  alive   server  1.11.2  2         dc1  default    <all>
consul-consul-server-0.dc2  10.42.0.89:8302     alive   server  1.11.2  2         dc2  default    <all>
root@ceph-1:~ # consul catalog services -datacenter dc1
consul
gateway-dc1
root@ceph-1:~ # consul catalog services -datacenter dc2
No services match the given query - try expanding your search.

Is this normal behavior?

In my opinion, everything is OK now, because in the UI I can navigate through both DCs and see nodes and services. I just wanted to make sure this wasn't hiding a real problem before going any further.

Thanks!

FYI we are building a consul-k8s CLI (Installing the Consul K8s CLI | Consul by HashiCorp) to better handle the issue with secrets not getting cleaned up.

I think this might be due to the anonymous token policy on dc1 (ACL System | Consul by HashiCorp). For cross-dc service mesh communication it needs to be set to:

    node_prefix "" {
       policy = "read"
    }
    service_prefix "" {
       policy = "read"
    }
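
Something along these lines should apply it (the policy and file names are arbitrary placeholders; 00000000-0000-0000-0000-000000000002 is the anonymous token's accessor ID):

    # save the rules above as anonymous-token-policy.hcl, then:
    consul acl policy create -name anonymous-token-policy -rules @anonymous-token-policy.hcl
    consul acl token update -id 00000000-0000-0000-0000-000000000002 -policy-name anonymous-token-policy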

Hi @lkysow,

ok, great news!

You're right, I tried with a different token (simply using the -token flag on the CLI) and it worked :tada:

It's done, here is the link: Wan federation doesn't work with FQDN node name · Issue #12614 · hashicorp/consul · GitHub

Thank you both for your help!