Federation stops working after consul service/server restarted on primary cluster

bvikhe1 · April 1, 2022, 12:39pm

Hi Team,

Can somebody help us with the issue below? We have spent almost a week trying to fix it, with no luck.

Overview of the Issue

We have two clusters: the primary on VMs and the secondary on Kubernetes. Federation via mesh gateways works and all communication is as expected. However, if we restart any host in the primary Consul cluster, or restart the Consul service on any primary host, federation stops working.

Reproduction Steps

Deployed the primary cluster (3-node cluster on virtual machines).
Deployed Mesh, Ingress, and Terminating gateways on the primary cluster.
Deployed the secondary cluster (consul connect Helm Chart on Kubernetes).

Up to this point, everything works.

Below is the result on server01 of the primary Consul cluster:

# consul members -wan
Node                                    Address    Status  Type    Build   Protocol  DC            Partition  Segment
server01.pod_name.secondary_dc_name     <pod_ip>:  alive   server  1.11.1  2         secondary_dc  default
server02.pod_name.secondary_dc_name     <pod_ip>:  alive   server  1.11.1  2         secondary_dc  default
server03.pod_name.secondary_dc_name     <pod_ip>:  alive   server  1.11.1  2         secondary_dc  default
server01.<server_name>.primary_dc_name  :          alive   server  1.11.1  2         primary_dc    default
server02.<server_name>.primary_dc_name  :          alive   server  1.11.1  2         primary_dc    default
server03.<server_name>.primary_dc_name  :          alive   server  1.11.1  2         primary_dc    default

We then restarted the Consul service on server01.<server_name>.primary_dc_name:

systemctl restart consul.service

Now the same command on the same server01 of the primary Consul cluster lists only the local server:

# consul members -wan
Node                                    Address    Status  Type    Build   Protocol  DC            Partition  Segment
server01.<server_name>.primary_dc_name  :          alive   server  1.11.1  2         primary_dc    default
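
The before/after comparison above can be checked mechanically. Below is a minimal sketch in Python that counts alive WAN members per datacenter from pasted `consul members -wan` text and reports which expected datacenters are missing. The helper name `count_wan_members_by_dc` and the `expected_dcs` set are hypothetical, chosen for illustration; the sample text is the post-restart output shown above.

```python
from collections import Counter


def count_wan_members_by_dc(members_output: str) -> Counter:
    """Count alive members per datacenter in `consul members -wan` text.

    Assumes the first non-empty line is the header row and that columns
    are whitespace-separated, as in the output pasted in this post.
    """
    lines = [ln for ln in members_output.splitlines() if ln.strip()]
    counts = Counter()
    if not lines:
        return counts
    header = lines[0].split()
    dc_col = header.index("DC")
    status_col = header.index("Status")
    for line in lines[1:]:
        fields = line.split()
        if len(fields) <= dc_col:
            continue  # skip rows too short to contain a DC column
        if fields[status_col] == "alive":
            counts[fields[dc_col]] += 1
    return counts


# Output captured on server01 after the restart (copied from above).
after_restart = """\
Node Address Status Type Build Protocol DC Partition Segment
server01.<server_name>.primary_dc_name : alive server 1.11.1 2 primary_dc default
"""

# Hypothetical list of datacenters that should be federated.
expected_dcs = {"primary_dc", "secondary_dc"}

counts = count_wan_members_by_dc(after_restart)
missing = expected_dcs - set(counts)
print(sorted(missing))  # prints ['secondary_dc']
```

Running it against the pre-restart output instead should report no missing datacenters, which makes it easy to confirm when federation has recovered.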

Consul info for both Client and Server

Consul version: v1.11.1 (both client and server)
Architecture: primary cluster on VM and secondary cluster on Kubernetes.
Federation is enabled using mesh gateways on both clusters.
Consul connect Helm Chart version: 0.39.0

Output of 'consul info' on a client:

agent:
    check_monitors = 0
    check_ttls = 0
    checks = 1
    services = 1
build:
    prerelease =
    revision = 2c56447
    version = 1.11.1
consul:
    acl = disabled
    known_servers = 3
    server = false
runtime:
    arch = amd64
    cpu_count = 4
    goroutines = 115
    max_procs = 4
    os = linux
    version = go1.17.5
serf_lan:
    coordinate_resets = 0
    encrypted = true
    event_queue = 0
    event_time = 11
    failed = 0
    health_score = 0
    intent_queue = 0
    left = 0
    member_time = 64
    members = 8
    query_queue = 0
    query_time = 1

Server info

Output of 'consul info' on a server:

agent:
    check_monitors = 0
    check_ttls = 0
    checks = 0
    services = 0
build:
    prerelease =
    revision = 2c56447
    version = 1.11.1
consul:
    acl = enabled
    bootstrap = false
    known_datacenters = 2
    leader = false
    leader_addr = 10.44.182.126:8300
    server = true
raft:
    applied_index = 6792
    commit_index = 6792
    fsm_pending = 0
    last_contact = 7.283558ms
    last_log_index = 6792
    last_log_term = 11
    last_snapshot_index = 0
    last_snapshot_term = 0
    latest_configuration = [{Suffrage:Voter ID:f70d584f-b24d-d5e6-6bc8-78f75ef4de90 Address:10.44.182.125:8300} {Suffrage:Voter ID:c25e8171-86d3-0e88-a2e5-36865157b4a0 Address:10.44.182.126:8300} {Suffrage:Voter ID:270f5410-3ea3-79e0-441a-d8451677084e Address:10.44.182.127:8300}]
    latest_configuration_index = 0
    num_peers = 2
    protocol_version = 3
    protocol_version_max = 3
    protocol_version_min = 0
    snapshot_version_max = 1
    snapshot_version_min = 0
    state = Follower
    term = 11
runtime:
    arch = amd64
    cpu_count = 4
    goroutines = 195
    max_procs = 4
    os = linux
    version = go1.17.5
serf_lan:
    coordinate_resets = 0
    encrypted = true
    event_queue = 0
    event_time = 11
    failed = 0
    health_score = 0
    intent_queue = 0
    left = 0
    member_time = 64
    members = 8
    query_queue = 0
    query_time = 1
serf_wan:
    coordinate_resets = 0
    encrypted = true
    event_queue = 0
    event_time = 1
    failed = 0
    health_score = 7
    intent_queue = 0
    left = 0
    member_time = 17
    members = 6
    query_queue = 0
    query_time = 1
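
For anyone comparing these dumps across servers, here is a minimal Python sketch that turns `consul info` text into a nested dictionary so that the `serf_wan` values can be compared before and after a restart. `parse_consul_info` is a hypothetical helper, not a Consul API; it assumes section headers end with a colon and every other line is `key = value`, as in the output above. The sample is an excerpt of the server output in this post.

```python
def parse_consul_info(text: str) -> dict:
    """Parse `consul info` text into {section: {key: value}}.

    Works whether or not the keys are indented, since each line is
    stripped before it is inspected. All values are kept as strings.
    """
    sections = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if "=" not in line and line.endswith(":"):
            # Section header such as "serf_wan:"
            current = sections.setdefault(line[:-1], {})
        elif current is not None and "=" in line:
            # Split on the first "=" only; values may contain "=" or ":"
            key, _, value = line.partition("=")
            current[key.strip()] = value.strip()
    return sections


# Excerpt of the server output shown above.
sample = """\
build:
    prerelease =
    version = 1.11.1
serf_wan:
    failed = 0
    health_score = 7
    members = 6
"""

info = parse_consul_info(sample)
print(info["serf_wan"]["members"], info["serf_wan"]["health_score"])
```

Capturing this once while federation is healthy and again after the restart gives a direct diff of the WAN member count.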

Operating system and Environment details

CentOS7
Consul version: v1.11.1

Conclusion: restarting the Consul service on a primary server appears to break federation between the two clusters.
Thanks in advance.

With Regards,
Bankat Vikhe

bvikhe1 · April 1, 2022, 12:44pm

When we restart the Consul service, it seems to look for the pod IP instead of going through the mesh gateway. The proxy defaults are set correctly. If we delete the pods on the secondary cluster, federation is re-established.

Mar 31 09:23:04 consul05.eng.com consul[16250]: 2022-03-31T09:23:04.800Z [INFO] agent.server.serf.wan: serf: Attempting re-join to previously known node: consul-server-1.eng-k8s: 10.244.223.149:50711
Mar 31 09:23:04 consul05.eng.com consul[16250]: 2022-03-31T09:23:04.800Z [DEBUG] agent.server.memberlist.wan: memberlist: Failed to join 10.244.223.149:50711: Remote DC has no server currently reachable
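
To collect these entries from a longer journal, the following Python sketch pairs each WAN re-join attempt with the failure reason logged for the same address. The regular expressions are written against the two log lines above only, and `summarize_wan_rejoins` is a hypothetical helper; other Consul versions may word these messages differently.

```python
import re

# Patterns derived from the two log lines quoted above.
REJOIN_RE = re.compile(
    r"Attempting re-join to previously known node: (?P<node>\S+): (?P<addr>\S+)"
)
FAILED_RE = re.compile(
    r"Failed to join (?P<addr>\S+): (?P<reason>.+)$", re.MULTILINE
)


def summarize_wan_rejoins(log_text: str) -> list:
    """Return one dict per re-join attempt with its failure reason, if any."""
    failures = {
        m["addr"]: m["reason"].strip() for m in FAILED_RE.finditer(log_text)
    }
    return [
        {"node": m["node"], "addr": m["addr"], "failure": failures.get(m["addr"])}
        for m in REJOIN_RE.finditer(log_text)
    ]


log_text = """\
Mar 31 09:23:04 consul05.eng.com consul[16250]: 2022-03-31T09:23:04.800Z [INFO] agent.server.serf.wan: serf: Attempting re-join to previously known node: consul-server-1.eng-k8s: 10.244.223.149:50711
Mar 31 09:23:04 consul05.eng.com consul[16250]: 2022-03-31T09:23:04.800Z [DEBUG] agent.server.memberlist.wan: memberlist: Failed to join 10.244.223.149:50711: Remote DC has no server currently reachable
"""

summary = summarize_wan_rejoins(log_text)
for entry in summary:
    print(entry["node"], entry["addr"], entry["failure"])
```

The sketch only extracts what the log already says; it does not determine whether the connection was routed through a mesh gateway.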

Amier · April 12, 2022, 4:32pm

Hey @bvikhe1

I just responded to the GitHub issue you raised for this.

bvikhe1 · May 12, 2022, 2:47am

Hey @Amier,

I have replied there with the required details.

Thanks
