Federation stops working after the Consul service/server is restarted on the primary cluster

Hi Team,

Can somebody help us with the issue below? We have spent almost a week trying to fix this and still have no luck.

Overview of the Issue

We have two clusters: the primary on VMs and the secondary on Kubernetes. Federation via mesh gateways is working and all communication is as expected, but if we restart any host of the primary Consul cluster, or restart the Consul service on any host of the primary cluster, federation stops working.
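For reference, the primary servers follow the standard mesh gateway WAN federation setup. A minimal sketch of the federation-relevant server settings (file paths, the gossip key, and names are placeholders, not our exact config):

    # Federation-relevant settings on each primary server (illustrative sketch).
    datacenter         = "primary_dc"
    primary_datacenter = "primary_dc"
    server             = true

    # Gossip encryption and TLS are required for mesh gateway WAN federation;
    # the key and paths below are placeholders.
    encrypt   = "<gossip_encryption_key>"
    ca_file   = "/etc/consul.d/tls/consul-agent-ca.pem"
    cert_file = "/etc/consul.d/tls/server.pem"
    key_file  = "/etc/consul.d/tls/server-key.pem"

    connect {
      enabled                            = true
      enable_mesh_gateway_wan_federation = true
    }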

Reproduction Steps

  1. Deployed the primary cluster (a 3-node cluster on virtual machines).
  2. Mesh, ingress, and terminating gateways deployed on the primary cluster.
  3. Deployed the secondary cluster (Consul Connect Helm chart on Kubernetes); a rough sketch of the resulting server configuration follows below.

To this point, everything is working.
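For completeness, here is a rough sketch of the server configuration that the Helm deployment in step 3 ends up with on the secondary side (rendered by the chart rather than written by hand; addresses, ports, and names are placeholders):

    # Approximate federation-relevant settings on the secondary (Kubernetes) servers.
    datacenter         = "secondary_dc"
    primary_datacenter = "primary_dc"
    server             = true

    # The secondary reaches the primary only through the primary's mesh gateways.
    primary_gateways = ["<primary_mesh_gateway_ip>:<port>"]

    connect {
      enabled                            = true
      enable_mesh_gateway_wan_federation = true
    }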

  4. Below is the result on server01 of the primary Consul cluster:
    #consul members -wan
    Node Address Status Type Build Protocol DC Partition Segment
    server01.pod_name.secondary_dc_name <pod_ip>: alive server 1.11.1 2 secondary_dc default
    server02.pod_name.secondary_dc_name <pod_ip>: alive server 1.11.1 2 secondary_dc default
    server03.pod_name.secondary_dc_name <pod_ip>: alive server 1.11.1 2 secondary_dc default
    server01.<server_name>.primary_dc_name : alive server 1.11.1 2 primary_dc default
    server02.<server_name>.primary_dc_name : alive server 1.11.1 2 primary_dc default
    server03.<server_name>.primary_dc_name : alive server 1.11.1 2 primary_dc default
  5. Restarted the Consul service on server01.<server_name>.primary_dc_name:

systemctl restart consul.service

  6. Now below is the result on the same server01 of the primary Consul cluster:
    #consul members -wan
    Node Address Status Type Build Protocol DC Partition Segment
    server01.<server_name>.primary_dc_name : alive server 1.11.1 2 primary_dc default

Consul info for both Client and Server

  1. Consul version: v1.11.1 (both client and server)
  2. Architecture: primary cluster on VMs and secondary cluster on Kubernetes.
  3. Federation is enabled using mesh gateways on both clusters.
  4. Consul Connect Helm chart version: 0.39.0

Output from the client 'consul info' command:

agent:
    check_monitors = 0
    check_ttls = 0
    checks = 1
    services = 1
build:
    prerelease =
    revision = 2c56447
    version = 1.11.1
consul:
    acl = disabled
    known_servers = 3
    server = false
runtime:
    arch = amd64
    cpu_count = 4
    goroutines = 115
    max_procs = 4
    os = linux
    version = go1.17.5
serf_lan:
    coordinate_resets = 0
    encrypted = true
    event_queue = 0
    event_time = 11
    failed = 0
    health_score = 0
    intent_queue = 0
    left = 0
    member_time = 64
    members = 8
    query_queue = 0
    query_time = 1

Server info

Output from the server 'consul info' command:

agent:
    check_monitors = 0
    check_ttls = 0
    checks = 0
    services = 0
build:
    prerelease =
    revision = 2c56447
    version = 1.11.1
consul:
    acl = enabled
    bootstrap = false
    known_datacenters = 2
    leader = false
    leader_addr = 10.44.182.126:8300
    server = true
raft:
    applied_index = 6792
    commit_index = 6792
    fsm_pending = 0
    last_contact = 7.283558ms
    last_log_index = 6792
    last_log_term = 11
    last_snapshot_index = 0
    last_snapshot_term = 0
    latest_configuration = [{Suffrage:Voter ID:f70d584f-b24d-d5e6-6bc8-78f75ef4de90 Address:10.44.182.125:8300} {Suffrage:Voter ID:c25e8171-86d3-0e88-a2e5-36865157b4a0 Address:10.44.182.126:8300} {Suffrage:Voter ID:270f5410-3ea3-79e0-441a-d8451677084e Address:10.44.182.127:8300}]
    latest_configuration_index = 0
    num_peers = 2
    protocol_version = 3
    protocol_version_max = 3
    protocol_version_min = 0
    snapshot_version_max = 1
    snapshot_version_min = 0
    state = Follower
    term = 11
runtime:
    arch = amd64
    cpu_count = 4
    goroutines = 195
    max_procs = 4
    os = linux
    version = go1.17.5
serf_lan:
    coordinate_resets = 0
    encrypted = true
    event_queue = 0
    event_time = 11
    failed = 0
    health_score = 0
    intent_queue = 0
    left = 0
    member_time = 64
    members = 8
    query_queue = 0
    query_time = 1
serf_wan:
    coordinate_resets = 0
    encrypted = true
    event_queue = 0
    event_time = 1
    failed = 0
    health_score = 7
    intent_queue = 0
    left = 0
    member_time = 17
    members = 6
    query_queue = 0
    query_time = 1

Operating system and Environment details

  1. CentOS 7
  2. Consul version: v1.11.1
  3. Architecture: primary cluster on VMs and secondary cluster on Kubernetes.
  4. Federation is enabled using mesh gateways on both clusters.
  5. Consul Connect Helm chart version: 0.39.0

Conclusion: it seems that restarting the Consul service on the primary cluster results in losing federation between the two clusters.
Thanks in advance.

With Regards,
Bankat Vikhe

When we restart the Consul service, it seems to look for the pod IP instead of going through the mesh gateway. The default proxy (proxy-defaults) is set correctly; a sketch of what we mean is included after the log lines below. If we delete the pods on the secondary cluster, federation is re-established again.

Mar 31 09:23:04 consul05.eng.com consul[16250]: 2022-03-31T09:23:04.800Z [INFO] agent.server.serf.wan: serf: Attempting re-join to previously known node: consul-server-1.eng-k8s: 10.244.223.149:50711
Mar 31 09:23:04 consul05.eng.com consul[16250]: 2022-03-31T09:23:04.800Z [DEBUG] agent.server.memberlist.wan: memberlist: Failed to join 10.244.223.149:50711: Remote DC has no server currently reachable
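By "default proxy" we mean the global proxy-defaults config entry applied with consul config write. A sketch of it, assuming the standard mesh gateway mode setting (the mode value shown is an example, not necessarily our exact value):

    # proxy-defaults config entry (sketch); "local" routes traffic to remote
    # datacenters through the local datacenter's mesh gateway.
    Kind = "proxy-defaults"
    Name = "global"

    MeshGateway {
      Mode = "local"
    }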

Hey @bvikhe1

I just responded to the GitHub issue you raised for this :slight_smile:


Hey @Amier,

Replied with the required details.

Thanks