Hi Team,
Could somebody help us with the issue below? We have spent almost a week trying to fix it, with no luck.
Overview of the Issue
We have two clusters: a primary on VMs and a secondary on Kubernetes. Federation via mesh gateways works and all communication behaves as expected. However, if we restart any host of the primary Consul cluster, or restart the Consul service on any of its hosts, federation stops working.
Reproduction Steps
- Deployed the primary cluster (3-node cluster on virtual machines).
- Deployed mesh, ingress, and terminating gateways on the primary cluster.
- Deployed the secondary cluster (Consul Connect Helm chart on Kubernetes).
Up to this point, everything works.
- Below is the output on server01 of the primary Consul cluster:
# consul members -wan
Node Address Status Type Build Protocol DC Partition Segment
server01.pod_name.secondary_dc_name <pod_ip>: alive server 1.11.1 2 secondary_dc default
server02.pod_name.secondary_dc_name <pod_ip>: alive server 1.11.1 2 secondary_dc default
server03.pod_name.secondary_dc_name <pod_ip>: alive server 1.11.1 2 secondary_dc default
server01.<server_name>.primary_dc_name : alive server 1.11.1 2 primary_dc default
server02.<server_name>.primary_dc_name : alive server 1.11.1 2 primary_dc default
server03.<server_name>.primary_dc_name : alive server 1.11.1 2 primary_dc default
- Restarted the Consul service on server01.<server_name>.primary_dc_name:
systemctl restart consul.service
- Now, below is the output on the same server01 of the primary Consul cluster:
# consul members -wan
Node Address Status Type Build Protocol DC Partition Segment
server01.<server_name>.primary_dc_name : alive server 1.11.1 2 primary_dc default
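For anyone comparing the two outputs above, a minimal sketch of a helper that counts alive WAN members per datacenter may make the difference easier to see. It assumes the column layout shown in the header above (Status in column 3, DC in column 7) and unredacted addresses; `count_wan_by_dc` is a hypothetical name, not part of Consul.

```shell
# Hypothetical helper: count alive WAN members per datacenter.
# Reads `consul members -wan` output on stdin; assumes Status is
# column 3 and DC is column 7, as in the header shown above.
count_wan_by_dc() {
  awk '
    $1 == "Node" { next }            # skip the header row
    $3 == "alive" { dc[$7]++ }       # tally alive members by DC column
    END { for (d in dc) print d, dc[d] }
  ' | sort
}

# Usage (on a Consul server):
#   consul members -wan | count_wan_by_dc
```

Running this before and after the restart should show whether the secondary datacenter's servers disappear from the restarted server's WAN view, as in the output above.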
Consul info for both Client and Server
- Consul version: v1.11.1 (both client and server)
- Architecture: primary cluster on VMs and secondary cluster on Kubernetes.
- Federation is enabled using mesh gateways on both clusters.
- Consul Connect Helm chart version: 0.39.0
Output of ‘consul info’ on a client:
agent:
    check_monitors = 0
    check_ttls = 0
    checks = 1
    services = 1
build:
    prerelease =
    revision = 2c56447
    version = 1.11.1
consul:
    acl = disabled
    known_servers = 3
    server = false
runtime:
    arch = amd64
    cpu_count = 4
    goroutines = 115
    max_procs = 4
    os = linux
    version = go1.17.5
serf_lan:
    coordinate_resets = 0
    encrypted = true
    event_queue = 0
    event_time = 11
    failed = 0
    health_score = 0
    intent_queue = 0
    left = 0
    member_time = 64
    members = 8
    query_queue = 0
    query_time = 1
Server info
Output of ‘consul info’ on a server:
agent:
    check_monitors = 0
    check_ttls = 0
    checks = 0
    services = 0
build:
    prerelease =
    revision = 2c56447
    version = 1.11.1
consul:
    acl = enabled
    bootstrap = false
    known_datacenters = 2
    leader = false
    leader_addr = 10.44.182.126:8300
    server = true
raft:
    applied_index = 6792
    commit_index = 6792
    fsm_pending = 0
    last_contact = 7.283558ms
    last_log_index = 6792
    last_log_term = 11
    last_snapshot_index = 0
    last_snapshot_term = 0
    latest_configuration = [{Suffrage:Voter ID:f70d584f-b24d-d5e6-6bc8-78f75ef4de90 Address:10.44.182.125:8300} {Suffrage:Voter ID:c25e8171-86d3-0e88-a2e5-36865157b4a0 Address:10.44.182.126:8300} {Suffrage:Voter ID:270f5410-3ea3-79e0-441a-d8451677084e Address:10.44.182.127:8300}]
    latest_configuration_index = 0
    num_peers = 2
    protocol_version = 3
    protocol_version_max = 3
    protocol_version_min = 0
    snapshot_version_max = 1
    snapshot_version_min = 0
    state = Follower
    term = 11
runtime:
    arch = amd64
    cpu_count = 4
    goroutines = 195
    max_procs = 4
    os = linux
    version = go1.17.5
serf_lan:
    coordinate_resets = 0
    encrypted = true
    event_queue = 0
    event_time = 11
    failed = 0
    health_score = 0
    intent_queue = 0
    left = 0
    member_time = 64
    members = 8
    query_queue = 0
    query_time = 1
serf_wan:
    coordinate_resets = 0
    encrypted = true
    event_queue = 0
    event_time = 1
    failed = 0
    health_score = 7
    intent_queue = 0
    left = 0
    member_time = 17
    members = 6
    query_queue = 0
    query_time = 1
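To compare a server's WAN view before and after a restart without reading the whole dump, the relevant counters (for example `members` under `serf_wan`, or `known_datacenters` under `consul`) can be pulled out of the `consul info` output. The following is only a sketch: `info_get` is a hypothetical helper name, and it assumes the "section:" followed by "key = value" layout shown above.

```shell
# Hypothetical helper: print one value from `consul info` output on stdin.
# Usage: consul info | info_get SECTION KEY
# Only the first whitespace-delimited token of the value is printed,
# which is enough for counters such as members or known_datacenters.
info_get() {
  awk -v section="$1" -v key="$2" '
    /^[A-Za-z_]+:[ \t]*$/ {          # a section line such as "serf_wan:"
      sub(/:[ \t]*$/, "")
      current = $0
      next
    }
    current == section && $1 == key && $2 == "=" { print $3; exit }
  '
}

# Usage (on a Consul server):
#   consul info | info_get serf_wan members
#   consul info | info_get consul known_datacenters
```

With the server output above, the first call would print 6 and the second 2; a drop in either value after the restart would match the symptom described.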
Operating system and Environment details
- CentOS 7
- Consul version: v1.11.1
- Architecture: primary cluster on VMs and secondary cluster on Kubernetes.
- Federation is enabled using mesh gateways on both clusters.
- Consul Connect Helm chart version: 0.39.0
Conclusion: It seems that restarting the Consul service on a primary server breaks federation between the two clusters. After the restart, the restarted server lists only itself in the WAN member list.
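To check whether the loss is permanent or the WAN pool re-forms after some time, the member count can be polled after the restart. This is a rough sketch under assumptions: `wait_for_wan` and the `MEMBERS_CMD` override are hypothetical names introduced here for illustration, and the Status column is assumed to be column 3 as in the output above.

```shell
# Hypothetical helper: poll until at least EXPECTED WAN members are alive,
# or give up after TIMEOUT seconds. Returns 0 on success, 1 on timeout.
# MEMBERS_CMD is an assumed override for testing; it defaults to
# `consul members -wan`.
wait_for_wan() {
  expected=$1
  timeout=$2
  interval=${3:-5}
  elapsed=0
  while :; do
    alive=$(${MEMBERS_CMD:-consul members -wan} 2>/dev/null |
      awk '$3 == "alive" { n++ } END { print n + 0 }')
    echo "elapsed=${elapsed}s alive_wan_members=${alive}"
    [ "$alive" -ge "$expected" ] && return 0
    [ "$elapsed" -ge "$timeout" ] && return 1
    sleep "$interval"
    elapsed=$((elapsed + interval))
  done
}

# Usage (on the restarted server; 6 members were expected in our setup):
#   systemctl restart consul.service
#   wait_for_wan 6 300 10
```

A timeout here would suggest the server does not rejoin the WAN pool on its own, which is what we observe.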
Thanks in advance.
With Regards,
Bankat Vikhe