Hi,
I’ve set up two federated clusters following the "Federation Between Kubernetes Clusters | Consul by HashiCorp" instructions. Everything seems to be working, and the two Consul clusters communicate properly via the mesh gateways. The problem appears when I try to make two services communicate across datacenters.
The problem is similar to "Problems with Consul Connect + Mesh Gateways", but that thread never got a solution.
The config is as follows:
- consul v1.11.1
- Two WAN connected datacenters
- ACL enabled + replication
- TLS enabled
- Connect enabled
- Envoy mesh gateways deployed in dc1 and dc2
- Connection in both directions for gateways
- Both gateways show as healthy and pass all checks
- Services also show as healthy and pass all checks
- I can list services from opposite datacenters
I deployed static-client in DC1 and static-server in DC2 as per "Secure Service Mesh Communication Across Kubernetes Clusters | Consul - HashiCorp Learn", but static-client fails to communicate with static-server. Querying the clusters on the static-client Envoy shows that the remote static-server cluster has health_flags::/failed_eds_health set, which comes from Consul.
I set up a debug environment and traced it to the code below in agent/xds/endpoints.go:
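For anyone who wants to check the same thing, here is a minimal Go sketch of picking the affected endpoints out of the text returned by the Envoy admin `/clusters` endpoint. The sample output, cluster name, and addresses are made up for illustration, and `failedEDS` is a hypothetical helper, not part of Consul or Envoy; the real text would come from the sidecar's admin interface.

```go
package main

import (
	"fmt"
	"strings"
)

// failedEDS returns the "cluster::address" prefix of every line in Envoy
// /clusters text output whose health_flags include failed_eds_health.
// Hypothetical helper for illustration only.
func failedEDS(clustersOutput string) []string {
	const marker = "::health_flags::"
	var failed []string
	for _, line := range strings.Split(clustersOutput, "\n") {
		line = strings.TrimSpace(line)
		idx := strings.Index(line, marker)
		if idx < 0 {
			continue // not a health_flags line
		}
		flags := line[idx+len(marker):]
		if strings.Contains(flags, "failed_eds_health") {
			failed = append(failed, line[:idx])
		}
	}
	return failed
}

func main() {
	// Made-up sample; real cluster names and addresses will differ.
	sample := `static-server.default.dc2.internal.example.consul::10.0.0.5:443::health_flags::/failed_eds_health
local_app::127.0.0.1:8080::health_flags::healthy
local_app::127.0.0.1:8080::weight::1`
	for _, ep := range failedEDS(sample) {
		fmt.Println("unhealthy via EDS:", ep)
	}
}
```

With real data, only the remote static-server endpoints would be expected to show up in this list, since local services are reported healthy.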
```go
// Default stays UNHEALTHY unless a healthy endpoint is found below.
overallHealth := envoy_core_v3.HealthStatus_UNHEALTHY
for _, ep := range realEndpoints {
	health, _ := calculateEndpointHealthAndWeight(ep, target.Subset.OnlyPassing)
	if health == envoy_core_v3.HealthStatus_HEALTHY {
		overallHealth = envoy_core_v3.HealthStatus_HEALTHY
		break
	}
}
```
What I see is that realEndpoints is an empty array, so the for loop is never entered and the health of the remote endpoint always stays unhealthy, which is what gets returned to Envoy. I recompiled with the initial value set to HealthStatus_HEALTHY and everything started to work: the two services can communicate. Obviously that is not a proper solution. So far I have not figured out where realEndpoints for the remote service comes from.
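To make the behaviour concrete, here is a small self-contained Go sketch of the same loop shape. `HealthStatus`, `endpoint`, and `overallHealth` are simplified stand-ins I made up for illustration; they are not Consul's actual types or functions.

```go
package main

import "fmt"

// HealthStatus is a simplified stand-in for envoy_core_v3.HealthStatus.
type HealthStatus int

const (
	Unhealthy HealthStatus = iota
	Healthy
)

func (h HealthStatus) String() string {
	if h == Healthy {
		return "HEALTHY"
	}
	return "UNHEALTHY"
}

// endpoint is a hypothetical stand-in for a service instance and its check result.
type endpoint struct {
	addr    string
	passing bool
}

// overallHealth mirrors the shape of the quoted loop: start UNHEALTHY and
// flip to HEALTHY on the first healthy endpoint.
func overallHealth(realEndpoints []endpoint) HealthStatus {
	overall := Unhealthy
	for _, ep := range realEndpoints {
		if ep.passing {
			overall = Healthy
			break
		}
	}
	return overall
}

func main() {
	// An empty slice never enters the loop, so the default is returned.
	fmt.Println("no endpoints:", overallHealth(nil)) // no endpoints: UNHEALTHY
	one := []endpoint{{addr: "10.0.0.5:443", passing: true}}
	fmt.Println("one passing endpoint:", overallHealth(one)) // one passing endpoint: HEALTHY
}
```

This suggests the health calculation itself is doing what it was written to do, and the real question is why the endpoint list for the remote service is empty.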
So it looks like the remote endpoints are somehow not populated in the proxy snapshot, maybe due to a misconfiguration or a bug in the code. By the way, services in the same datacenter communicate just fine.
Any insight into where to look would be appreciated.
Thanks