Service to service communication problems in federated Kubernetes clusters

Hi,

I’ve set up two federated clusters following the Federation Between Kubernetes Clusters | Consul by HashiCorp instructions. Everything seems to be working: the two Consul clusters communicate properly via the mesh gateways. The problem comes when I try to make two services communicate across datacenters.
The problem is similar to Problems with Consul Connect + Mesh Gateways, but that thread ended without a solution.

The configuration is as follows:

  • consul v1.11.1
  • Two WAN connected datacenters
  • ACL enabled + replication
  • TLS enabled
  • Connect enabled
  • Envoy mesh gateways deployed in dc1 and dc2
  • Gateway connectivity verified in both directions
  • Both gateways show healthy and pass all checks
  • Services also show healthy and pass all checks
  • I can list services from opposite datacenters

Deployed static-client in DC1 and static-server in DC2 as per Secure Service Mesh Communication Across Kubernetes Clusters | Consul - HashiCorp Learn, but static-client fails to communicate with static-server. Querying the clusters on the static-client Envoy shows that the remote static-server cluster has health_flags set to /failed_eds_health, which comes from Consul.
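For anyone reproducing this, the per-endpoint health flags come from the sidecar's Envoy admin API (`GET /clusters`; 19000 is the default Envoy admin port in consul-k8s deployments, so adjust if yours differs). Below is a small illustrative Go helper, not part of Consul or Envoy, that pulls the health_flags value for each endpoint of a cluster out of that plain-text output:

```go
package main

import (
	"fmt"
	"strings"
)

// clusterHealthFlags scans the plain-text output of Envoy's admin
// /clusters endpoint (e.g. `curl localhost:19000/clusters` inside the
// sidecar) and returns the health_flags value per endpoint of the named
// cluster. Illustrative helper only, not part of Consul or Envoy.
func clusterHealthFlags(clustersOutput, clusterPrefix string) map[string]string {
	flags := make(map[string]string)
	for _, line := range strings.Split(clustersOutput, "\n") {
		// Each stat line has the form <cluster>::<ip:port>::<name>::<value>.
		parts := strings.Split(line, "::")
		if len(parts) == 4 && strings.HasPrefix(parts[0], clusterPrefix) && parts[2] == "health_flags" {
			flags[parts[1]] = parts[3]
		}
	}
	return flags
}

func main() {
	sample := "static-server.default.eu-south-1.internal.dfb720a7.consul::10.44.0.89:8443::health_flags::/failed_eds_health"
	fmt.Println(clusterHealthFlags(sample, "static-server"))
}
```

Any health_flags value other than `healthy` (such as `/failed_eds_health` here) means Envoy will not route to that endpoint, which matches the connection resets seen from curl.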

I set up a debug environment and traced it to the code below in agent/xds/endpoints.go:

	overallHealth := envoy_core_v3.HealthStatus_UNHEALTHY
	for _, ep := range realEndpoints {
		health, _ := calculateEndpointHealthAndWeight(ep, target.Subset.OnlyPassing)
		if health == envoy_core_v3.HealthStatus_HEALTHY {
			overallHealth = envoy_core_v3.HealthStatus_HEALTHY
			break
		}
	}

What I see is that realEndpoints is an empty slice, so execution never enters the for loop, the health of the remote endpoint is always left as unhealthy, and that is what gets returned to Envoy. I recompiled with the initial value set to HealthStatus_HEALTHY and everything started to work (the two services can communicate), which is obviously not a proper solution. So far I could not figure out where realEndpoints for the remote service comes from.
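To make the failure mode concrete, here is a self-contained sketch of that aggregation logic, with a local HealthStatus type standing in for envoy_core_v3.HealthStatus: when realEndpoints is empty, the loop body never executes, so the initial UNHEALTHY value is returned unconditionally.

```go
package main

import "fmt"

// HealthStatus stands in for envoy_core_v3.HealthStatus in this sketch.
type HealthStatus int

const (
	Unhealthy HealthStatus = iota
	Healthy
)

// overallHealth mirrors the aggregation in agent/xds/endpoints.go: the
// cluster is HEALTHY only if at least one real endpoint is healthy, so an
// empty slice can never flip the initial Unhealthy value.
func overallHealth(realEndpoints []HealthStatus) HealthStatus {
	overall := Unhealthy
	for _, h := range realEndpoints {
		if h == Healthy {
			overall = Healthy
			break
		}
	}
	return overall
}

func main() {
	fmt.Println(overallHealth(nil))                                // 0 (Unhealthy): no endpoints at all
	fmt.Println(overallHealth([]HealthStatus{Unhealthy, Healthy})) // 1 (Healthy): one healthy endpoint suffices
}
```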

So it looks like the remote endpoints are somehow not populated in the proxy snapshot, perhaps due to a misconfiguration, or there is a bug in the code. By the way, services in the same datacenter communicate just fine.

Any insight where to look would be appreciated.

Thanks

Hey @tcherkv

First I wanted to confirm a couple of things.

Have you configured the mesh gateway mode to be “local” so that the communication between services goes through the gateways? See Federation Between Kubernetes Clusters | Consul by HashiCorp.

Another thing that might be causing problems is not having an intention for the services. Since you have ACLs enabled, by default services are not allowed to talk to each other unless you create an intention allowing it.
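As a quick check that the intention actually allows the traffic, Consul exposes GET /v1/connect/intentions/check (the CLI equivalent is `consul intention check static-client static-server`). A minimal sketch of building that request URL, assuming the HTTP API listens on the default 127.0.0.1:8500:

```go
package main

import (
	"fmt"
	"net/url"
)

// intentionCheckURL builds the URL for Consul's intention-check endpoint,
// which reports whether a connection from source to target would be
// authorized. The 127.0.0.1:8500 address used below is Consul's default
// HTTP API address; adjust it (and supply an ACL token) for your setup.
func intentionCheckURL(addr, source, target string) string {
	q := url.Values{}
	q.Set("source", source)
	q.Set("target", target)
	return addr + "/v1/connect/intentions/check?" + q.Encode()
}

func main() {
	fmt.Println(intentionCheckURL("http://127.0.0.1:8500", "static-client", "static-server"))
}
```

The endpoint returns a JSON body with an `Allowed` boolean, so a `false` there would point at an intentions/ACL problem rather than the xDS health issue.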

Hi,

Yes, it is configured as local in ProxyDefaults, and I have an intention:

apiVersion: consul.hashicorp.com/v1alpha1
kind: ProxyDefaults
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"consul.hashicorp.com/v1alpha1","kind":"ProxyDefaults","metadata":{"annotations":{},"labels":{"app.kubernetes.io/instance":"consul-aws-us-east-1-swarmio-matrix-dev"},"name":"global","namespace":"consul"},"spec":{"meshGateway":{"mode":"local"}}}
  creationTimestamp: "2022-01-04T21:43:47Z"
  finalizers:
  - finalizers.consul.hashicorp.com
  generation: 3
  labels:
    app.kubernetes.io/instance: consul-aws-us-east-1-swarmio-matrix-dev
  name: global
  namespace: consul
  resourceVersion: "11471181"
  uid: 911c0cc5-7462-4140-9c6e-7754c777885f
spec:
  expose: {}
  meshGateway:
    mode: local
status:
  conditions:
  - lastTransitionTime: "2022-01-14T18:48:29Z"
    status: "True"
    type: Synced
  lastSyncedTime: "2022-01-14T18:48:29Z"
---
apiVersion: consul.hashicorp.com/v1alpha1
kind: ServiceIntentions
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"consul.hashicorp.com/v1alpha1","kind":"ServiceIntentions","metadata":{"annotations":{},"name":"static-client-to-static-server","namespace":"default"},"spec":{"destination":{"name":"static-server"},"sources":[{"action":"allow","name":"static-client"}]}}
  creationTimestamp: "2021-12-31T17:36:53Z"
  finalizers:
  - finalizers.consul.hashicorp.com
  generation: 1
  name: static-client-to-static-server
  namespace: default
  resourceVersion: "11471206"
  uid: b69ee3fa-b59f-420d-a03c-c65875304444
spec:
  destination:
    name: static-server
  sources:
  - action: allow
    name: static-client
status:
  conditions:
  - lastTransitionTime: "2022-01-14T18:48:33Z"
    status: "True"
    type: Synced
  lastSyncedTime: "2022-01-14T18:48:33Z"

And when forcing health in the code, the client communicates and the mesh gateway IP address/port is populated in the Envoy cluster:

static-server.default.eu-south-1.internal.dfb720a7-b62c-feea-1e30-9a68a9957bc8.consul::10.44.0.89:8443::hostname::
static-server.default.eu-south-1.internal.dfb720a7-b62c-feea-1e30-9a68a9957bc8.consul::10.44.0.89:8443::health_flags::healthy
static-server.default.eu-south-1.internal.dfb720a7-b62c-feea-1e30-9a68a9957bc8.consul::10.44.0.89:8443::weight::1
static-server.default.eu-south-1.internal.dfb720a7-b62c-feea-1e30-9a68a9957bc8.consul::10.44.0.89:8443::region::
static-server.default.eu-south-1.internal.dfb720a7-b62c-feea-1e30-9a68a9957bc8.consul::10.44.0.89:8443::zone::
static-server.default.eu-south-1.internal.dfb720a7-b62c-feea-1e30-9a68a9957bc8.consul::10.44.0.89:8443::sub_zone::
static-server.default.eu-south-1.internal.dfb720a7-b62c-feea-1e30-9a68a9957bc8.consul::10.44.0.89:8443::canary::false
static-server.default.eu-south-1.internal.dfb720a7-b62c-feea-1e30-9a68a9957bc8.consul::10.44.0.89:8443::priority::0
static-server.default.eu-south-1.internal.dfb720a7-b62c-feea-1e30-9a68a9957bc8.consul::10.44.0.89:8443::success_rate::-1.0
static-server.default.eu-south-1.internal.dfb720a7-b62c-feea-1e30-9a68a9957bc8.consul::10.44.0.89:8443::local_origin_success_rate::-1.0
bash-4.4# 
bash-4.4# curl http://localhost:8080
"hello world"
bash-4.4# 

When I revert the source code changes, I get the following:

static-server.default.eu-south-1.internal.dfb720a7-b62c-feea-1e30-9a68a9957bc8.consul::10.44.0.89:8443::hostname::
static-server.default.eu-south-1.internal.dfb720a7-b62c-feea-1e30-9a68a9957bc8.consul::10.44.0.89:8443::health_flags::/failed_eds_health
static-server.default.eu-south-1.internal.dfb720a7-b62c-feea-1e30-9a68a9957bc8.consul::10.44.0.89:8443::weight::1
static-server.default.eu-south-1.internal.dfb720a7-b62c-feea-1e30-9a68a9957bc8.consul::10.44.0.89:8443::region::
static-server.default.eu-south-1.internal.dfb720a7-b62c-feea-1e30-9a68a9957bc8.consul::10.44.0.89:8443::zone::
static-server.default.eu-south-1.internal.dfb720a7-b62c-feea-1e30-9a68a9957bc8.consul::10.44.0.89:8443::sub_zone::
static-server.default.eu-south-1.internal.dfb720a7-b62c-feea-1e30-9a68a9957bc8.consul::10.44.0.89:8443::canary::false
static-server.default.eu-south-1.internal.dfb720a7-b62c-feea-1e30-9a68a9957bc8.consul::10.44.0.89:8443::priority::0
static-server.default.eu-south-1.internal.dfb720a7-b62c-feea-1e30-9a68a9957bc8.consul::10.44.0.89:8443::success_rate::-1.0
static-server.default.eu-south-1.internal.dfb720a7-b62c-feea-1e30-9a68a9957bc8.consul::10.44.0.89:8443::local_origin_success_rate::-1.0
bash-4.4# 
bash-4.4# curl http://localhost:8080
curl: (56) Recv failure: Connection reset by peer
bash-4.4# 

The IP/port of the mesh gateway is still populated, but the health is set to /failed_eds_health.

When I set the mesh mode to remote, the IP of the mesh gateway isn’t populated, which tells me my mesh settings are being applied properly:

static-server.default.eu-south-1.internal.dfb720a7-b62c-feea-1e30-9a68a9957bc8.consul::outlier::success_rate_average::-1
static-server.default.eu-south-1.internal.dfb720a7-b62c-feea-1e30-9a68a9957bc8.consul::outlier::success_rate_ejection_threshold::-1
static-server.default.eu-south-1.internal.dfb720a7-b62c-feea-1e30-9a68a9957bc8.consul::outlier::local_origin_success_rate_average::-1
static-server.default.eu-south-1.internal.dfb720a7-b62c-feea-1e30-9a68a9957bc8.consul::outlier::local_origin_success_rate_ejection_threshold::-1
static-server.default.eu-south-1.internal.dfb720a7-b62c-feea-1e30-9a68a9957bc8.consul::default_priority::max_connections::1024
static-server.default.eu-south-1.internal.dfb720a7-b62c-feea-1e30-9a68a9957bc8.consul::default_priority::max_pending_requests::1024
static-server.default.eu-south-1.internal.dfb720a7-b62c-feea-1e30-9a68a9957bc8.consul::default_priority::max_requests::1024
static-server.default.eu-south-1.internal.dfb720a7-b62c-feea-1e30-9a68a9957bc8.consul::default_priority::max_retries::3
static-server.default.eu-south-1.internal.dfb720a7-b62c-feea-1e30-9a68a9957bc8.consul::high_priority::max_connections::1024
static-server.default.eu-south-1.internal.dfb720a7-b62c-feea-1e30-9a68a9957bc8.consul::high_priority::max_pending_requests::1024
static-server.default.eu-south-1.internal.dfb720a7-b62c-feea-1e30-9a68a9957bc8.consul::high_priority::max_requests::1024
static-server.default.eu-south-1.internal.dfb720a7-b62c-feea-1e30-9a68a9957bc8.consul::high_priority::max_retries::3
static-server.default.eu-south-1.internal.dfb720a7-b62c-feea-1e30-9a68a9957bc8.consul::added_via_api::true

It is clearly something related to calculating the endpoint health, which xDS derives from the health of the mesh gateway and realEndpoints, with the realEndpoints health not being available or not populated properly in the client datacenter.

Thanks

Hi,

Attaching the trace log from the Consul client in the static-client (source) datacenter:

consul-client-trace.txt (92.8 KB)

Also, here is the result of an API request to v1/health/connect/static-server?dc=eu-south-1:

[
  {
    "Node": {
      "ID": "31cc238a-748e-cc29-393c-03fb8fdb1e82",
      "Node": "eu-south-1-swarmio-matrix-dev-002.swarmio.internal",
      "Address": "10.42.1.4",
      "Datacenter": "eu-south-1",
      "TaggedAddresses": {
        "lan": "10.42.1.4",
        "lan_ipv4": "10.42.1.4",
        "wan": "10.42.1.4",
        "wan_ipv4": "10.42.1.4"
      },
      "Meta": {
        "consul-network-segment": "",
        "host-ip": "10.10.1.51",
        "pod-name": "eu-south-1-matrix-dev-q4t4f"
      },
      "CreateIndex": 46,
      "ModifyIndex": 49
    },
    "Service": {
      "Kind": "connect-proxy",
      "ID": "static-server-789bbd78bd-zwvft-static-server-sidecar-proxy",
      "Service": "static-server-sidecar-proxy",
      "Tags": [],
      "Address": "10.42.1.98",
      "TaggedAddresses": {
        "consul-virtual": {
          "Address": "240.0.0.3",
          "Port": 20000
        },
        "lan_ipv4": {
          "Address": "10.42.1.98",
          "Port": 20000
        },
        "wan_ipv4": {
          "Address": "10.42.1.98",
          "Port": 20000
        }
      },
      "Meta": {
        "k8s-namespace": "default",
        "k8s-service-name": "static-server",
        "managed-by": "consul-k8s-endpoints-controller",
        "pod-name": "static-server-789bbd78bd-zwvft"
      },
      "Port": 20000,
      "Weights": {
        "Passing": 1,
        "Warning": 1
      },
      "EnableTagOverride": false,
      "Proxy": {
        "DestinationServiceName": "static-server",
        "DestinationServiceID": "static-server-789bbd78bd-zwvft-static-server",
        "LocalServiceAddress": "127.0.0.1",
        "LocalServicePort": 8080,
        "Mode": "",
        "MeshGateway": {
          "Mode": "local"
        },
        "Expose": {}
      },
      "Connect": {},
      "CreateIndex": 372579,
      "ModifyIndex": 372579
    },
    "Checks": [
      {
        "Node": "eu-south-1-swarmio-matrix-dev-002.swarmio.internal",
        "CheckID": "serfHealth",
        "Name": "Serf Health Status",
        "Status": "passing",
        "Notes": "",
        "Output": "Agent alive and reachable",
        "ServiceID": "",
        "ServiceName": "",
        "ServiceTags": [],
        "Type": "",
        "Interval": "",
        "Timeout": "",
        "ExposedPort": 0,
        "Definition": {},
        "CreateIndex": 46,
        "ModifyIndex": 46
      },
      {
        "Node": "eu-south-1-swarmio-matrix-dev-002.swarmio.internal",
        "CheckID": "service:static-server-789bbd78bd-zwvft-static-server-sidecar-proxy:1",
        "Name": "Proxy Public Listener",
        "Status": "passing",
        "Notes": "",
        "Output": "TCP connect 10.42.1.98:20000: Success",
        "ServiceID": "static-server-789bbd78bd-zwvft-static-server-sidecar-proxy",
        "ServiceName": "static-server-sidecar-proxy",
        "ServiceTags": [],
        "Type": "tcp",
        "Interval": "",
        "Timeout": "",
        "ExposedPort": 0,
        "Definition": {},
        "CreateIndex": 372579,
        "ModifyIndex": 372590
      },
      {
        "Node": "eu-south-1-swarmio-matrix-dev-002.swarmio.internal",
        "CheckID": "service:static-server-789bbd78bd-zwvft-static-server-sidecar-proxy:2",
        "Name": "Destination Alias",
        "Status": "passing",
        "Notes": "",
        "Output": "All checks passing.",
        "ServiceID": "static-server-789bbd78bd-zwvft-static-server-sidecar-proxy",
        "ServiceName": "static-server-sidecar-proxy",
        "ServiceTags": [],
        "Type": "alias",
        "Interval": "",
        "Timeout": "",
        "ExposedPort": 0,
        "Definition": {},
        "CreateIndex": 372579,
        "ModifyIndex": 372588
      }
    ]
  }
]

Hi.
I’m facing a very similar issue. Did you ever get this resolved, and if so, how?