Issue with the mesh gateway (VM) in a WAN-federated Consul cluster (k8s <-> VM)

Hello,

Context

I set up a WAN-federated Consul cluster with mesh gateways between Kubernetes (the primary datacenter) and VMs. I am not able to call services hosted on the VMs from Kubernetes, and I can only call services hosted on Kubernetes from the VMs when the mesh gateway mode is "remote".
The issue seems to be the mesh gateway on the VMs. Let me explain my setup and my problem:

All servers of both clusters are on the same network.
On Kubernetes everything is working fine; I installed Consul via Helm. Only ACLs and gossip encryption aren't activated yet.
I have two Consul datacenters: kube-dev-dc1 on the Kubernetes cluster, and dc1 on the VMs.

Consul deployment with Helm

I don't use the latest versions of the Helm chart, consul-k8s, and Envoy, in order to have the same Consul version on k8s and on the VMs. (I deployed Envoy 1.18.2 from the Tetrate repos.)

Consul Helm chart version: 0.31.1.
Helm values:

connectInject:
  enabled: true
controller:
  enabled: true
global:
  datacenter: kube-dev-dc1
  federation:
    createFederationSecret: true
    enabled: true
    primaryDatacenter: kube-dev-dc1
  image: my-private-repos/hashicorp/consul:1.10.0
  imageEnvoy: my-private-repos/envoyproxy/envoy-alpine:v1.18.2
  imageK8S: my-private-repos/hashicorp/consul-k8s:0.25.0
  imagePullSecrets:
  - name: artifactory-cred
  logJSON: true
  logLevel: debug
  name: consul-dev
  serverAdditionalDNSSANs:
  - consul-dev-server.consul.svc.cluster.local
  tls:
    enableAutoEncrypt: true
    enabled: true
    verify: true
meshGateway:
  enabled: true
  replicas: 1
  service:
    enabled: true
    nodePort: 30001
    port: 8443
    type: NodePort
  wanAddress:
    service:
      nodePort: 30000
      port: 8443
      type: NodePort
server:
  replicas: 3
  storageClass: trident-storageclass-nas-core-dev
ui:
  enabled: true
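
For reference, the chart is installed with the version pinned, along these lines (the release name and namespace here are just example placeholders):

$ helm install consul-dev hashicorp/consul --version 0.31.1 \
    --namespace consul --values values.yaml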

Services on k8s

I deployed 2 services with sidecar injection:
svc_kube_client: a curl image used to call the other services on k8s and on the VMs
svc_kube_server: an http-echo image that answers "hello I am svc_kube_server" (to keep this topic short it is not described here; I used the same code as the static-server example at https://www.consul.io/docs/k8s/connect)

Here is the configuration of the client on Kubernetes:

ServiceAccount
apiVersion: v1
kind: ServiceAccount
metadata:
  name: svc_kube_client
---
 
apiVersion: apps/v1
kind: Deployment
metadata:
  name: svc_kube_client
spec:
  replicas: 1
  selector:
    matchLabels:
      app: svc_kube_client
  template:
    metadata:
      name: svc_kube_client
      labels:
        app: svc_kube_client
      annotations:
        consul.hashicorp.com/connect-inject: 'true'
        consul.hashicorp.com/transparent-proxy: 'false'
        consul.hashicorp.com/connect-service-upstreams: 'svc_vm_server:4520:dc1,svc_kube_server:1234:kube-dev-dc1'
        consul.hashicorp.com/connect-service: svc_kube_client
    spec:
      containers:
        - name: svc_kube_client
          image: my-private-repos/curlimages/curl:7.80.0
          command: ['/bin/sh', '-c', '--']
          args: ['while true; do sleep 30; done;']
      imagePullSecrets:
      - name: artifactory-cred
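
The manifest is applied as usual (the file name is just an example):

$ kubectl apply -f svc_kube_client.yaml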

I am able to call svc_kube_server from svc_kube_client via curl on the port exposed by the sidecar:

$ curl 127.0.0.1:1234
hello I am svc_kube_server

Consul on VMs

I have a WAN-federated Consul cluster, so on the VM side I have 3 server VMs with mesh gateways and 1 agent VM. Envoy is installed on the VMs and used as the sidecar proxy.
The Consul version installed on the VMs is 1.10.0.
The Envoy version installed on the VMs is 1.18.2 (from the Tetrate repos).

The servers are configured as follows:

datacenter = "dc1"
data_dir = "/var/lib/consul"
server = true
log_level = "INFO"
bootstrap_expect = 3

ports {
  https = 8501
  http = 8500
  grpc = 8502
}
cert_file= "/consul/cert/dc1-server-consul-0.pem"
key_file = "/consul/cert/dc1-server-consul-0-key.pem"
ca_file = "/consul/cert/consul-agent-ca.pem"
verify_incoming = true
verify_outgoing = true
verify_server_hostname = true

primary_gateways = ["1.2.3.4:30001"]
enable_central_service_config = true
primary_datacenter = "kube-dev-dc1"
connect {
  enabled = true
  enable_mesh_gateway_wan_federation = true
}
ui_config {
  enabled = true
}
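
For context, the mesh gateway itself is started on each server VM with consul connect envoy -mesh-gateway. A minimal sketch (the bind addresses and admin port here are placeholders, not my exact values):

$ consul connect envoy -mesh-gateway -register \
    -service "mesh-gateway" \
    -address "<lan-ip>:8443" \
    -wan-address "<wan-ip>:8443" \
    -expose-servers \
    -admin-bind 127.0.0.1:19003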

And the agent:

datacenter = "dc1"
data_dir = "/var/lib/consul"
server = false
log_level = "INFO"
bootstrap = false

ports {
  https = 8501
  http = 8500
  grpc = 8502
}

cert_file= "/consul/cert/dc1-client-consul-0.pem"
key_file = "/consul/cert/dc1-client-consul-0-key.pem"
ca_file = "/consul/cert/consul-agent-ca.pem"
verify_incoming = true
verify_outgoing = true
verify_server_hostname = true
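
Servers and the agent are started the standard way, e.g. (config directory assumed):

$ consul agent -config-dir=/etc/consul.d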

Services on VMs

As on the Kubernetes cluster, I deployed 2 services, a client and a server:
svc_vm_client
svc_vm_server

Both are basic Python client/server apps; I use one to call the k8s and VM services, and the other one returns "hello I am svc_vm_server".

$ cat svc_vm_client.hcl
service {
  name = "svc_vm_client"
  port = 8081
  id = "svc_vm_client_1"
  connect {
    sidecar_service {
      proxy {
        mesh_gateway {
          mode = "local"
        }
        upstreams = [
          {
            destination_name = "svc_kube_server"
            datacenter = "kube-dev-dc1"
            local_bind_port = 5005
          },
          {
            destination_name = "svc_vm_server"
            datacenter = "dc1"
            local_bind_port = 5000
          }
        ]
      }
    }
  }
}
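
The definition is registered against the local agent, e.g.:

$ consul services register svc_vm_client.hcl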

Envoy daemon

The Envoy sidecar is started by a systemd unit like this:

[Unit]
Description=svc_vm_server_sidecar
After=network.target
 
[Service]
Type=simple
# The next two variables are needed to avoid HTTP and gRPC errors
Environment=CONSUL_HTTP_ADDR=https://127.0.0.1:8501
Environment=CONSUL_GRPC_ADDR=https://127.0.0.1:8502
ExecStart=/bin/sh -c '/usr/bin/consul connect envoy -sidecar-for svc_vm_server_1 -admin-bind 127.0.0.1:19004  -ca-file=/var/lib/consul/cert/consul-agent-ca.pem -client-cert=/var/lib/consul/cert/dc1-client-consul-0.pem -client-key=/var/lib/consul/cert/dc1-client-consul-0-key.pem > /var/log/svc_vm_server_sidecar/svc_vm_server_sidecar.log 2>&1'
Restart=on-failure
SyslogIdentifier=svc_vm_server_sidecar
 
[Install]
WantedBy=multi-user.target
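
The unit is then enabled and started as usual:

$ systemctl daemon-reload
$ systemctl enable --now svc_vm_server_sidecar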

I am able to call svc_vm_server from svc_vm_client via curl, directly on its own port:

$ curl 127.0.0.1:8000
hello I am svc_vm_server

or on the port exposed by the Envoy sidecar:

$ curl 127.0.0.1:5000
hello I am svc_vm_server

Consul WAN federation test:
consul members -wan works fine and shows all expected servers as alive.
The UI shows both clusters and the expected services in each cluster.
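
That is, from the VM side:

$ consul members -wan   # all expected servers from both datacenters show as "alive"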

WAN federation tests

Ok now I want:

  • to call services hosted on kubernetes from services hosted on VMs
  • to call services hosted on VMs from services hosted on kubernetes

On Kubernetes:

I deployed a proxy-defaults config entry setting the mesh gateway mode to "local".
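
Since controller.enabled is true in the Helm values, this can be done with the ProxyDefaults custom resource; a minimal sketch of such an entry:

apiVersion: consul.hashicorp.com/v1alpha1
kind: ProxyDefaults
metadata:
  name: global
spec:
  meshGateway:
    mode: local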
Then I try to curl the server instance hosted on the VMs:

$ curl 127.0.0.1:4520 -v
*   Trying 127.0.0.1:4520...
* Connected to 127.0.0.1 (127.0.0.1) port 4520 (#0)
> GET / HTTP/1.1
> Host: 127.0.0.1:4520
> User-Agent: curl/7.80.0-DEV
> Accept: */*
>
* Recv failure: Connection reset by peer
* Closing connection 0
curl: (56) Recv failure: Connection reset by peer

I changed the mode to "remote": same result.

On VM:

On the VM I first tried with "remote" mode (in the config file):

Call to svc_kube_server on the port exposed by the sidecar:

$ curl 127.0.0.1:5005
hello I am svc_kube_server

It works fine, but my goal is to pass through both gateways, so I change mode = "remote" to mode = "local", deregister and re-register the service, restart the sidecar on the VM to pick up the change, and finally call the service hosted on Kubernetes again (see the commands sketched below).
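
Roughly (the sidecar unit name here is an assumption):

$ consul services deregister svc_vm_client.hcl
$ consul services register svc_vm_client.hcl
$ systemctl restart svc_vm_client_sidecar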

$ curl 127.0.0.1:5005 -v
curl: (56) Recv failure: Connection reset by peer
/ $ curl 127.0.0.1:5005 -v
*   Trying 127.0.0.1:5005...
* Connected to 127.0.0.1 (127.0.0.1) port 5005 (#0)
> GET / HTTP/1.1
> Host: 127.0.0.1:5005
> User-Agent: curl/7.80.0-DEV
> Accept: */*
>
* Recv failure: Connection reset by peer
* Closing connection 0
curl: (56) Recv failure: Connection reset by peer

It's the same error as the one that occurred on Kubernetes when I tried to call the service hosted on a VM.
Since it works fine in "remote" mode (bypassing the VM mesh gateway) but doesn't work in "local" mode, it means the mesh gateway doesn't work correctly, and that would also explain why calls from services on Kubernetes don't work either. However, when the curl fails I don't see any error in the logs of the mesh gateway on the VM, nor in the logs of the Consul servers, agent, or sidecars on the VMs.

I don't know what causes the issue or how to find the root cause; I am stuck.

Can you advise on how to investigate, or suggest a solution?

Thank you in advance.

Hi, thanks for the detailed report. I'd like to see how Envoy is configured at each point: VM service sidecar => VM mesh gw => kube mesh gw.

To do so, can you curl the admin port (usually localhost:19000) at /clusters for the VM service sidecar, the VM mesh gw, and the kube mesh gw, and share the results?
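
For example (the admin ports and resource names below may differ in your setup):

# VM service sidecar (your unit file binds its admin port to 19004)
$ curl -s localhost:19004/clusters

# VM mesh gateway (adjust to whatever -admin-bind you used)
$ curl -s localhost:19000/clusters

# kube mesh gateway (deployment name guessed from the Helm release)
$ kubectl exec -n consul deploy/consul-dev-mesh-gateway -- wget -qO- localhost:19000/clusters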

Hello lkysow,

Thank you for your quick reply.

Sure, you can find the curl results here:

Envoy sidecar of the app on the VM (with mode = local in the upstream config):
envoy sidecar of app mode local.txt (10.3 KB)

Envoy sidecar of the app on the VM (with mode = remote in the upstream config):
envoy sidecar of app mode remote.txt (10.5 KB)

Mesh gateway Envoy on the VM:
mesh gateway envoy sidecar on vm.txt (7.0 KB)

Mesh gateway Envoy on the Kubernetes cluster (I used wget in the container):
mesh gateway envoy sidecar on kube.txt (30.8 KB)

Thank you for your support.

Regards,
Hedi.

Hi Hedi,
So I think the problem is that the VM mesh gateway (mgw) doesn’t have any of the expected clusters that allow it to route to the Kube mesh gateway. It should have a cluster that looks like kube-dev-dc1.internal.67cec1e7-b08a-0ae4-95c8-f1ecde5c7bf9.consul.

Requests that land on this gateway should be matched by that cluster and then forwarded to the kube mesh gateway.
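
A quick way to check is to grep the VM mesh gateway's /clusters output for that name, e.g.:

$ curl -s localhost:19000/clusters | grep kube-dev-dc1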

  1. Can you run through the verification steps again (Federation Between Kubernetes Clusters | Consul by HashiCorp) and confirm that running consul catalog services -datacenter kube-dev-dc1 from the VM datacenter returns the correct list of services from kube (that should include its mesh gateway).
  2. Can you include the logs of the Consul client in the VM datacenter where the mgw is running.
  3. Can you include the logs of the VM mgw.
  4. Can you confirm that 1.2.3.4:30001 is the correct address for the VM mesh gateway.

hello,

thank you for your reply.
I will complete my answer later, but quickly:
I can confirm point 1:

consul catalog services -datacenter kube-dev-dc1
consul
svc_vm_client
svc_vm_client-sidecar-proxy
svc_vm_server
svc_vm_server-sidecar-proxy
mesh-gateway

2 & 3: I will add the logs later (ASAP).

4: "Can you confirm that 1.2.3.4:30001 is the correct address for the VM mesh gateway."
Honestly, I anonymized the logs, but I kept the mapping between real IPs and anonymized IPs.
I used the IP 1.2.3.4 in my first post, so I kept 1.2.3.4 for the IP of the Kubernetes mesh gateway.

primary_gateways = ["1.2.3.4:30001"] is set in the server config file on the VMs, and my primary is Kubernetes. I didn't restate it above, but primary_datacenter = "kube-dev-dc1" is set in the same file.
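
(The NodePort mapping on the kube side can be double-checked with something like the following; the service name is a guess based on the Helm values:)

$ kubectl -n consul get svc consul-dev-mesh-gateway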

Regards.
Hedi.

Hello lkysow,

Maybe there is a misunderstanding on point 2:
"Can you include the logs of the Consul client in the VM datacenter where the mgw is running."
=> The mesh gateways are deployed on the nodes where the Consul servers are installed.
On the node where the Consul client is deployed, there is no mesh gateway.
I tried it, but when starting Consul with this configuration on a client:
connect {
  enabled = true
  enable_mesh_gateway_wan_federation = true
}

it shows an error message indicating that the server field must be equal to true:
2022-09-05T16:15:02.066+0200 [INFO] agent: Exit code: code=0
==> 'connect.enable_mesh_gateway_wan_federation = true' requires 'server = true'

So I don't understand: did I make a mistake (mesh gateways are only on the Consul servers), or is your question 2 actually:
Can you include the logs of the Consul "server" in the VM datacenter where the mgw is running?

To be clear:

  • On VM-01 is installed:
    • consul “server” (consul with configuration for consul server + meshgateway)
    • envoy of mesh gateway
  • On VM-02 is installed:
    • consul “server” (consul with configuration for consul server + meshgateway)
    • envoy of mesh gateway
  • On VM-03 is installed:
    • consul “server” (consul with configuration for consul server + meshgateway)
    • envoy of mesh gateway
  • On VM-04 is installed:
    • consul “client” ( consul with configuration for consul client/agent)
    • svc_vm_server
    • svc_vm_client
    • envoy of svc_vm_server
    • envoy of svc_vm_client

The logs are in attachment.

Regards,
Hedi.

Mesh gateways can run on the same node as Consul servers or on a node with a Consul client. Consul servers act like Consul clients in that they can also have services and gateways registered to them.

The enable_mesh_gateway_wan_federation config only applies to Consul servers but once you have that set on all the servers, the mesh gateways can run on servers or clients. If they’re running on clients, then they’ll still work even though the client config doesn’t have enable_mesh_gateway_wan_federation set because the servers will have that set.

Hello lkysow,

Thank you for your answer. My issue is fixed now.

Regards,
Hedi Miladi.