Mesh Gateway federation woes!

Greetings!

I’ve been struggling off and on for two weeks trying to federate two Kubernetes clusters into a single mesh.

`kubectl get proxydefaults global -n consul` reports SYNCED=True, but `consul members -wan` shows a status of failed for the members on each side of the mesh gateway.

My understanding was that with the mesh gateway in local mode, all cross-datacenter traffic goes through the local gateway, yet the members complain that they can’t reach their counterparts on the other side. What am I missing?

meshGateway:
    enabled: true
    replicas: 1
    service:
        enabled: true
        type: NodePort
    wanAddress:
        enabled: true
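For anyone triaging the same symptom, here is one way to filter `consul members -wan` output down to unhealthy members. The sample output below is fabricated for illustration, since the real command needs running Consul servers in both datacenters.

```shell
# Fabricated sample of `consul members -wan` output (a real run requires
# live Consul servers in both datacenters).
members_output='Node                 Address         Status  Type    Build   Protocol  DC
consul-server-0.dc1  10.0.1.10:8302  alive   server  1.10.0  2         dc1
consul-server-0.dc2  10.0.2.10:8302  failed  server  1.10.0  2         dc2'

# Column 3 is Status; print every member that is not "alive".
echo "$members_output" | awk 'NR > 1 && $3 != "alive" { print $1, $3 }'
```

On a live cluster you would pipe the real command instead: `consul members -wan | awk 'NR > 1 && $3 != "alive"'`.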

Hi @mister2d, did you by chance configure a ProxyDefaults entry that directs all services through the local mesh gateway? Something like the configuration below:

apiVersion: consul.hashicorp.com/v1alpha1
kind: ProxyDefaults
metadata:
  name: global
spec:
  meshGateway:
    mode: local

We have a Learn tutorial that guides practitioners through securing traffic across two Kubernetes clusters. It uses mesh gateways in local mode:

# Secure Service Mesh Communication Across Kubernetes Clusters

Hi @karl-cardenas-coding!

Yes, I am using that exact configuration. Does it matter that my Kubernetes cluster binds to a non-routable interface? The Consul pods run in the 10.0.0.0/16 range. But I thought that didn’t matter as long as the mesh gateways were in local mode and showed as in sync.

The Consul federation token shows this in the secret value:

{
    "primary_datacenter": "dc1",
    "primary_gateways": ["cluster1.example.com:30085"]
}

I’ve confirmed that cluster1.example.com:30085 is reachable with netcat.
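As an aside, a quick way to see what the federation secret actually advertises is to decode `serverConfigJSON` and pull out `primary_gateways`. The sketch below uses a hard-coded sample value; on a real cluster you would pipe from `kubectl get secret consul-federation -o jsonpath='{.data.serverConfigJSON}'` instead (secret and key names taken from the configs in this thread).

```shell
# Hard-coded sample standing in for the base64-encoded serverConfigJSON
# value from the consul-federation secret.
sample_b64=$(printf '%s' '{"primary_datacenter":"dc1","primary_gateways":["cluster1.example.com:30085"]}' | base64)

# Decode and extract the primary_gateways address with sed (avoids a jq
# dependency; assumes a single gateway entry).
printf '%s' "$sample_b64" | base64 -d | sed -n 's/.*"primary_gateways":\["\([^"]*\)"\].*/\1/p'
```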

How did you WAN-join the two datacenters? Also, could you share your datacenter configurations, please?

Here are my Helm chart values.

DC1:

global:
    datacenter: dc1
    name: consul
    domain: consul
    tls:
        enabled: true
        enableAutoEncrypt: true
        serverAdditionalDNSSANs:
            - "consul-server.consul.svc.cluster.local"
    federation:
        enabled: true
        createFederationSecret: true
    acls:
        manageSystemACLs: true
        createReplicationToken: true
    gossipEncryption:
        autoGenerate: true
    logJSON: true
connectInject:
    enabled: true
    default: false
controller:
    enabled: true
meshGateway:
    enabled: true
    replicas: 1
    service:
        enabled: true
        type: NodePort
        nodePort: 30085
    wanAddress:
        enabled: true
        HostNetwork: true
syncCatalog:
    enabled: true
    default: true
    toConsul: true
    toK8S: true
metrics:
    enabled: true
prometheus:
    enabled: true
ui:
    enabled: true
    service:
        type: NodePort
        nodePort:
            https: 30084
server:
    replicas: 3
    securityContext:
        runAsNonRoot: false
        runAsUser: 0
    service:
        type: NodePort
client:
    securityContext:
        runAsNonRoot: false
        runAsUser: 0

DC2:

global:
    datacenter: dc2
    name: consul
    domain: consul
    tls:
        enabled: true
        enableAutoEncrypt: true
        serverAdditionalDNSSANs:
            - "consul-server.consul.svc.cluster.local"
        caCert:
            secretName: consul-federation
            secretKey: caCert
        caKey:
            secretName: consul-federation
            secretKey: caKey
    acls:
        manageSystemACLs: true
        replicationToken:
            secretName: consul-federation
            secretKey: replicationToken
    federation:
        enabled: true
    gossipEncryption:
        secretName: consul-federation
        secretKey: gossipEncryptionKey
    logJSON: true
connectInject:
    enabled: true
    default: false
controller:
    enabled: true
meshGateway:
    enabled: true
    replicas: 1
    service:
        enabled: true
        type: NodePort
        nodePort: 30085
    wanAddress:
        enabled: true
syncCatalog:
    enabled: true
    default: true
    toConsul: true
    toK8S: true
metrics:
    enabled: true
prometheus:
    enabled: true
ui:
    enabled: true
    service:
        type: NodePort
        nodePort:
            https: 30084
server:
    replicas: 1
    securityContext:
        runAsNonRoot: false
        runAsUser: 0
    extraVolumes:
        - type: secret
          name: consul-federation
          items:
              - key: serverConfigJSON
                path: config.json
          load: true
client:
    securityContext:
        runAsNonRoot: false
        runAsUser: 0

@karl-cardenas-coding

I was able to figure it out. `primary_gateways` was set to the Pod’s service IP, which was not routable from the other cluster. I had to set `wanAddress` to a static value reflecting the FQDN of the WAN interface. Only then did the clusters fully sync. Running `consul members -wan` now shows all server nodes with an “alive” status.
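That root cause can be sanity-checked mechanically: whatever address `primary_gateways` carries must be routable from the peer cluster, i.e. not inside the 10.0.0.0/16 Pod range mentioned earlier. Below is a crude prefix check; the address is a made-up example, and the string match only works for octet-aligned prefixes like /16.

```shell
# Made-up example address; on a real setup this would come from resolving
# the gateway FQDN, e.g. `getent hosts cluster1.example.com`.
addr="10.0.3.7"

# Crude membership test for 10.0.0.0/16: match on the first two octets.
case "$addr" in
  10.0.*) echo "WARN: $addr is inside 10.0.0.0/16 - likely not routable from the peer cluster" ;;
  *)      echo "OK: $addr looks routable" ;;
esac
```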

Working Helm values:

DC1:

global:
    datacenter: dc1
    name: consul
    domain: consul
    tls:
        enabled: true
        enableAutoEncrypt: true
        serverAdditionalDNSSANs:
            - "consul-server.consul.svc.cluster.local"
    federation:
        enabled: true
        createFederationSecret: true
    acls:
        manageSystemACLs: true
        createReplicationToken: true
    gossipEncryption:
        autoGenerate: true
    logJSON: true
connectInject:
    enabled: true
    default: false
controller:
    enabled: true
meshGateway:
    enabled: true
    replicas: 1
    service:
        enabled: true
        type: NodePort
        nodePort: 30085
    wanAddress:
        enabled: true
        source: "Static"
        static: "dc1.example.com"
        port: 30085
syncCatalog:
    enabled: true
    default: true
    toConsul: true
    toK8S: true
metrics:
    enabled: true
prometheus:
    enabled: true
ui:
    enabled: true
    service:
        type: NodePort
        nodePort:
            https: 30084
server:
    replicas: 3
    securityContext:
        runAsNonRoot: false
        runAsUser: 0
    service:
        type: NodePort
client:
    securityContext:
        runAsNonRoot: false
        runAsUser: 0

DC2:

global:
    datacenter: dc2
    name: consul
    domain: consul
    tls:
        enabled: true
        enableAutoEncrypt: true
        serverAdditionalDNSSANs:
            - "consul-server.consul.svc.cluster.local"
        caCert:
            secretName: consul-federation
            secretKey: caCert
        caKey:
            secretName: consul-federation
            secretKey: caKey
    acls:
        manageSystemACLs: true
        replicationToken:
            secretName: consul-federation
            secretKey: replicationToken
    federation:
        enabled: true
    gossipEncryption:
        secretName: consul-federation
        secretKey: gossipEncryptionKey
    logJSON: true
connectInject:
    enabled: true
    default: false
controller:
    enabled: true
meshGateway:
    enabled: true
    replicas: 1
    service:
        enabled: true
        type: NodePort
        nodePort: 30085
    wanAddress:
        enabled: true
        source: "Static"
        static: "dc2.example.com"
        port: 30085
syncCatalog:
    enabled: true
    default: true
    toConsul: true
    toK8S: true
metrics:
    enabled: true
prometheus:
    enabled: true
ui:
    enabled: true
    service:
        type: NodePort
        nodePort:
            https: 30084
server:
    replicas: 1
    securityContext:
        runAsNonRoot: false
        runAsUser: 0
    extraVolumes:
        - type: secret
          name: consul-federation
          items:
              - key: serverConfigJSON
                path: config.json
          load: true
client:
    securityContext:
        runAsNonRoot: false
        runAsUser: 0

Nice! Is there a way for us to document this better?

I would say yes it should be documented better. I’ve seen a few topics around this same issue without a clear solution.

I’m guessing you’d like me to submit a PR. :sweat_smile: