Running Consul clients in AKS while Consul servers run in AWS

Hi,

We already have a Consul server cluster running on EC2 in AWS. Now we are extending some services to run in Azure AKS. To secure service-to-service connections, we are about to deploy Consul clients in AKS. As the AKS API endpoint is private, I am wondering whether the Consul servers would be able to connect to the AKS API (in Azure, a private endpoint can only be reached from the VNet the AKS cluster lives in). I read somewhere that

k8sAuthMethodHost should be set to the address of your Kubernetes API server so that the Consul servers can validate a Kubernetes service account token when using the Kubernetes auth method with consul login
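
For reference, my understanding is that this setting lives under externalServers in the Helm values, along these lines (a sketch; the hostname is a placeholder for the AKS API server):

externalServers:
  enabled: true
  k8sAuthMethodHost: 'https://<aks-api-server>:443'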

Is the private AKS endpoint an obstacle to integrating AKS services with the Consul cluster in AWS?
If it is, what would the solution be for integrating AKS services with the Consul servers? Should I deploy another Consul cluster in AKS and somehow connect it with the existing one in AWS?

Thank you!

BR,
Miroslav

What I realized just now: a Consul client running in Kubernetes, in order to join a Consul cluster running outside of Kubernetes, has to be on the same LAN as that cluster. That's another reason why a Consul client running in AKS cannot join the Consul cluster on the AWS side. That's what the docs say, at least.

For that use case it's best to run separate Consul datacenters in each Kubernetes cluster, i.e. separate Consul server clusters.

Thank you! I have two Consul clusters: the existing one running on VMs and another, new one created in Kubernetes (AKS), and I am facing some challenges joining those clusters (doing federation using mesh gateways). Can you please help me: what is the common place to run a mesh gateway in a VM Consul cluster? Should I create a separate VM and install a Consul agent there, or can I host the mesh gateway on a Consul server, for example? I am also wondering which port to start the mesh gateway on. My VM cluster is TLS enabled.
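
From the docs, my understanding is that on VMs the gateway is just an Envoy instance launched via the Consul CLI, something like this sketch (8443 is only the conventional port; addresses are placeholders):

consul connect envoy -gateway=mesh -register -service "mesh-gateway-primary" -address "<private-ip>:8443" -wan-address "<wan-ip>:8443"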

I have just realized that I might not need a mesh gateway at all. We have the primary Consul cluster running on VMs in AWS and the secondary in Kubernetes (AKS) in Azure. As we have a VPN between AWS and Azure, it should be possible to achieve federation between the Consul datacenters using a single WAN gossip pool. That would mean no need to add extra complexity with a mesh gateway.
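
If I go that route, my understanding is that federating is just a matter of joining the servers into the WAN gossip pool, e.g. (a sketch; addresses are placeholders):

consul join -wan <aws-server-ip> <aks-server-ip>

or, persistently, via retry_join_wan in the server config:

retry_join_wan = ["<aws-server-ip>"]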

Can your VMs route to Pod IPs? If so, then you’re correct that a mesh gateway is not needed.
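
A quick way to check from a VM would be something like (the pod IP is a placeholder; 8300 is the default server RPC port):

nc -vz <server-pod-ip> 8300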

Note that k8s federation through the Helm chart is really only supported via mesh gateways. Basically, most of the federation features on k8s assume you're using mesh gateways, so if you don't use mesh gateways you may run into some edge cases.

Hi @lkysow, I realized that! I now have one Consul cluster running on VMs and another in an AKS cluster, connected through mesh gateways. The clusters joined and everything was working as expected. But now I am facing an issue when enabling ACLs in the Helm chart. ACLs are also enabled in the VM cluster, which is my primary Consul cluster. So, if I set

acls:
  manageSystemACLs: false

it's working as expected. But if I change manageSystemACLs to true, the pod fails to start. More precisely, the consul-connect-inject-init container fails, like this:

kubectl logs pods/fleet-mgr --container=consul-connect-inject-init
2021-11-05T15:45:39.955Z [ERROR] Consul login failed; retrying: error="error logging in: Unexpected response code: 403 (rpc error making call: rpc error making call: Permission denied)"
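
My understanding is that consul-connect-inject-init runs consul login, exchanging the pod's Kubernetes service account JWT for a Consul ACL token via the chart's auth method; a way to inspect the auth methods in this datacenter would be (a sketch; needs a token with ACL read permissions):

consul acl auth-method list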

This is my Helm config:

global:
  name: azure
  datacenter: clusterwest
  tls:
    enabled: true
    caCert:
      secretName: consul-federation
      secretKey: caCert
    caKey:
      secretName: consul-federation
      secretKey: caKey

  acls:
    manageSystemACLs: true
    replicationToken:
      secretName: consul-federation
      secretKey: replicationToken

  federation:
    enabled: true

  gossipEncryption:
    secretName: consul-federation
    secretKey: gossipEncryptionKey

connectInject:
  enabled: true
controller:
  enabled: true
meshGateway:
  enabled: true
server:
  extraConfig: |
    {
      "primary_datacenter": "frame",
      "primary_gateways": ["10.242.89.235:19005"]
    }
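
Since my primary datacenter is on VMs, the consul-federation secret referenced above had to be created manually; a sketch of the command (file names and literal values are placeholders):

kubectl create secret generic consul-federation \
  --from-file=caCert=consul-agent-ca.pem \
  --from-file=caKey=consul-agent-ca-key.pem \
  --from-literal=replicationToken='<replication-token>' \
  --from-literal=gossipEncryptionKey='<gossip-key>'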

And this is the manifest file I am running my service from:

apiVersion: v1
kind: ServiceAccount
metadata:
  name: fleet-mgr

---
apiVersion: v1
kind: Service
metadata:
  name: fleet-mgr
spec:
  selector:
    app: fleet-mgr
  ports:
  - protocol: TCP
    port: 3000
    targetPort: afm

---
apiVersion: v1
kind: Pod
metadata:
  name: fleet-mgr
  labels:
    app: fleet-mgr
  annotations:
    "consul.hashicorp.com/connect-inject": "true"
spec:
  serviceAccountName: miroslav
  containers:
  - image: "clusters-docker-remote.artifactory.re.in.fra.me/fleet-mgr:latest"
    name: fleet-mgr
    env:
      - name: MACHINE_ALLOC_SLEEP_TIME_MS
        value: "2000"
    ports:
        - containerPort: 3000
          name: afm

Can you help me understand why the pod fails to start when ACLs are enabled in the Helm chart?

Hi, you’ll need to follow these instructions: Consul Servers Outside of Kubernetes - Kubernetes | Consul by HashiCorp

Hi @lkysow, are you sure I should follow the instructions from the link above? I would say that is not my use case, as it refers to the scenario where only Consul clients run in Kubernetes and need to join a Consul server cluster running on VMs. My use case is different: a full Consul cluster (not only clients) is running in Kubernetes. And I have a mesh gateway in both Consul clusters, and those clusters joined successfully. The Kubernetes and VM Consul clusters in my case are not on the same LAN; the VM cluster is running in AWS and the AKS cluster is in Azure. But I have VPN connectivity in between and all Consul servers/agents can reach each other.
When I tried to adjust my Helm config per the instructions from your link, I got the following error when installing the Helm chart with that config:

helm install azure hashicorp/consul -f config.yaml --wait
Error: INSTALLATION FAILED: execution error at (consul/templates/server-acl-init-job.yaml:2:65): only one of server.enabled or externalServers.enabled can be set

And I can understand this, because I was trying to apply a config that doesn't match my use case.
This is the Helm config I ran the above helm install with:

global:
  name: azure
  datacenter: clusterwest
  tls:
    enabled: true
    caCert:
      secretName: consul-federation
      secretKey: caCert
    caKey:
      secretName: consul-federation
      secretKey: caKey

  acls:
    manageSystemACLs: true
    bootstrapToken:
      secretName: bootstrap-token
      secretKey: token
    replicationToken:
      secretName: consul-federation
      secretKey: replicationToken

  federation:
    enabled: true

  gossipEncryption:
    secretName: consul-federation
    secretKey: gossipEncryptionKey

connectInject:
  enabled: true
controller:
  enabled: true
meshGateway:
  enabled: true
server:
  extraConfig: |
    {
      "primary_datacenter": "frame",
      "primary_gateways": ["10.242.89.235:19005"]
    }
externalServers:
  enabled: true
  hosts:
    - 'provider=aws tag_key=consul_auto_join tag_value=sysazure'
  k8sAuthMethodHost: 'https://kubernetes-cluster:443'
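
As I understand it, externalServers is only meant for a topology where the chart itself runs no servers, sketched roughly as:

server:
  enabled: false
externalServers:
  enabled: true

which is not my topology.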

I have also tried to set a different serviceAccount for the pod like you proposed here,
and my service pod failed to start again. But I could see in the log that the Consul login was successful. It then failed with this error:

kubectl logs pods/fleet-mgr --container=consul-connect-inject-init
2021-11-05T16:48:51.169Z [INFO]  Consul login complete
2021-11-05T16:48:51.171Z [ERROR] Unable to get Agent services: error="Unexpected response code: 403 (ACL not found)"

What do you think? Any idea what I should do?

Oops, you're right, that link did not apply. I didn't realize you had Consul servers running on both VMs and K8s.

Regarding the ACL issues, I think the problem is that the serviceAccountName doesn't match the name of the service being registered.

Here in your Pod spec:

  serviceAccountName: miroslav

It should instead match the name of the service; in your case I believe that's fleet-mgr.
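
For context, the chart's Kubernetes auth method is created with a default binding rule that maps the service account name to a Consul service identity, which is why the two names have to line up. Roughly (a sketch; the auth method name depends on your release):

consul acl binding-rule list -method=<release>-k8s-auth-method
# shows a rule with Bind Type: service and Bind Name: ${serviceaccount.name}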

That should work. You don't need to change the default binding rule; I'd remove that config.

Hi @lkysow, thank you for your answer. I set the serviceAccountName like you proposed, but I still get the same error:

kubectl logs pods/fleet-mgr --container=consul-connect-inject-init
2021-11-08T11:02:14.119Z [INFO]  Consul login complete
2021-11-08T11:02:14.121Z [ERROR] Unable to get Agent services: error="Unexpected response code: 403 (ACL not found)"
2021-11-08T11:02:15.122Z [ERROR] Unable to get Agent services: error="Unexpected response code: 403 (ACL not found)"
2021-11-08T11:02:16.123Z [ERROR] Unable to get Agent services: error="Unexpected response code: 403 (ACL not found)"

This is my current setup for the service manifest and the Consul Helm config.
afm.yaml:

apiVersion: v1
kind: ServiceAccount
metadata:
  name: fleet-mgr

---
apiVersion: v1
kind: Service
metadata:
  name: fleet-mgr
spec:
  selector:
    app: fleet-mgr
  ports:
  - protocol: TCP
    port: 3000
    targetPort: afm

---
apiVersion: v1
kind: Pod
metadata:
  name: fleet-mgr
  labels:
    app: fleet-mgr
  annotations:
    "consul.hashicorp.com/connect-inject": "true"
spec:
  serviceAccountName: fleet-mgr
  containers:
  - image: "clusters-docker-remote.artifactory.re.in.fra.me/fleet-mgr:latest"
    name: fleet-mgr
    env:
      - name: MACHINE_ALLOC_SLEEP_TIME_MS
        value: "2000"
    ports:
        - containerPort: 3000
          name: afm

config.yaml:

global:
  name: bmaas
  datacenter: bmaaswest
  tls:
    enabled: true
    caCert:
      secretName: consul-federation
      secretKey: caCert
    caKey:
      secretName: consul-federation
      secretKey: caKey

  acls:
    manageSystemACLs: true
    replicationToken:
      secretName: consul-federation
      secretKey: replicationToken

  federation:
    enabled: true

  gossipEncryption:
    secretName: consul-federation
    secretKey: gossipEncryptionKey

connectInject:
  enabled: true
controller:
  enabled: true
meshGateway:
  enabled: true
server:
  extraConfig: |
    {
      "primary_datacenter": "frame",
      "primary_gateways": ["10.242.89.235:19005"]
    }

Actually, this is working now. It managed to connect in the end. It seems I just had to wait a bit longer:

root@west-vm:/home/azureuser# kubectl logs pods/fleet-mgr --container=consul-connect-inject-init
2021-11-08T11:41:06.161Z [INFO]  Consul login complete
2021-11-08T11:41:06.162Z [ERROR] Unable to get Agent services: error="Unexpected response code: 403 (ACL not found)"
2021-11-08T11:41:07.163Z [ERROR] Unable to get Agent services: error="Unexpected response code: 403 (ACL not found)"
2021-11-08T11:41:08.164Z [ERROR] Unable to get Agent services: error="Unexpected response code: 403 (ACL not found)"
... (the same ACL not found error repeated roughly once per second) ...
2021-11-08T11:41:35.205Z [ERROR] Unable to get Agent services: error="Unexpected response code: 403 (ACL not found)"
2021-11-08T11:41:36.208Z [INFO]  Registered service has been detected: service=fleet-mgr-sidecar-proxy
2021-11-08T11:41:36.208Z [INFO]  Registered service has been detected: service=fleet-mgr
2021-11-08T11:41:36.208Z [INFO]  Connect initialization completed
    Successfully applied traffic redirection rules
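
Presumably the retries were just ACL token replication catching up in the secondary datacenter. For future reference, one way to check replication progress via the HTTP API (a sketch; the address is a placeholder for one of my secondary servers):

curl -s https://<secondary-server>:8501/v1/acl/replication?pretty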

Hi @lkysow,
after successfully starting the service pod with ACLs enabled on the Kubernetes side, I noticed that I am now failing to access the Kubernetes services from the VM cluster UI. I get a 500 error, as in the picture, when I try to switch to the Kubernetes datacenter where ACLs are enabled. At the same time, I can access Kubernetes services in the datacenter where ACLs are disabled. I am logged in with the bootstrap token in the VM Consul UI. Is there a way to view services from datacenters where ACLs are enabled from the VM Consul UI? How can I obtain a token that has rights to access another datacenter where ACLs are enabled?

The 500 is unlikely to be due to ACLs. What is the URL in that screenshot? Usually a 500 means the datacenter where the UI is hosted can't talk to the other datacenter you're selecting. The first step to debug that would be to look at the server logs in the datacenter where the UI is hosted.

Can you also run the Verifying Federation steps: Federation Between Kubernetes Clusters | Consul by HashiCorp
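
Those steps essentially check the WAN pool from both sides, along the lines of (the statefulset name is a placeholder that depends on your release):

kubectl exec statefulset/<release>-server -- consul members -wan

and on the VM servers:

consul members -wan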

Yes, you are right. I'm not sure if it's only the case with my environment, but I have to restart the consul service on the Consul server VM where the mesh gateway is hosted after deploying a new secondary datacenter (the VM cluster is my primary datacenter). Only then do I have connectivity with the Kubernetes cluster, and only then can I also see all datacenters in the UI. Is that expected behavior?
Thank you a lot for your advice!

Is that expected behavior?

No, that’s not expected. Did you happen to catch the logs before/after the restart?

I attached some logs I captured around the time the consul service restart happened on the server where the mesh gateway is hosted. These are the IPs of the server VMs in the primary Consul cluster:

(LAN) joining: lan_addresses=[10.242.89.235, 10.242.90.214, 10.242.88.195]

The primary cluster's datacenter name is frame.
The secondary Consul cluster running in Kubernetes has the following IPs:

azureeast2-server-0            1/1     Running   0          8h    10.253.176.44   
azureeast2-server-1            1/1     Running   0          8h    10.253.176.71   
azureeast2-server-2            1/1     Running   0          8h    10.253.176.18  

The datacenter name in the secondary is 'clustereast2'.
This is the exec command I used in systemd to start the mesh gateway on the VM:

ExecStart=/usr/bin/consul connect envoy -gateway=mesh -register -service "mesh-gateway-primary" -address "10.242.89.235:19005" -wan-address "10.242.89.235:19005" -grpc-addr=https://127.0.0.1:8502 -ca-file=/opt/frame/ssl/consul-agent-ca.pem -expose-servers -token=01d0e346-67dd-4644-ad79-7d83442b3bb0

The error that I see quite often in the VM log is:

[ERROR] agent.server.memberlist.wan: memberlist: Failed to send gossip to 10.253.160.54:8302: write tcp 10.242.89.235:53140->10.242.89.235:19005: write: broken pipe

where 10.242.89.235:19005 is the mesh gateway address. (As I understand it, with WAN federation via mesh gateways the servers dial cross-datacenter gossip and RPC through their local gateway, so a failure to reach the remote side shows up as a reset or broken pipe on the local gateway address.)

In fact, after deploying a new (or recreating an existing) secondary Consul cluster in Kubernetes (AKS), these are the logs I see on the Consul VM, along with the systemd service status for consul:

2021-11-09T11:09:52.359128+00:00 frame-consul10-242-89-235 consul[7764]: 2021-11-09T11:09:52.358Z [ERROR] agent.server.rpc: RPC failed to server in DC: server=10.253.168.27:8300 datacenter=clusterwest method=Internal.ServiceDump error="rpc error getting client: failed to get conn: read tcp 10.242.89.235:36928->10.242.89.235:19005: read: connection reset by peer"
2021-11-09T11:09:55.243253+00:00 frame-consul10-242-89-235 consul[7764]: 2021-11-09T11:09:55.242Z [ERROR] agent.server.rpc: RPC failed to server in DC: server=10.253.168.70:8300 datacenter=clusterwest method=Health.ServiceNodes error="rpc error getting client: failed to get conn: read tcp 10.242.89.235:36940->10.242.89.235:19005: read: connection reset by peer"
2021-11-09T11:09:58.937021+00:00 frame-consul10-242-89-235 consul[7764]: 2021-11-09T11:09:58.936Z [ERROR] agent.server.memberlist.wan: memberlist: Failed to send gossip to 10.253.168.27:8302: read tcp 10.242.89.235:36966->10.242.89.235:19005: read: connection reset by peer
2021-11-09T11:09:59.236650+00:00 frame-consul10-242-89-235 consul[7764]: 2021-11-09T11:09:59.236Z [ERROR] agent.server.memberlist.wan: memberlist: Failed to send ping: read tcp 10.242.89.235:36970->10.242.89.235:19005: read: connection reset by peer
2021-11-09T11:09:59.954264+00:00 frame-consul10-242-89-235 consul[7764]: 2021-11-09T11:09:59.953Z [INFO]  agent.server.serf.wan: serf: EventMemberJoin: azure-server-1.clusterwest 10.253.168.66
2021-11-09T11:09:59.954500+00:00 frame-consul10-242-89-235 consul[7764]: 2021-11-09T11:09:59.954Z [INFO]  agent.server: Handled event for server in area: event=member-join server=azure-server-1.clusterwest area=wan
2021-11-09T11:10:00.244459+00:00 frame-consul10-242-89-235 consul[7764]: 2021-11-09T11:10:00.244Z [ERROR] agent.server.memberlist.wan: memberlist: Failed to send gossip to 10.253.168.66:8302: read tcp 10.242.89.235:36982->10.242.89.235:19005: read: connection reset by peer


systemctl status consul -ll
● consul.service - "HashiCorp Consul - A service mesh solution"
   Loaded: loaded (/etc/systemd/system/consul.service; enabled; vendor preset: disabled)
   Active: active (running) since Mon 2021-11-08 23:00:48 UTC; 12h ago
     Docs: https://www.consul.io/
 Main PID: 7764 (consul)
   CGroup: /system.slice/consul.service
           └─7764 /usr/bin/consul agent -config-dir=/etc/consul.d

Nov 09 11:15:04 frame-consul10-242-89-235 consul[7764]: 2021-11-09T11:15:04.939Z [ERROR] agent.server.memberlist.wan: memberlist: Failed to send gossip to 10.253.168.27:8302: read tcp 10.242.89.235:38920->10.242.89.235:19005: read: connection reset by peer
Nov 09 11:15:05 frame-consul10-242-89-235 consul[7764]: 2021-11-09T11:15:05.747Z [ERROR] agent.server.memberlist.wan: memberlist: Failed to forward ack: read tcp 10.242.89.235:38948->10.242.89.235:19005: read: connection reset by peer from=10.253.168.27:8302
Nov 09 11:15:05 frame-consul10-242-89-235 consul[7764]: 2021-11-09T11:15:05.747Z [ERROR] agent.server.memberlist.wan: memberlist: Failed to send gossip to 10.253.168.70:8302: read tcp 10.242.89.235:38938->10.242.89.235:19005: read: connection reset by peer
Nov 09 11:15:08 frame-consul10-242-89-235 consul[7764]: 2021-11-09T11:15:08.682Z [ERROR] agent.server.memberlist.wan: memberlist: Failed to send gossip to 10.253.168.70:8302: read tcp 10.242.89.235:38958->10.242.89.235:19005: read: connection reset by peer
Nov 09 11:15:09 frame-consul10-242-89-235 consul[7764]: 2021-11-09T11:15:09.940Z [ERROR] agent.server.memberlist.wan: memberlist: Failed to send gossip to 10.253.168.66:8302: read tcp 10.242.89.235:38964->10.242.89.235:19005: read: connection reset by peer
Nov 09 11:15:09 frame-consul10-242-89-235 consul[7764]: 2021-11-09T11:15:09.943Z [ERROR] agent.server.memberlist.wan: memberlist: Failed to send gossip to 10.253.168.27:8302: read tcp 10.242.89.235:38974->10.242.89.235:19005: read: connection reset by peer
Nov 09 11:15:10 frame-consul10-242-89-235 consul[7764]: 2021-11-09T11:15:10.369Z [ERROR] agent.server.memberlist.wan: memberlist: Failed to send gossip to 10.253.168.70:8302: read tcp 10.242.89.235:38978->10.242.89.235:19005: read: connection reset by peer
Nov 09 11:15:10 frame-consul10-242-89-235 consul[7764]: 2021-11-09T11:15:10.437Z [ERROR] agent.server.memberlist.wan: memberlist: Failed to send gossip to 10.253.168.27:8302: read tcp 10.242.89.235:38982->10.242.89.235:19005: read: connection reset by peer
Nov 09 11:15:11 frame-consul10-242-89-235 consul[7764]: 2021-11-09T11:15:11.735Z [ERROR] agent.server.rpc: RPC failed to server in DC: server=10.253.168.70:8300 datacenter=clusterwest method=Health.ServiceNodes error="rpc error getting client: failed to get conn: read tcp 10.242.89.235:39000->10.242.89.235:19005: read: connection reset by peer"
Nov 09 11:15:11 frame-consul10-242-89-235 consul[7764]: 2021-11-09T11:15:11.736Z [ERROR] agent.server.memberlist.wan: memberlist: Failed to forward ack: read tcp 10.242.89.235:38986->10.242.89.235:19005: read: connection reset by peer from=10.253.168.70:8302

And only after restarting the consul service on the VM can I reach the service running in the newly deployed Kubernetes Consul cluster.

Consul-VM-AKS-federation-logs.txt (64.7 KB)