Helm chart install fails in Azure AKS - attempting mesh proof of concept

I am attempting to set up a proof of concept of the mesh gateway/terminating gateway, and am using Azure's AKS service. When I attempt to install the Helm chart referenced in the documentation with:

helm install -f helm-consul-values.yaml consul hashicorp/consul --wait

The install fails with:

Error: serviceaccounts "consul-tls-init" already exists

Anyone else have difficulties with this?

Hi, thank you for your question, and I’m sorry you’re running into problems with the helm installation.

Could you provide a copy of the helm-consul-values.yaml that you used so I can try to reproduce it on my end?

Also, did you attempt a previous installation on the cluster which failed? You might need to do a little cleanup before proceeding in this case!
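As a quick check, you can see whether a leftover service account from an earlier attempt is what the install is colliding with - a sketch, assuming the default namespace and a release named consul:

kubectl get serviceaccounts | grep consul
kubectl get serviceaccount consul-tls-init -o yaml

If the consul-tls-init service account shows up even though nothing else from the release is running, deleting it (on a dev cluster) should let the install proceed.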

Hi, thanks for your help - the values are below; I used the values provided on the website.

I’ve attempted installs of other consul products using docker/kube tools, and while I expended significant effort ensuring everything was cleared out, I am relatively new to k8s (particularly on Azure) and am happy to get any tips on anything I might have missed.

global:
  name: consul
  image: consul:1.8.0
  imageK8S: hashicorp/consul-k8s:0.16.0
  datacenter: dc1
  federation:
    enabled: true
    createFederationSecret: true
  tls:
    enabled: true
meshGateway:
  enabled: true
connectInject:
  enabled: true

Great, thanks!
I’ll give this yaml a try on my end and get back to you as soon as I can, although I suspect that you just need a proper cleanup and that may unblock you.

I usually run the following on my end to clean up while I’m doing development as sometimes I have things in bad states:

  • helm del consul [where consul is your global.name, and assuming helm3]
  • kubectl delete pvc -l "release=consul"
  • kubectl get secret | grep $1- | cut -d' ' -f1 | xargs -I{} kubectl delete secret {} [this comes from a script where $1 is the release name, e.g. consul]
    NOTE: do not run this on production systems, instead manually remove consul secrets

You may also want to double-check that you do not have any stale serviceaccounts via kubectl get serviceaccounts and remove any consul ones.
Generally speaking, if kubectl get all isn't showing anything from consul, you're good to go!
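Putting those steps together, a rough cleanup sketch (dev clusters only - assuming helm 3, the default namespace, and a release named consul):

RELEASE=consul
helm del "$RELEASE"                         # uninstall the release (helm 3)
kubectl delete pvc -l "release=$RELEASE"    # remove the server data volumes
# remove leftover consul secrets - do NOT do this blindly on production
kubectl get secret | grep "$RELEASE-" | cut -d' ' -f1 | xargs -I{} kubectl delete secret {}
# remove any stale consul service accounts
kubectl get serviceaccounts | grep "$RELEASE-" | cut -d' ' -f1 | xargs -I{} kubectl delete serviceaccount {}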

Hi, thanks for those tips - right now, kubectl get all shows only:

NAME                 TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)   AGE
service/kubernetes   ClusterIP   10.0.0.1     <none>        443/TCP   94m

That service can't be removed, of course.

The only secret remaining is one that keeps recreating itself and appears to belong to the "default" serviceaccount:

[jburns@reachback-mysql mesh-take2]$ kubectl describe serviceaccount default
Name:                default
Namespace:           default
Labels:              <none>
Annotations:         <none>
Image pull secrets:  <none>
Mountable secrets:   default-token-2w9ts
Tokens:              default-token-2w9ts
Events:              <none>

Should this service account be removed? That’s the only thing that remained the last time; everything else was cleared. Most of it had to be cleared manually before “helm delete” would work.

The default service account and its token secret aren't part of Consul, so no need to touch those; they come from Kubernetes.
In theory you should be able to install at this point.

I’ll keep looking into it on my end and get back to you as soon as I have more information.

Thanks. FYI, I did a “describe” of the failed pods before I wiped things clean again:

consul-server-0:

Events:
  Type     Reason            Age                   From               Message
  ----     ------            ----                  ----               -------
  Warning  FailedScheduling  44m (x5 over 44m)     default-scheduler  pod has unbound immediate PersistentVolumeClaims
  Warning  FailedScheduling  3m53s (x28 over 43m)  default-scheduler  0/1 nodes are available: 1 node(s) didn't match pod affinity/anti-affinity, 1 node(s) didn't satisfy existing pods anti-affinity rules.

consul-mesh-gateway-646ffcbddd-zh6s8:

Events:
  Type     Reason     Age                     From                                        Message
  ----     ------     ----                    ----                                        -------
  Normal   Scheduled  48m                     default-scheduler                           Successfully assigned default/consul-mesh-gateway-646ffcbddd-zh6s8 to aks-nodepool1-38275208-vmss000000
  Normal   Pulled     48m                     kubelet, aks-nodepool1-38275208-vmss000000  Container image "consul:1.8.0" already present on machine
  Normal   Created    48m                     kubelet, aks-nodepool1-38275208-vmss000000  Created container copy-consul-bin
  Normal   Started    48m                     kubelet, aks-nodepool1-38275208-vmss000000  Started container copy-consul-bin
  Normal   Pulled     47m (x5 over 48m)       kubelet, aks-nodepool1-38275208-vmss000000  Container image "hashicorp/consul-k8s:0.16.0" already present on machine
  Normal   Created    46m (x5 over 48m)       kubelet, aks-nodepool1-38275208-vmss000000  Created container service-init
  Normal   Started    46m (x5 over 48m)       kubelet, aks-nodepool1-38275208-vmss000000  Started container service-init
  Warning  BackOff    3m42s (x202 over 48m)   kubelet, aks-nodepool1-38275208-vmss000000  Back-off restarting failed container

FYI, as this is a proof-of-concept, I am not observing best practices for production - I just have a single node and am trying to be efficient with the use of Azure resources.

Ah, this may not be the direct cause of the initial issue, but the second issue you're seeing is definitely a misconfiguration. For a single-node installation you also need to set:
server.replicas: 1
and
server.bootstrapExpect: 1
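In your values file, that corresponds to adding a server block alongside the sections you already have - a minimal sketch:

server:
  replicas: 1
  bootstrapExpect: 1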

Thanks, will alter the .yaml file and try a re-install. Also getting this:

~ kubectl logs consul-connect-injector-webhook-deployment-8d678b78b-zf2ps
flag provided but not defined: -init-container-memory-limit

Have the flags changed since the containers were built?

kubectl describe pod consul-connect-injector-webhook-deployment-8d678b78b-qttvv
Events:
  Type     Reason     Age                     From                                        Message
  ----     ------     ----                    ----                                        -------
  Normal   Scheduled  4m50s                   default-scheduler                           Successfully assigned default/consul-connect-injector-webhook-deployment-8d678b78b-qttvv to aks-nodepool1-38275208-vmss000000
  Warning  BackOff    3m33s (x10 over 4m44s)  kubelet, aks-nodepool1-38275208-vmss000000  Back-off restarting failed container
  Normal   Pulled     3m18s (x5 over 4m49s)   kubelet, aks-nodepool1-38275208-vmss000000  Container image "hashicorp/consul-k8s:0.16.0" already present on machine
  Normal   Created    3m17s (x5 over 4m49s)   kubelet, aks-nodepool1-38275208-vmss000000  Created container sidecar-injector
  Normal   Started    3m17s (x5 over 4m48s)   kubelet, aks-nodepool1-38275208-vmss000000  Started container sidecar-injector

Hi! Yes, there have been breaking changes around consul-k8s 0.18 that now require container resource limits.

I'd recommend using the latest consul-helm chart and removing the two image: and imageK8S: entries from your yaml file so it pulls the latest defaults, which include fixes for resource limits and settings; that could also be the cause of the restart loop you're in.
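To confirm you're actually pulling the newest chart, something along these lines should work (assuming the hashicorp repo is already added):

helm repo update
helm search repo hashicorp/consul

The search output shows the chart version and app version that a fresh install would use.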

Hi, still getting

Error: serviceaccounts "consul-tls-init" already exists

NAME               CHART VERSION   APP VERSION   DESCRIPTION
hashicorp/consul   0.24.1          1.8.2         Official HashiCorp Consul Chart

I removed the image and imageK8S entries; is there a way I can tell whether I'm using the latest consul-helm?

No luck with these changes; any luck on your end?

Hi @joshua-burns - so far no luck on my end; I'm unable to get it into a failed state. Here I'm using the following yaml file for helm:

$ cat x.yaml
global:
  name: kyle-consul
server:
  replicas: 1
  bootstrapExpect: 1
tls:
  enabled: true
connectInject:
  enabled: true
meshGateway:
  replicas: 1
  enabled: true
federation:
  enabled: true
datacenter: dc1

And I'm not seeing any problems with startup, nor the tls-init error you mentioned.

$ kubectl get pods
NAME                                                              READY   STATUS    RESTARTS   AGE
kyle-consul-connect-injector-webhook-deployment-6886d5898dm6hl9   1/1     Running   0          2m37s
kyle-consul-mesh-gateway-94f997ff9-jzlnj                          2/2     Running   0          2m37s
kyle-consul-mkpjz                                                 1/1     Running   0          2m37s
kyle-consul-server-0                                              1/1     Running   0          2m37s

This was a fresh EKS cluster that I provisioned just now using the example from EKS’s docs.

eksctl create cluster \
--name kyle-eks-test \
--version 1.16 \
--region us-east-2 \
--nodegroup-name linux-nodes \
--node-type t3.medium \
--nodes 1 \
--nodes-min 1 \
--nodes-max 3 \
--ssh-access \
--managed
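For comparison, a rough AKS equivalent of that single-node setup (the resource group and cluster names here are just placeholders) would be something like:

az aks create \
  --resource-group my-resource-group \
  --name my-aks-test \
  --node-count 1 \
  --node-vm-size Standard_DS2_v2 \
  --generate-ssh-keys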

The only thing I see is a brief crash from mesh-gateway as it starts up, but this clears once the consul-server is online a second or two later and everything comes online cleanly.
EDIT: I should say that I cleaned it up using the commands I provided earlier and reinstalled, and that also worked.

Hi Kyle,

A few notes:

  1. This is on Azure, but AFAIK the provisioning steps are analogous; I have successfully deployed other consul images previously.

  2. Your yaml values are different from the ones on the documentation page. I ran your values instead, got an error ("timed out waiting for the condition"), and now see this:

NAME                                                          READY   STATUS      RESTARTS   AGE
consul-connect-injector-webhook-deployment-785cc55456-gsfhg   1/1     Running     0          5m23s
consul-mesh-gateway-964bdfc75-sq8p4                           2/2     Running     0          5m23s
consul-r595r                                                  1/1     Running     0          5m23s
consul-server-0                                               1/1     Running     0          5m22s
consul-test                                                   0/1     Completed   0          5m22s

BUT, helm reports “failed”:

NAME     NAMESPACE   REVISION   UPDATED                                   STATUS   CHART           APP VERSION
consul   default     1          2020-08-27 16:30:29.897648373 -0400 EDT   failed   consul-0.24.1   1.8.2