Pod connect-injector-webhook-deployment in CrashLoopBackOff state

zvik · July 28, 2021, 4:30am

Ahoy community,
I have an issue that I had seen happened in the past to others , but could not find a solution to it in my env.
Story: I have a new installation of Docker (20.10.7) and k8s (v1.21.3) on centos (release 7.9.2009). helm is “v3.6.3”.
This is a one node k8s cluster:
[root@home3vm3 deploy]# k get nodes
NAME STATUS ROLES AGE VERSION
home3vm3 Ready control-plane,master 5h10m v1.21.3
I try to run the procedure “Getting Started with Consul Service Mesh for Kubernetes” (Getting Started with Consul Service Mesh for Kubernetes | Consul - HashiCorp Learn) and all pods are in running except for pod consul-connect-injector-webhook-deployment that shows state of CrashLoopBackOff.
k get pods:
NAME READY STATUS RESTARTS AGE
consul-b5jv8 1/1 Running 0 3h21m
consul-connect-injector-webhook-deployment-77b574c5cc-mkw9s 0/1 CrashLoopBackOff 76 3h7m
consul-controller-5788b8f6c7-khs5f 1/1 Running 0 3h21m
consul-server-0 1/1 Running 0 3h21m
consul-webhook-cert-manager-5745cbb9d-ztpnv 1/1 Running 0 3h21m

Logs for this pod:
[root@home3vm3 deploy]# k logs consul-connect-injector-webhook-deployment-77b574c5cc-mkw9s
Listening on “:8080”…
Error loading TLS keypair: tls: failed to find any PEM data in certificate input
2021/07/28 04:22:37 http: TLS handshake error from 10.244.0.1:39872: No certificate available.
Error loading TLS keypair: tls: failed to find any PEM data in certificate input
2021/07/28 04:22:37 http: TLS handshake error from 10.244.0.1:39870: No certificate available.
terminated received, shutting down
Error listening: http: Server closed
E0728 04:22:38.066550 1 controller.go:124] error syncing cache
E0728 04:22:38.066560 1 controller.go:124] error syncing cache
2021-07-28T04:22:38.165Z [ERROR] healthCheckResource: unable to get pods: err=“Get “https://10.96.0.1:443/api/v1/pods?labelSelector=consul.hashicorp.com%2Fconnect-inject-status”: context canceled”
2021-07-28T04:22:38.165Z [INFO] healthCheckResource: received stop signal, shutting down
2021-07-28T04:22:38.265Z [ERROR] cleanupResource: unable to get nodes: error=“Get “https://10.96.0.1:443/api/v1/nodes”: context canceled”
2021-07-28T04:22:38.265Z [INFO] cleanupResource: received stop signal, shutting down

Describe for this pod:
[root@home3vm3 deploy]# k describe pod consul-connect-injector-webhook-deployment-77b574c5cc-mkw9s
Name: consul-connect-injector-webhook-deployment-77b574c5cc-mkw9s
Namespace: default
Priority: 0
Node: home3vm3/192.168.2.246
Start Time: Tue, 27 Jul 2021 18:13:46 -0700
Labels: app=consul
chart=consul-helm
component=connect-injector
pod-template-hash=77b574c5cc
release=consul
Annotations: consul.hashicorp.com/connect-inject: false
Status: Running
IP: 10.244.0.18
IPs:
IP: 10.244.0.18
Controlled By: ReplicaSet/consul-connect-injector-webhook-deployment-77b574c5cc
Containers:
sidecar-injector:
Container ID: docker://19540028abadc3e387087618f5604bb1800f08b985510b150f0e7b9dd75cf600
Image: hashicorp/consul-k8s:0.25.0
Image ID: docker-pullable://hashicorp/consul-k8s@sha256:66a1dfd964e9a8fe2477803462fd08cb83744a65f2b8083e1c51c580f6930c7d
Port:
Host Port:
Command:
/bin/sh
-ec
CONSUL_FULLNAME=“consul”

  consul-k8s inject-connect \
    -default-inject=true \
    -consul-image="hashicorp/consul:1.9.7" \
    -envoy-image="envoyproxy/envoy:v1.16.4" \
    -consul-k8s-image="hashicorp/consul-k8s:0.25.0" \
    -listen=:8080 \
    -log-level=info \
    -enable-health-checks-controller=true \
    -health-checks-reconcile-period=1m \
    -cleanup-controller-reconcile-period=5m \
    -default-enable-metrics=false \
    -default-enable-metrics-merging=false  \
    -default-merged-metrics-port=20100 \
    -default-prometheus-scrape-port=20200 \
    -default-prometheus-scrape-path="/metrics" \
    -allow-k8s-namespace="*" \
    -tls-auto=${CONSUL_FULLNAME}-connect-injector-cfg \
    -tls-auto-hosts=${CONSUL_FULLNAME}-connect-injector-svc,${CONSUL_FULLNAME}-connect-injector-svc.${NAMESPACE},${CONSUL_FULLNAME}-connect-injector-svc.${NAMESPACE}.svc \
    -init-container-memory-limit=150Mi \
    -init-container-memory-request=25Mi \
    -init-container-cpu-limit=50m \
    -init-container-cpu-request=50m \
    -consul-sidecar-memory-limit=50Mi \
    -consul-sidecar-memory-request=25Mi \
    -consul-sidecar-cpu-limit=20m \
    -consul-sidecar-cpu-request=20m \

State:          Waiting
  Reason:       CrashLoopBackOff
Last State:     Terminated
  Reason:       Completed
  Exit Code:    0
  Started:      Tue, 27 Jul 2021 21:22:31 -0700
  Finished:     Tue, 27 Jul 2021 21:22:38 -0700
Ready:          False
Restart Count:  78
Limits:
  cpu:     50m
  memory:  50Mi
Requests:
  cpu:      50m
  memory:   50Mi
Liveness:   http-get https://:8080/health/ready delay=1s timeout=5s period=2s #success=1 #failure=2
Readiness:  http-get https://:8080/health/ready delay=2s timeout=5s period=2s #success=1 #failure=2
Environment:
  NAMESPACE:         default (v1:metadata.namespace)
  HOST_IP:            (v1:status.hostIP)
  CONSUL_HTTP_ADDR:  http://$(HOST_IP):8500
Mounts:
  /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-942t7 (ro)

Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
kube-api-access-942t7:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional:
DownwardAPI: true
QoS Class: Guaranteed
Node-Selectors:
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message

Normal Started 50m (x60 over 3h9m) kubelet Started container sidecar-injector
Normal Pulled 44m (x61 over 3h10m) kubelet Container image “hashicorp/consul-k8s:0.25.0” already present on machine
Warning BackOff 4m50s (x844 over 3h9m) kubelet Back-off restarting failed container

I tried many things including changing the CNI (now it is flannel, but I had tried also with calico). I read in consul common errors page the issue probably is on CNI, but I could not figure out anything.
I will appreciate any tip on how to work on this, like how to enable more debug on this pod, or what other things to check.
Thank you,
zvik

zvik · August 4, 2021, 3:37am

Found the issue and solution, replying to myself as it may help others.
The problem is that the default value of readinessProbe initialDelaySeconds in chart is too low and the sidecar container keep being restarted.
Solution:
Increase the value of readinessProbe initialDelaySeconds from 2 to 10 or whatever value suits you. You may need to modify the same also for livenessProbe.

Example:
Template: connect-inject-deployment.yaml
readinessProbe:
httpGet:
path: /health/ready
port: 8080
scheme: HTTPS
failureThreshold: 2
initialDelaySeconds: 10 // I increased this one
periodSeconds: 2
successThreshold: 1
timeoutSeconds: 8 // I increased this one

broadaxe · April 9, 2022, 1:00am

Hi, where do I find/modify these values? I am just starting with consul/helm, and I do not see there in any of the yaml files in the specified directory… I do not see/find a yaml file called connect-inject-deployment.yaml anywhere in the repo.
Thanks.

zvik · April 9, 2022, 2:34am

Hi broadaxe,
Following is a putty session where I download the consul chart (the latest), and unzip it and grep where in chart (in what files) we can find the initialDelaySeconds and readinessProbeInitialDelay:

helm repo add hashicorp https://helm.releases.hashicorp.com

“hashicorp” has been added to your repositories

ls

consul-0.36.0.tgz

tar zxf consul-0.36.0.tgz

cd consul

grep -R initialDelaySeconds *

templates/connect-inject-deployment.yaml: initialDelaySeconds: 1
templates/connect-inject-deployment.yaml: initialDelaySeconds: 2
templates/ingress-gateways-deployment.yaml: initialDelaySeconds: 30
templates/ingress-gateways-deployment.yaml: initialDelaySeconds: 10
templates/mesh-gateway-deployment.yaml: initialDelaySeconds: 30
templates/mesh-gateway-deployment.yaml: initialDelaySeconds: 10
templates/prometheus.yaml: initialDelaySeconds: 0
templates/prometheus.yaml: initialDelaySeconds: 30
templates/server-statefulset.yaml: initialDelaySeconds: 5
templates/sync-catalog-deployment.yaml: initialDelaySeconds: 30
templates/sync-catalog-deployment.yaml: initialDelaySeconds: 10
templates/terminating-gateways-deployment.yaml: initialDelaySeconds: 30
templates/terminating-gateways-deployment.yaml: initialDelaySeconds: 10

grep -R readinessProbe *

addons/values/prometheus.yaml: readinessProbeInitialDelay: 0
templates/client-daemonset.yaml: readinessProbe:
templates/connect-inject-deployment.yaml: readinessProbe:
templates/ingress-gateways-deployment.yaml: readinessProbe:
templates/mesh-gateway-deployment.yaml: readinessProbe:
templates/prometheus.yaml: readinessProbe:
templates/server-statefulset.yaml: readinessProbe:
templates/sync-catalog-deployment.yaml: readinessProbe:
templates/terminating-gateways-deployment.yaml: readinessProbe:

Note that when I posted while ago I used an older version, but I see same values are still in latest version.
I hope it helps.
//zvik

lkysow · April 11, 2022, 6:39pm

@broadaxe are you seeing crashloopbackoff’s for the connect injector? Things have changed a lot since the original issue.

broadaxe · April 11, 2022, 10:45pm

Yes, I am. I am following the tutorial at Getting Started with Consul Service Mesh for Kubernetes | Consul - HashiCorp Learn and I am having somewhat random problems now. I was apparently able to increase the values for the delay|timeout on the connect injector, but it is not necessarily effective 100%. Now also the clients are having “Readiness probe failed” timeouts and there is no apparent way to change their delay|timeout, which is set by default to 0 and 1 seconds. The server never runs because it is apparently waiting for all other pods to come up.
Following solutions found elsewhere on the Internet I have also increased the size of the Kubernetes cluster(from 3 to 4), and have removed the master node taint. My cluster is now 4 VM nodes with 2 vCPUs and 12G/RAM. They are running in CentOS 7 fully patched and with the latest versions of everything. This is a test cluster for my own studying, and is not running anything else, so it should have no issues. This example is supposed to even run on a laptop with Minicube for what I understand. I am running full open source docker CE and Kubernetes.
Any insight would be appreciated.

zvik · April 12, 2022, 1:02am

Broadaxe,
The consul server should not wait for any other pods. BUT, it uses a pvc. So maybe the issue is with storage? Are all the k8s nodes
in ready state?
Have a look in the log of the consul server, often times it is helpful, or you can post it here.
//zvik

zvik · April 12, 2022, 2:44am

I forgot, check if you have in your cluster a default storage class defined. Check it with kubectl get sc, you should see the string "(default) " in one of the storage classes.
If you don’t see it, then you need to patch your sc to be default, see:

broadaxe · April 12, 2022, 4:32pm

That , I am missing. I would need to come up with some storage and define a storage class

broadaxe · April 14, 2022, 6:36pm

I am happy to report that my issues went away once I provisioned storage and made it default. I then redeployed the Consul Service Mesh with the added timeouts for the connect injector and Consul came up green.
Thanks for the info.

Topic		Replies	Views
All workload pods stuck in CrashLoopBackoff after installing the hashicorp/consul chart in TKGi based kubernetes Consul k8s	10	2178	May 20, 2021
Consul helm chart fails in a Tanzu kubernetes 1.17 env Consul	2	973	November 23, 2020
Start Consul Injector failed on GKE Consul	1	681	September 16, 2023
Consul not starting with helm in K8 Client Pods: Reason: BadRequest (400) Consul k8s	1	1365	September 13, 2021
Cant deploy new pods after enable connectInject Consul	1	183	September 26, 2024