All workload pods stuck in CrashLoopBackOff after installing the hashicorp/consul chart on TKGi-based Kubernetes

This is a single-node/host cluster. Why are the consul-connect-inject-init containers in workload pods trying to register with $HOST_IP:8500? They are getting "Connection refused" and all are stuck in Init:CrashLoopBackOff. The consul-connect-inject-init container is logging the message "Error registering service: Put \"http://10.115.3.5:8500/v1/agent/service/register\": dial tcp 10.115.3.5:8500: connect: connection refused."

10.115.3.5 is the IP of the K8s cluster host node.
All helm chart resources are installed and running.

Hi

Did you set the following values for the servers' bootstrap? If not, the default is 3 replicas for the server StatefulSet with bootstrapExpect set to 3, so Raft cannot form a quorum on a single node because the last two server pods stay Pending:

server:
  replicas: 2
  connect: true
  service:
    enabled: true
  bootstrapExpect: 2
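
If it helps, a minimal sketch of applying those overrides (assuming the release is named consul and the overrides are saved in a local values.yaml):

helm upgrade --install consul hashicorp/consul -f values.yaml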

Hope it helps

Hi @drfooser,

The consul-connect-inject-init container is trying to connect to the local client agent so that it can register itself with Consul. See the architecture section of Installing Consul on Kubernetes for more info about the client and server agents which are deployed.
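
For context on where that host IP comes from: the injector normally passes the node's IP into the init container through a downward-API environment variable and builds the agent address from it. You can inspect the injected spec with something like the following (a sketch; the exact env layout can vary by consul-k8s version):

kubectl get pod <workload-pod> -o jsonpath='{.spec.initContainers[?(@.name=="consul-connect-inject-init")].env}'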

Can you check the Consul client logs with the following command to see if there are errors, or any other indication as to why it may not be listening on port 8500?

kubectl logs --selector="app=consul,component=client"

Blake, thanks for your help. See my notes below.

After helm install, all pods are running. The client logs state…

dillon FMY >kubectl logs --selector="app=consul,component=client"

2020-12-01T20:09:53.330Z [WARN] agent.router.manager: No servers available
2020-12-01T20:09:53.330Z [ERROR] agent.http: Request error: method=GET url=/v1/status/leader from=127.0.0.1:50720 error="No known Consul servers"
2020-12-01T20:09:54.390Z [WARN] agent.router.manager: No servers available
2020-12-01T20:09:54.390Z [ERROR] agent.anti_entropy: failed to sync remote state: error="No known Consul servers"
2020-12-01T20:09:58.806Z [INFO] agent: (LAN) joining: lan_addresses=[consul-server-0.consul-server.default.svc]
2020-12-01T20:09:58.820Z [INFO] agent.client.serf.lan: serf: EventMemberJoin: consul-server-0 10.114.25.12
2020-12-01T20:09:58.820Z [INFO] agent: (LAN) joined: number_of_nodes=1
2020-12-01T20:09:58.820Z [INFO] agent: Join cluster completed. Synced with initial agents: cluster=LAN num_agents=1
2020-12-01T20:09:58.820Z [INFO] agent.client: adding server: server="consul-server-0 (Addr: tcp/10.114.25.12:8300) (DC: dc1)"
2020-12-01T20:10:00.863Z [INFO] agent: Synced node info

Then I install a simple utility deployment …

dillon FMY >k apply -f ~/k8s-yaml/util-deployment-sidecar.yaml
deployment.apps/util-sidecar created

Deployment is failing at init

dillon FMY >k get pods

NAME                                                          READY   STATUS       RESTARTS   AGE
consul-676vm                                                  1/1     Running      0          32m
consul-connect-injector-webhook-deployment-6ddc4cfc85-qztf4   1/1     Running      0          32m
consul-controller-5d887d5bf-6ml9b                             1/1     Running      0          32m
consul-server-0                                               1/1     Running      0          32m
consul-webhook-cert-manager-5d588db7bb-jz7lv                  1/1     Running      0          32m
util-nosidecar-788df87b75-s6l2p                               1/1     Running      0          8d
util-sidecar-5f98688568-7jjb6                                 0/3     Init:Error   1          10s

The consul-connect-inject-init container logs state…

dillon FMY >k logs util-sidecar-5f98688568-7jjb6 -c consul-connect-inject-init
Error registering service "util": Put "http://10.115.3.5:8500/v1/agent/service/register": dial tcp 10.115.3.5:8500: connect: connection refused

All other container logs state…

dillon FMY >k logs util-sidecar-5f98688568-7jjb6 -c consul-connect-lifecycle-sidecar
Error from server (BadRequest): container "consul-connect-lifecycle-sidecar" in pod "util-sidecar-5f98688568-7jjb6" is waiting to start: PodInitializing
dillon FMY >k logs util-sidecar-5f98688568-7jjb6 -c consul-connect-envoy-sidecar
Error from server (BadRequest): container "consul-connect-envoy-sidecar" in pod "util-sidecar-5f98688568-7jjb6" is waiting to start: PodInitializing
dillon FMY >k logs util-sidecar-5f98688568-7jjb6 -c util
Error from server (BadRequest): container "util" in pod "util-sidecar-5f98688568-7jjb6" is waiting to start: PodInitializing

Installed services…

dillon FMY >k get services -A
NAMESPACE     NAME                          TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)                                                                    AGE
default       consul-connect-injector-svc   ClusterIP   10.100.200.15    <none>        443/TCP                                                                    41m
default       consul-controller-webhook     ClusterIP   10.100.200.40    <none>        443/TCP                                                                    41m
default       consul-dns                    ClusterIP   10.100.200.35    <none>        53/TCP,53/UDP                                                              41m
default       consul-server                 ClusterIP   None             <none>        8500/TCP,8301/TCP,8301/UDP,8302/TCP,8302/UDP,8300/TCP,8600/TCP,8600/UDP   41m
default       consul-ui                     ClusterIP   10.100.200.116   <none>        80/TCP                                                                     41m
default       frontend                      ClusterIP   10.100.200.208   <none>        80/TCP                                                                     7d22h
default       kubernetes                    ClusterIP   10.100.200.1     <none>        443/TCP                                                                    11d
default       postgres                      ClusterIP   10.100.200.83    <none>        5432/TCP                                                                   7d22h
default       product-api                   ClusterIP   10.100.200.161   <none>        9090/TCP                                                                   7d22h
default       public-api                    ClusterIP   10.100.200.68    <none>        8080/TCP                                                                   7d22h
kube-system   kube-dns                      ClusterIP   10.100.200.2     <none>        53/UDP,53/TCP                                                              11d
kube-system   metrics-server                ClusterIP   10.100.200.205   <none>        443/TCP                                                                    11d
kube-system   tiller-deploy                 ClusterIP   10.100.200.149   <none>        44134/TCP                                                                  11d
pks-system    fluent-bit                    ClusterIP   10.100.200.73    <none>        24224/TCP                                                                  11d
pks-system    node-exporter                 ClusterIP   10.100.200.123   <none>        10536/TCP                                                                  11d
pks-system    validator                     ClusterIP   10.100.200.4     <none>        443/TCP                                                                    11d
dillon FMY >

I’m stuck!

@drfooser,

The Consul client (which runs as a DaemonSet and uses a hostPort) does not appear to be reachable on the node's IP from pods running on that node.

Can you verify that the client pod is listening on port 8500, and that it responds to HTTP requests issued from within a pod on that node?
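
One way to test this without exec'ing into the still-initializing workload pod is to run the check from the client pod itself, pointing the Consul CLI at the node IP the init container is using (a rough sketch using the pod name and node IP from your output):

kubectl exec consul-676vm -- consul members -http-addr=http://10.115.3.5:8500

If the hostPort is not working, this should fail with the same "connection refused" the init container reports; if it succeeds, the problem is elsewhere.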

Thanks Blake,
This reveals the problem. We are using NSX-T, which does not support hostPort. It does support NodePort.

I am thinking I could create a NodePort service for the client DaemonSet and point the init containers at that.
Can you share your thoughts about that?
Do you have a better suggestion?

Hello Blake,

As I stated, NSX-T does not support hostPort. It does support nodePort.

I can build a NodePort service to expose a port on the node and proxy traffic to the clients, but the default port range for NodePort services (set by the --service-node-port-range flag on the K8s control plane) is 30000-32767, and changing it to include 8500 might not be permitted. So instead I'd like to change the port that the consul-connect-inject-init container targets when registering with the client agent.
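
As a sanity check, something like the following server-side dry-run should show whether the API server would even accept 8500 as a node port (the service name is just a placeholder):

kubectl create service nodeport consul-client-test --tcp=8500:8500 --node-port=8500 --dry-run=server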

Looking through the helm chart templates, I’m having a hard time understanding how to change that port for the init containers.

Can you help?

Regards, and thanks for your assistance.

Paul

Blake,

I finally settled on the hostNetwork configuration option. That exposes 8500 on the node's IP.
Including 8500 in the service-node-port-range seemed like a stretch, so we abandoned the NodePort idea.
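
For anyone hitting the same thing, the relevant knob appears to be the client hostNetwork setting in the Helm chart; roughly, assuming the chart version in use exposes client.hostNetwork (verify with helm show values hashicorp/consul first):

helm upgrade consul hashicorp/consul --reuse-values --set client.hostNetwork=true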

Thanks,
Paul

Glad to hear it! NodePort wouldn't work because the request could be routed to any Consul client, whereas it must reach the Consul client running on the same node as the pod.

Well, I'd like your opinion on a workaround for that, because I'm getting pushback from security about using hostNetwork.

If I create a NodePort service for each individual client, with a selector targeting only that client, it seems like I could in theory link all the proxies on node A to the client on node A through a NodePort service whose labels and selector are scoped to node A.

Thoughts?

Hi @drfooser, I was looking at some of the NSX-T CNI release notes, and it looks like enabling hostPort is an option for you on NSX-T if you are on Ubuntu. However, this issue is still outstanding on RHEL/CentOS. Are you also on RHEL/CentOS by chance?

Issue 2697547: HostPort not supported on RHEL/CentOS/RHCOS nodes
You can specify hostPorts on native Kubernetes and PKS on Ubuntu nodes by setting 'enable_hostport_snat' to True in nsx-node-agent ConfigMap. However, on RHEL/CentOS/RHCOS nodes hostPort is not supported and the parameter 'enable_hostport_snat' is ignored.
Workaround: None
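
If you are on Ubuntu nodes and want to try that route, the switch lives in the nsx-node-agent ConfigMap; roughly (the ConfigMap name and namespace below are placeholders, check your NCP install for the actual names):

kubectl -n nsx-system edit configmap nsx-node-agent
# set enable_hostport_snat = True in the node agent configuration, then restart the nsx-node-agent pods so they pick up the change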