I am using Consul 1.16.0 on Azure Kubernetes Service, without Nomad or Helm charts. When my microservices restart, there is a good chance that the old instance is not deregistered properly and remains as a zombie (e.g. in the image below, every service except the last one is a zombie, and only the last one is working). What I’m confused about is the inconsistency: when a pod restarts, the old instance sometimes deregisters from Consul cleanly, while other times it becomes a zombie.
This is not a problem with Consul failing to detect that the service is in a critical state, since I can see the critical services in my Consul logs:
```
2023-09-18T20:01:27.384Z [WARN] agent: Check is now critical: check=service:sr-ms-core-38d7a764f727f988124fa9052bc53a3d
2023-09-18T20:01:27.384Z [WARN] agent: Check is now critical: check=service:sr-ms-core-a44f3691468908a1a11661ecefe63234
2023-09-18T20:01:27.384Z [WARN] agent: Check is now critical: check=service:sr-ms-core-cb65441139767511a9d527d0a2fb07d3
...
```
Currently, my manual workaround is to `kubectl exec` into the Consul pods and run `consul services deregister -id <service-id>`, but this is very tedious and inefficient, so I’m looking for a better solution or workaround.
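What I do by hand could probably be scripted along the lines below, but it still feels like a hack rather than a fix. This is only a rough sketch: it assumes it is run inside a Consul pod (e.g. via `kubectl exec`), that the agent’s HTTP API listens on 127.0.0.1:8500, and that `curl` and `jq` are available in the image.

```sh
# Rough cleanup sketch (not an official tool): deregister every service
# instance on this agent whose check is currently critical.
curl -s http://127.0.0.1:8500/v1/agent/checks \
  | jq -r '.[] | select(.Status == "critical") | .ServiceID' \
  | sort -u \
  | while read -r sid; do
      # Skip node-level checks that are not tied to a service instance.
      [ -n "$sid" ] && consul services deregister -id="$sid"
    done
```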
I was wondering if there is anything I’m missing in the Consul config in my manifests:
```yaml
containers:
  - name: consul
    image: docker.io/bitnami/consul:1.16.0
    imagePullPolicy: "IfNotPresent"
    securityContext:
      allowPrivilegeEscalation: false
      runAsNonRoot: true
      runAsUser: 1001
    ports:
      - name: http
        containerPort: 8500
      - name: rpc
        containerPort: 8400
      - name: serflan-tcp
        protocol: "TCP"
        containerPort: 8301
      - name: serflan-udp
        containerPort: 8301
        protocol: "UDP"
      - name: rpc-server
        containerPort: 8300
      - name: dns-tcp
        containerPort: 8600
      - name: dns-udp
        containerPort: 8600
        protocol: "UDP"
    resources:
      requests:
        cpu: "100m"
        memory: "512Mi"
    env:
      - name: BITNAMI_DEBUG
        value: "false"
      - name: CONSUL_NODE_NAME
        valueFrom:
          fieldRef:
            fieldPath: metadata.name
      - name: CONSUL_RETRY_JOIN
        value: "consul-headless.default.svc.cluster.local"
      - name: CONSUL_DISABLE_KEYRING_FILE
        value: "true"
      - name: CONSUL_BOOTSTRAP_EXPECT
        value: "3"
      - name: CONSUL_RAFT_MULTIPLIER
        value: "1"
      - name: CONSUL_DOMAIN
        value: "consul"
      - name: CONSUL_DATACENTER
        value: "dc1"
      - name: CONSUL_UI
        value: "true"
      - name: CONSUL_HTTP_PORT_NUMBER
        value: "8500"
      - name: CONSUL_DNS_PORT_NUMBER
        value: "8600"
      - name: CONSUL_RPC_PORT_NUMBER
        value: "8400"
      - name: CONSUL_SERF_LAN_PORT_NUMBER
        value: "8301"
    envFrom:
    livenessProbe:
      exec:
        command:
          - consul
          - operator
          - raft
          - list-peers
      initialDelaySeconds: 30
      periodSeconds: 10
      timeoutSeconds: 5
      successThreshold: 1
      failureThreshold: 6
    readinessProbe:
      exec:
        command:
          - consul
          - members
      initialDelaySeconds: 5
      periodSeconds: 10
      timeoutSeconds: 5
      successThreshold: 1
      failureThreshold: 6
    lifecycle:
      preStop:
        exec:
          command:
            - consul
            - leave
    volumeMounts:
      - name: data
        mountPath: /bitnami/consul
```
I have also heard about the `deregister_critical_service_after` config option, but I’m not sure how to apply it in my Kubernetes manifests. (Even so, I would prefer a more elegant solution if one exists, because with this option the zombie instance still lingers for the configured period before Consul deregisters it.)
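For context, my understanding (please correct me if I’m wrong) is that this option belongs in the check definition each service instance registers with its local agent, not in the Consul server manifest itself. A hedged example of what I think that would look like, assuming registration goes through the agent HTTP API and using a made-up service name and health endpoint:

```sh
# Hypothetical registration sketch: the service instance registers itself with
# the local agent and asks Consul to deregister it automatically once its
# health check has been critical for 10 minutes.
curl -s -X PUT http://127.0.0.1:8500/v1/agent/service/register \
  -H 'Content-Type: application/json' \
  -d '{
        "ID": "sr-ms-core-example",
        "Name": "sr-ms-core",
        "Port": 8080,
        "Check": {
          "HTTP": "http://127.0.0.1:8080/health",
          "Interval": "10s",
          "DeregisterCriticalServiceAfter": "10m"
        }
      }'
```

If my services should instead be setting this in whatever registration path they already use, that is exactly the part I’m unsure about.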