I am using Consul 1.16.0 on Azure Kubernetes Service, without Nomad or Helm charts. When my microservices restart, there is a good chance that the old instance is not deregistered properly and remains as a zombie (e.g. in the image below, every instance except the last one is a zombie; only the last one is actually running). What I’m confused about is the inconsistency: when a pod restarts, the old instance sometimes deregisters from Consul cleanly, while other times it becomes a zombie.
This is not a case of Consul failing to detect that the service is in a critical state, since I can see the critical checks in my Consul logs:
2023-09-18T20:01:27.384Z [WARN] agent: Check is now critical: check=service:sr-ms-core-38d7a764f727f988124fa9052bc53a3d
2023-09-18T20:01:27.384Z [WARN] agent: Check is now critical: check=service:sr-ms-core-a44f3691468908a1a11661ecefe63234
2023-09-18T20:01:27.384Z [WARN] agent: Check is now critical: check=service:sr-ms-core-cb65441139767511a9d527d0a2fb07d3
...
Currently, my manual workaround is to kubectl exec into the Consul pods and run consul services deregister -id <service-id>, but this is very tedious and inefficient, so I’m looking for a better solution or workaround.
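For reference, the cleanup I do by hand amounts to roughly the sketch below, which talks to Consul’s agent HTTP API instead of using kubectl exec. The agent address list and the “deregister anything critical” rule are my own assumptions; /v1/agent/checks and /v1/agent/service/deregister are the standard agent endpoints.

import json
import urllib.request

# Hypothetical: one entry per Consul pod, reachable on the HTTP port (8500).
AGENT_ADDRS = ["http://consul-0.consul-headless.default.svc.cluster.local:8500"]

def cleanup(agent):
    # List all checks known to this agent, then drop every service whose
    # check is currently critical.
    with urllib.request.urlopen(f"{agent}/v1/agent/checks") as resp:
        checks = json.load(resp)
    for check in checks.values():
        if check["Status"] == "critical" and check.get("ServiceID"):
            req = urllib.request.Request(
                f"{agent}/v1/agent/service/deregister/{check['ServiceID']}",
                method="PUT",
            )
            urllib.request.urlopen(req)
            print("deregistered", check["ServiceID"], "on", agent)

for addr in AGENT_ADDRS:
    cleanup(addr)

I could run something like this on a schedule, but it feels like it would just paper over whatever is preventing clean deregistration in the first place.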
I was wondering if there is anything I’m missing in the Consul config in my manifests:
containers:
  - name: consul
    image: docker.io/bitnami/consul:1.16.0
    imagePullPolicy: "IfNotPresent"
    securityContext:
      allowPrivilegeEscalation: false
      runAsNonRoot: true
      runAsUser: 1001
    ports:
      - name: http
        containerPort: 8500
      - name: rpc
        containerPort: 8400
      - name: serflan-tcp
        protocol: "TCP"
        containerPort: 8301
      - name: serflan-udp
        containerPort: 8301
        protocol: "UDP"
      - name: rpc-server
        containerPort: 8300
      - name: dns-tcp
        containerPort: 8600
      - name: dns-udp
        containerPort: 8600
        protocol: "UDP"
    resources:
      requests:
        cpu: "100m"
        memory: "512Mi"
    env:
      - name: BITNAMI_DEBUG
        value: "false"
      - name: CONSUL_NODE_NAME
        valueFrom:
          fieldRef:
            fieldPath: metadata.name
      - name: CONSUL_RETRY_JOIN
        value: "consul-headless.default.svc.cluster.local"
      - name: CONSUL_DISABLE_KEYRING_FILE
        value: "true"
      - name: CONSUL_BOOTSTRAP_EXPECT
        value: "3"
      - name: CONSUL_RAFT_MULTIPLIER
        value: "1"
      - name: CONSUL_DOMAIN
        value: "consul"
      - name: CONSUL_DATACENTER
        value: "dc1"
      - name: CONSUL_UI
        value: "true"
      - name: CONSUL_HTTP_PORT_NUMBER
        value: "8500"
      - name: CONSUL_DNS_PORT_NUMBER
        value: "8600"
      - name: CONSUL_RPC_PORT_NUMBER
        value: "8400"
      - name: CONSUL_SERF_LAN_PORT_NUMBER
        value: "8301"
    envFrom:
    livenessProbe:
      exec:
        command:
          - consul
          - operator
          - raft
          - list-peers
      initialDelaySeconds: 30
      periodSeconds: 10
      timeoutSeconds: 5
      successThreshold: 1
      failureThreshold: 6
    readinessProbe:
      exec:
        command:
          - consul
          - members
      initialDelaySeconds: 5
      periodSeconds: 10
      timeoutSeconds: 5
      successThreshold: 1
      failureThreshold: 6
    lifecycle:
      preStop:
        exec:
          command:
            - consul
            - leave
    volumeMounts:
      - name: data
        mountPath: /bitnami/consul
I have also heard about the deregister_critical_service_after config option, but I’m not sure how to apply it in my Kubernetes manifests. (However, I would prefer a more elegant solution if one exists, because this option only removes a critical service after it has stayed critical for a configured amount of time.)
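From what I gather, this option is set on the check inside the service registration rather than anywhere in the Consul agent manifest above. If my services registered themselves through the agent HTTP API, I imagine it would look something like the sketch below; the service name, port, health endpoint, and timeout are placeholders, and the use of PUT /v1/agent/service/register is my assumption about how registration happens.

import json
import urllib.request

registration = {
    "ID": "sr-ms-core-example",     # hypothetical service instance ID
    "Name": "sr-ms-core",
    "Port": 8080,
    "Check": {
        "HTTP": "http://sr-ms-core:8080/health",  # placeholder health endpoint
        "Interval": "10s",
        # Remove this instance if its check stays critical for this long.
        "DeregisterCriticalServiceAfter": "1m",
    },
}

req = urllib.request.Request(
    "http://127.0.0.1:8500/v1/agent/service/register",
    data=json.dumps(registration).encode(),
    headers={"Content-Type": "application/json"},
    method="PUT",
)
urllib.request.urlopen(req)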