Zombie services not automatically deregistered

I am using Consul 1.16.0 on Azure Kubernetes Service, without Nomad or Helm charts. When my microservices restart, there is a good chance that the old instance is not deregistered properly and remains as a zombie (e.g., in the image below, all the services above are zombies while the last one is working). What I'm confused about is the inconsistency: when a pod restarts, the old instance sometimes manages to deregister from Consul, while other times it becomes a zombie.

This is not a problem of Consul failing to detect that the service is in a critical state, since I can see the critical checks in my Consul logs:

2023-09-18T20:01:27.384Z [WARN]  agent: Check is now critical: check=service:sr-ms-core-38d7a764f727f988124fa9052bc53a3d
2023-09-18T20:01:27.384Z [WARN]  agent: Check is now critical: check=service:sr-ms-core-a44f3691468908a1a11661ecefe63234
2023-09-18T20:01:27.384Z [WARN]  agent: Check is now critical: check=service:sr-ms-core-cb65441139767511a9d527d0a2fb07d3
...

Currently, my manual workaround is to kubectl exec into the Consul pods and run consul services deregister -id <service-id>, but this is tedious and inefficient, so I'm looking for a better solution or workaround.
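
For reference, this is roughly what the manual cleanup looks like. Just a sketch: the pod name consul-0 and the default namespace are placeholders for my setup, and the service ID is the part after service: in the check names from the warning logs above.

# Deregister a zombie instance by its service ID (not the service name);
# the ID here is taken from the "Check is now critical" log line above.
kubectl exec -n default consul-0 -- \
  consul services deregister -id=sr-ms-core-38d7a764f727f988124fa9052bc53a3d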

I was wondering if there is anything I’m missing in the Consul config in my manifests:

      containers:
        - name: consul
          image: docker.io/bitnami/consul:1.16.0
          imagePullPolicy: "IfNotPresent"
          securityContext:
            allowPrivilegeEscalation: false
            runAsNonRoot: true
            runAsUser: 1001
          ports:
            - name: http
              containerPort: 8500
            - name: rpc
              containerPort: 8400
            - name: serflan-tcp
              protocol: "TCP"
              containerPort: 8301
            - name: serflan-udp
              containerPort: 8301
              protocol: "UDP"
            - name: rpc-server
              containerPort: 8300
            - name: dns-tcp
              containerPort: 8600
            - name: dns-udp
              containerPort: 8600
              protocol: "UDP"
          resources:
            requests:
              cpu: "100m"
              memory: "512Mi"
          env:
            - name: BITNAMI_DEBUG
              value: "false"
            - name: CONSUL_NODE_NAME
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name
            - name: CONSUL_RETRY_JOIN
              value: "consul-headless.default.svc.cluster.local"
            - name: CONSUL_DISABLE_KEYRING_FILE
              value: "true"
            - name: CONSUL_BOOTSTRAP_EXPECT
              value: "3"
            - name: CONSUL_RAFT_MULTIPLIER
              value: "1"
            - name: CONSUL_DOMAIN
              value: "consul"
            - name: CONSUL_DATACENTER
              value: "dc1"
            - name: CONSUL_UI
              value: "true"
            - name: CONSUL_HTTP_PORT_NUMBER
              value: "8500"
            - name: CONSUL_DNS_PORT_NUMBER
              value: "8600"
            - name: CONSUL_RPC_PORT_NUMBER
              value: "8400"
            - name: CONSUL_SERF_LAN_PORT_NUMBER
              value: "8301"
          envFrom:
          livenessProbe:
            exec:
              command:
                - consul
                - operator
                - raft
                - list-peers
            initialDelaySeconds: 30
            periodSeconds: 10
            timeoutSeconds: 5
            successThreshold: 1
            failureThreshold: 6
          readinessProbe:
            exec:
              command:
                - consul
                - members
            initialDelaySeconds: 5
            periodSeconds: 10
            timeoutSeconds: 5
            successThreshold: 1
            failureThreshold: 6
          lifecycle:
            preStop:
              exec:
                command:
                  - consul
                  - leave
          volumeMounts:
            - name: data
              mountPath: /bitnami/consul

I have also heard about the deregister_critical_service_after config option, but I'm not sure how to apply it in my Kubernetes manifests. (That said, I would prefer a more elegant solution if one exists, because this option requires the critical service to remain registered for a certain period of time before being deregistered.)

Hi @ML72,

Welcome to the HashiCorp Forums!

Could you please share why you are not using the Consul-K8S Helm Chart for setting up the cluster?

Considering that you are not using the official Helm chart, providing more details on your service registration and deregistration workflow will help us better understand the issue.

If you want Consul to deregister the critical services, as you rightly identified, deregister_critical_service_after will be the right option to use.
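
For a plain (non-Helm) agent setup, that option goes inside the check block of the Consul service definition. A minimal sketch, assuming an HTTP health check; the service name, port, and health path below are placeholders, not taken from your setup:

{
  "service": {
    "name": "sr-ms-core",
    "port": 8080,
    "check": {
      "http": "http://localhost:8080/management/health",
      "interval": "10s",
      "deregister_critical_service_after": "30m"
    }
  }
}

With this, Consul removes the instance from the catalog after its check has been critical for 30 minutes.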

Hello @Ranjandas !

I am not using the Consul-K8S Helm Chart because my microservices are Spring Boot applications generated by JHipster, and I used the JHipster Kubernetes sub-generator to initialize my Kubernetes manifests. By default, the Consul configuration was provided without Helm, and I simply went along with it.

Regarding my service registration and deregistration workflow: my microservices expose health check endpoints that Consul calls to determine the status of each service. The actual registration/deregistration process is the default provided by Consul, as I haven't made any configuration changes for that.

If I were to use deregister_critical_service_after, how would I add it to my manifests?

Hi @ML72,

Unfortunately, I am not familiar with either of them, but I did a quick search and it looks like you will have to set spring.cloud.consul.discovery.health-check-critical-timeout in your Spring Cloud configuration.

spring.cloud.consul.discovery.health-check-critical-timeout: Timeout to deregister services critical for longer than timeout (e.g. 30m).
ref: https://docs.spring.io/spring-cloud-consul/docs/current/reference/html/appendix.html
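
In a typical Spring Cloud setup that would look something like this in application.yml (just a sketch; the 30m value is only an example):

spring:
  cloud:
    consul:
      discovery:
        # deregister this instance once its health check has been
        # critical for longer than this timeout
        health-check-critical-timeout: 30m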

I hope you know how to configure the above in your Spring Cloud setup and that this helps.

Thank you, I tried that and it seems to work equivalently to deregister_critical_service_after!

In my case, I am unable to deregister the service using that command either. I can see the service in the catalog, but I am unable to delete it:

Error deregistering service "": Unexpected response code: 404 (Unknown service ID "test-service". Ensure that the service ID is passed, not the service name.)

The service seems to be orphaned and I am unable to remove it.

How do I handle this situation?

I was finally able to remove the zombie services. These are the steps I followed:

  1. Log in to the Consul agent pod that runs on the same node as the Consul server.
  2. Execute the following command (a fuller sketch is below): curl -H "Authorization: Bearer <TOKEN>" --request PUT http://<consul-server-ip>:8500/v1/agent/service/deregister/<serviceId>
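
Roughly what I ran (a sketch; the token, server address, and service ID are placeholders, as in the command above):

# Find the exact ID of the orphaned instance; the /v1/agent/services
# response is keyed by service ID, not service name
curl -s -H "Authorization: Bearer <TOKEN>" \
  http://<consul-server-ip>:8500/v1/agent/services

# Then deregister it by that ID
curl -H "Authorization: Bearer <TOKEN>" --request PUT \
  http://<consul-server-ip>:8500/v1/agent/service/deregister/<serviceId>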

I was able to figure it out with the help of the script in Unable to deregister a service · Issue #1188 · hashicorp/consul · GitHub

I am now running into a similar issue again with an agentless deployment: Unable to remove duplicate or stale instance entries of a service in Consul catalog when Consul Connect inject enabled pod moves from one node to another. Currently running with an agentless setup. · Issue #4219 · hashicorp/consul-k8s · GitHub

Apart from deleting the node that no longer exists, what helps me is scaling the impacted service down to zero and back up again, which removes the duplicate or bad entries. These are only temporary fixes, though.
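
Roughly (the deployment name and replica count are placeholders):

# temporary fix: recreate every pod of the affected service
kubectl scale deployment <impacted-service> --replicas=0
kubectl scale deployment <impacted-service> --replicas=<original-count>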

Below is how I can consistently reproduce the issue (rough commands follow the list):

  1. Pod A is running on node A.
  2. Cordon node A.
  3. Let pod A get rescheduled onto node B.
  4. This leaves two entries for the instance in the Consul catalog, i.e. the pod IPs of both the old and the new pod A, where the health of the new pod A flips between healthy and unhealthy and the old pod A entry is always unhealthy.
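
Rough commands for the repro above (node and pod names are placeholders; deleting the pod is just one way to force it onto another node):

kubectl cordon <node-A>
kubectl delete pod <pod-A>    # the replacement lands on node B since node A is cordoned
kubectl get pods -o wide      # confirm the new pod A is running on node B
# The catalog now shows two instances: the old pod IP (always unhealthy)
# and the new pod IP (flapping between healthy and unhealthy).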

This is crazy.