Unable to remove duplicate or stale instance entries of a service in the Consul catalog when a pod moves to a different node

After upgrading Consul from version 1.14.10 to 1.16.6 with an agentless setup, I noticed duplicate entries for the same instance under a service. I am unable to remove them, and one of the entries is orphaned, pointing to a pod that is no longer running in the cluster.

How can I resolve this?

Apart from deleting the node that no longer exists, the only thing that helps is scaling the affected service down to zero and back up again, which removes the duplicate or bad entries. These are only temporary fixes.
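For reference, removing a stale entry by hand goes through Consul's catalog deregister endpoint (`PUT /v1/catalog/deregister`). The following is only a minimal sketch of building such a request, not a recommended fix. The address, node name, service ID, and the helper names `build_deregister_request` and `send` are hypothetical placeholders, and a token is only needed if ACLs are enabled.

```python
import json
import urllib.request


def build_deregister_request(consul_addr, node, service_id=None, token=None):
    """Build (but do not send) a PUT /v1/catalog/deregister request.

    With only a node, Consul removes the node and everything registered on it.
    With a service_id as well, only that service instance is removed.
    """
    payload = {"Node": node}
    if service_id is not None:
        payload["ServiceID"] = service_id

    headers = {"Content-Type": "application/json"}
    if token:
        # Only needed when ACLs are enabled on the cluster.
        headers["X-Consul-Token"] = token

    return urllib.request.Request(
        url=consul_addr.rstrip("/") + "/v1/catalog/deregister",
        data=json.dumps(payload).encode("utf-8"),
        headers=headers,
        method="PUT",
    )


def send(request, timeout=10):
    """Send the request and report whether Consul answered with HTTP 200."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return resp.status == 200
```

Calling `send(build_deregister_request("http://127.0.0.1:8500", "old-node", "stale-service-id"))` would attempt the removal. As noted above, this is still a temporary fix, since it does not address why the entry was left behind.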

Below is how I can consistently reproduce the issue:

  1. Pod A is running on node A.
  2. Cordon node A.
  3. Let pod A be scheduled onto node B.
  4. This leaves two entries for the instance in the Consul catalog, one with the IP of the old pod A and one with the IP of the new pod A. The health of the new pod A entry flips between healthy and unhealthy, and the old pod A entry is always unhealthy.
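A rough way to spot such leftovers is to group the catalog entries for a service by service ID and flag any ID that shows up with more than one node and address pair. The sketch below assumes the input is the JSON list returned by `GET /v1/catalog/service/<name>` (which carries `Node`, `ServiceID`, and `ServiceAddress` fields) and that the duplicate shares its service ID with the live instance. That second point is an assumption, not something confirmed in this thread, and the function name is hypothetical.

```python
from collections import defaultdict


def find_duplicate_instances(catalog_entries):
    """Map each ServiceID to its (Node, ServiceAddress) pairs when it has more than one.

    catalog_entries is expected to look like the decoded JSON list from
    GET /v1/catalog/service/<name>.
    """
    locations = defaultdict(set)
    for entry in catalog_entries:
        locations[entry["ServiceID"]].add((entry["Node"], entry["ServiceAddress"]))

    # Keep only service IDs registered in more than one place.
    return {
        service_id: sorted(pairs)
        for service_id, pairs in locations.items()
        if len(pairs) > 1
    }
```

An empty result means no service ID is registered in more than one place. If the duplicates carry different service IDs in a given cluster, grouping by another field would be needed instead.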

This is Crazy

Hi @magesh.srinivasulu,

What version of Consul-K8S are you running?

The Endpoints Controller registers and deregisters services into Consul when pods (endpoints) are created and torn down. It runs inside the connect-injector pod, a Consul-K8S component. The connect-injector pod logs can tell you why the deregistration failed.
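If the logs are long, a small filter can pull out just the deregistration failures. This is a sketch only: it assumes the controller logs failures with the text `failed to deregister endpoints` and reports missing services as `Unknown service ID '...' for check ID '...'`. The function name is hypothetical, and the pattern accepts both straight and curly single quotes in case the log was pasted through a forum editor.

```python
import re

# Matches: Unknown service ID '<id>' for check ID '<check>'
UNKNOWN_SERVICE = re.compile(
    r"Unknown service ID [‘'](?P<service_id>.+?)[’'] "
    r"for check ID [‘'](?P<check_id>.+?)[’']"
)


def unknown_service_ids(log_lines):
    """Return (service_id, check_id) pairs from failed-deregistration log lines."""
    found = []
    for line in log_lines:
        if "failed to deregister endpoints" not in line:
            continue  # not a deregistration failure
        for match in UNKNOWN_SERVICE.finditer(line):
            found.append((match.group("service_id"), match.group("check_id")))
    return found
```

Feeding it the saved connect-injector log, one line per list element, gives the service IDs that Consul reported as unknown, which can then be compared against what is actually in the catalog.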

@Ranjandas I am using this image consul-k8s-control-plane:1.2.9

Will check the connect-inject pod logs

@Ranjandas Below is what I found in the connect-inject logs. I have only masked the actual service name. The only change we have made is switching Consul to agentless.

2024-08-01T01:06:32.083Z ERROR controller.endpoints failed to deregister endpoints {"name": "SERVICE", "ns": "NAMESPACE", "error": "2 errors occurred:\n\t* failed to update service health status for pod NAMESPACE/POD to critical: Unexpected response code: 500 (rpc error making call: Unknown service ID 'SERVICE ID' for check ID 'NAMESPACE/SERVICE ID')\n\t* failed to update service health status for pod NAMESPACE/POD to critical: Unexpected response code: 500 (rpc error making call: Unknown service ID 'SERVICE ID-sidecar-proxy' for check ID 'NAMESPACE/SERVICE ID-sidecar-proxy')\n\n"

What does this mean?

Unknown service ID 'SERVICE ID' for check ID 'NAMESPACE/SERVICE ID'

@Ranjandas This is what I found while looking for a working version of Consul when upgrading from 1.14.10 to 1.16.6:

The nearest working version is 1.15.9. Every version from 1.15.10 to 1.16.6 has one issue or another; they are not stable, and the results are not consistent.

The issue mentioned below is predominant in the 1.16 release:

hashicorp/consul#19717