Unhealthy pods are not removed from consul on k8s

nflaig · December 4, 2020, 10:22am

Hi,

We are using the latest version of the consul helm chart with service health checks enabled, but unhealthy or even non-existent services/pods are still shown in the Consul UI. The Serf Health Status is passing on pods that do not even exist. Maybe it is a misconfiguration on our side, but I can’t seem to find the issue.

Any ideas how to resolve this issue?

Thanks and best regards,
Nico

kschoche · December 9, 2020, 5:24pm

Hi Nico, thank you for raising this question!

Could you expand a bit more on your configuration and your expectations?

The current design of the health checks system will only address connect-injected pods, and the way it works is by registering a (new) health check with Consul which reflects the k8s readiness status for that pod.
When a k8s readiness probe fails for that pod, the (new) health check in Consul will be marked failed (critical), which will subsequently cause the service instance’s health to be marked critical. This will not modify the serf health check, which could still be passing. However, since the service instance’s health is critical, it will no longer participate in service mesh traffic for that service.
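To illustrate why a passing serf check does not keep an instance in rotation, here is a minimal Python sketch. It assumes data shaped like the response of Consul's `/v1/health/service/<name>` endpoint, trimmed to the fields used; the service IDs and the `kubernetes-health-check` check IDs are hypothetical sample values, not taken from a real cluster.

```python
# Hypothetical sample, shaped like a trimmed /v1/health/service/<name> response.
SAMPLE_ENTRIES = [
    {
        "Service": {"ID": "web-1"},
        "Checks": [
            {"CheckID": "serfHealth", "Status": "passing"},
            {"CheckID": "web-1/kubernetes-health-check", "Status": "passing"},
        ],
    },
    {
        "Service": {"ID": "web-2"},
        "Checks": [
            # The node-level serf check is still passing...
            {"CheckID": "serfHealth", "Status": "passing"},
            # ...but the readiness-backed check (assumed ID) is critical.
            {"CheckID": "web-2/kubernetes-health-check", "Status": "critical"},
        ],
    },
]


def is_passing(entry):
    """An instance counts as healthy only if every attached check is passing."""
    return all(check["Status"] == "passing" for check in entry["Checks"])


def passing_service_ids(entries):
    """Return the IDs of the service instances that would stay in rotation."""
    return [entry["Service"]["ID"] for entry in entries if is_passing(entry)]


print(passing_service_ids(SAMPLE_ENTRIES))
```

In this sample only `web-1` is returned: `web-2` is filtered out because of its critical check, even though its serf check is passing.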

There is a known issue where, in rare cases, k8s terminates a pod in a way that prevents its preStop hook from executing. That hook is where we deregister the service and the health check, so this can leave stray services in your UI, but it is a pretty rare corner case.
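As a rough illustration of how such stray registrations could be spotted, the Python sketch below compares registered service instances against the pods that currently exist. It is not part of consul-k8s. It assumes each registration carries a `pod-name` entry in its `Meta` map and that the list of live pod names is obtained separately (for example from the k8s API); the function name and the sample data are hypothetical. Anything it flags could then be deregistered on the agent that holds it via Consul's `/v1/agent/service/deregister/<service_id>` endpoint.

```python
def stale_service_ids(registered, live_pod_names):
    """Return IDs of registrations whose pod no longer exists.

    registered: list of dicts with "ID" and "Meta", e.g. the values of an
                agent's /v1/agent/services response.
    live_pod_names: iterable of pod names that currently exist in k8s.
    """
    live = set(live_pod_names)
    stale = []
    for service in registered:
        # Assumption: the pod name is stored under the "pod-name" meta key.
        pod_name = (service.get("Meta") or {}).get("pod-name")
        # Registrations without a pod name are left alone.
        if pod_name is not None and pod_name not in live:
            stale.append(service["ID"])
    return stale


# Hypothetical sample data: one pod was replaced during an upgrade.
registered = [
    {"ID": "web-old-web", "Meta": {"pod-name": "web-old"}},
    {"ID": "web-new-web", "Meta": {"pod-name": "web-new"}},
    {"ID": "external-db", "Meta": None},
]
print(stale_service_ids(registered, ["web-new"]))
```

Registrations without the assumed `pod-name` key are skipped on purpose, so services registered by other means are never flagged.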

The health checking system will not in any circumstance remove or restart the pod, but you could enforce this behaviour by using a liveness probe in k8s for your pods.

I hope this helps clear up things a bit, do let me know if you have other questions!

nflaig · December 9, 2020, 6:53pm

Hi @kschoche,

thanks for the elaborate answer.

To be honest the consul configuration is pretty basic, I don’t think anything there causes the problem.

My expectation is that if a pod does not even exist it should be removed from the UI, at least after a few days.

I think the services are not deregistered correctly but I am not sure why it happens. Could this be caused by a too low termination grace period (10-15s)?

Also, as a side note, this issue happens when pods are updated via helm upgrade: consul adds the new pods/services but does not remove the old ones. What is interesting, though, is that this did not cause any issues. Maybe it is just a UI bug?

kschoche · December 10, 2020, 7:18pm

Hi @nflaig!

I believe what you’re referencing is probably a bug with the pods not going away in the UI.
If you have exact steps to reproduce it I’d be glad to take a look; otherwise I would recommend you open a bug with consul-k8s and we’ll get started on it.
I believe I’ve seen other reports of a similar issue but we have yet to reproduce it in house, so having exact steps would be very helpful!

cheers

nflaig · December 11, 2020, 9:01am

Hi @kschoche

I will look into the issue further, but I am not sure myself how to reproduce it. We didn’t have the issue previously and the configuration did not change. I need to trace back what exactly changed; I will let you know if I find something and create an issue on GitHub.
