Unhealthy pods are not removed from consul on k8s

Hi,

we are using the latest version of the consul helm chart with service health checks enabled, but unhealthy or even non-existent services/pods are still shown in the consul UI. The Serf Health Status is passing for pods that do not even exist. Maybe it is a misconfiguration on our side, but I can’t seem to find the issue.

Any ideas how to resolve this issue?

Thanks and best regards,
Nico

Hi Nico, thank you for raising this question!

Could you expand a bit more on your configuration and your expectations?

The current design of the health checks system will only address connect-injected pods, and the way it works is by registering a (new) health check with Consul which reflects the k8s readiness status for that pod.
When a k8s readiness probe fails for that pod, the (new) health check in Consul will be marked failed (critical), which will subsequently cause the service instance’s health to be marked critical. This will not modify the serf health check, which could still be passing. However, since the service instance’s health is critical, it will no longer participate in service mesh traffic for that service.
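
As a rough illustration of the distinction above, here is a minimal Python sketch that takes one entry shaped like the output of Consul's `/v1/health/service/<name>` endpoint and reports the node-level serf check separately from the instance's overall status. The sample entry, its check name and its IDs are made up for illustration; only the `Checks`, `CheckID` and `Status` fields and the `serfHealth` check ID come from Consul itself.

```python
from typing import Optional

SERF_CHECK_ID = "serfHealth"  # ID Consul uses for the node-level serf check


def serf_status(entry: dict) -> Optional[str]:
    """Return the status of the serf check, or None if it is absent."""
    for check in entry.get("Checks", []):
        if check.get("CheckID") == SERF_CHECK_ID:
            return check["Status"]
    return None


def aggregate_status(entry: dict) -> str:
    """Worst status across all checks: critical > warning > passing."""
    statuses = {check["Status"] for check in entry.get("Checks", [])}
    if "critical" in statuses:
        return "critical"
    if "warning" in statuses:
        return "warning"
    return "passing"


# Hypothetical entry: the serf check passes while the check that
# mirrors the pod's readiness status is critical.
SAMPLE_ENTRY = {
    "Service": {"ID": "web-abc-web", "Service": "web"},
    "Checks": [
        {"CheckID": "serfHealth", "Status": "passing"},
        {"CheckID": "example-readiness-check", "Status": "critical"},
    ],
}

print("serf:", serf_status(SAMPLE_ENTRY))
print("instance:", aggregate_status(SAMPLE_ENTRY))
```

So a passing Serf Health Status on its own says nothing about whether the service instance is considered healthy.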

There is a known issue where, in rare cases, k8s terminates a pod in a way that does not execute its preStop hook, which is where we deregister the service and the health check. This can leave stray services in your UI, but it is a pretty rare corner case.
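
To check whether stray registrations of this kind exist, one option is to compare what a Consul agent has registered against the pods that are actually running. The sketch below is only that: it assumes each registration carries the pod name under a `pod-name` key in its `Meta` map (this may differ between consul-k8s versions, so verify it against your own registrations), and the function name and sample data are hypothetical.

```python
from typing import Iterable, List

# Assumed meta key linking a registration to its pod; verify for your setup.
POD_NAME_META_KEY = "pod-name"


def find_stale_service_ids(registered: Iterable[dict],
                           live_pods: Iterable[str]) -> List[str]:
    """Return IDs of registrations whose pod is no longer running.

    Registrations without the pod-name meta key are skipped, since
    they cannot be matched to a pod at all.
    """
    live = set(live_pods)
    stale = []
    for svc in registered:
        pod = (svc.get("Meta") or {}).get(POD_NAME_META_KEY)
        if pod is not None and pod not in live:
            stale.append(svc["ID"])
    return sorted(stale)


# Hypothetical data: values as returned by /v1/agent/services,
# and pod names as listed by kubectl.
registered = [
    {"ID": "web-old-web", "Meta": {"pod-name": "web-old"}},
    {"ID": "web-new-web", "Meta": {"pod-name": "web-new"}},
    {"ID": "manual-svc", "Meta": {}},
]
live_pods = ["web-new"]

print(find_stale_service_ids(registered, live_pods))
```

Any ID this reports could then be removed with a `PUT` to `/v1/agent/service/deregister/<service_id>` on the agent that holds the registration.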

The health checking system will not under any circumstances remove or restart the pod, but you could enforce this behaviour by using a liveness probe in k8s for your pods.

I hope this helps clear up things a bit, do let me know if you have other questions!

Hi @kschoche1,

thanks for the elaborate answer.

To be honest, the consul configuration is pretty basic, and I don’t think anything there causes the problem.

My expectation is that if a pod does not even exist it should be removed from the UI, at least after a few days.

I think the services are not deregistered correctly, but I am not sure why it happens. Could this be caused by a termination grace period that is too low (10-15s)?

Also, as a side note, this issue happens when pods are updated via helm upgrade: consul adds the new pods/services but does not remove the old ones. What is interesting, though, is that this did not cause any issues, so maybe it is just a UI bug?

Hi @nflaig!

I believe what you’re referencing is probably a bug with the pods not going away in the UI.
If you have exact steps to repro, I’d be glad to take a look; otherwise I would recommend you open a bug with consul-k8s and we’ll get started on it.
I believe I’ve seen other reports of a similar issue but we have yet to reproduce it in house, so having exact steps would be very helpful!

cheers

Hi @kschoche1

I will look into the issue further, but I am not sure myself how to reproduce it. We didn’t have the issue previously and the configuration did not change, so I need to trace back what exactly changed. I will let you know if I find something and create an issue on GitHub.