[K8s] consul-controller pod leader election lost

Hello,

We use Consul federation and the service mesh on Kubernetes, and we are noticing random connection failures with Envoy.

Versions:
Consul 1.10.0
Envoy 1.17.4
Helm Chart 0.33

Before the connection failures start, we can see issues with cluster updates on some pods via the envoy_cluster_update_empty metric.
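
For reference, here is roughly how we inspect those counters on an affected pod (this assumes the Envoy admin API is bound on localhost:19000, which we believe is the consul-k8s default; namespace and pod name are placeholders):

    # Forward the Envoy admin port of an affected pod and grep the cluster update stats.
    # Port 19000 is assumed to be the admin bind address used by the injected sidecar.
    kubectl port-forward -n <app-namespace> <pod-name> 19000:19000 &
    curl -s localhost:19000/stats | grep cluster_update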

The consul-server and agent pods don't seem to have any trouble; we don't see any restarts or specific errors.
However, we sometimes detect restarts of the consul-controller pod, with these errors beforehand:

k8s.io/apimachinery/pkg/util/wait.Until
	/go/pkg/mod/k8s.io/apimachinery@v0.21.1/pkg/util/wait/wait.go:90
k8s.io/client-go/tools/leaderelection.(*LeaderElector).renew
	/go/pkg/mod/k8s.io/client-go@v0.21.1/tools/leaderelection/leaderelection.go:263
k8s.io/client-go/tools/leaderelection.(*LeaderElector).Run
	/go/pkg/mod/k8s.io/client-go@v0.21.1/tools/leaderelection/leaderelection.go:208
sigs.k8s.io/controller-runtime/pkg/manager.(*controllerManager).startLeaderElection.func3
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.9.0/pkg/manager/internal.go:682
2022-05-09T23:19:38.729Z	INFO	failed to renew lease consul/consul.hashicorp.com: timed out waiting for the condition

2022-05-09T23:19:38.729Z	INFO	controller.terminatinggateway	Shutdown signal received, waiting for all workers to finish	{"reconciler group": "consul.hashicorp.com", "reconciler kind": "TerminatingGateway"}
2022-05-09T23:19:38.729Z	INFO	controller.servicedefaults	Shutdown signal received, waiting for all workers to finish	{"reconciler group": "consul.hashicorp.com", "reconciler kind": "ServiceDefaults"}
2022-05-09T23:19:38.729Z	INFO	controller.servicesplitter	Shutdown signal received, waiting for all workers to finish	{"reconciler group": "consul.hashicorp.com", "reconciler kind": "ServiceSplitter"}
2022-05-09T23:19:38.729Z	INFO	controller.servicerouter	Shutdown signal received, waiting for all workers to finish	{"reconciler group": "consul.hashicorp.com", "reconciler kind": "ServiceRouter"}
2022-05-09T23:19:38.729Z	INFO	controller.serviceintentions	Shutdown signal received, waiting for all workers to finish	{"reconciler group": "consul.hashicorp.com", "reconciler kind": "ServiceIntentions"}
2022-05-09T23:19:38.729Z	INFO	controller.serviceresolver	Shutdown signal received, waiting for all workers to finish	{"reconciler group": "consul.hashicorp.com", "reconciler kind": "ServiceResolver"}
2022-05-09T23:19:38.729Z	INFO	controller.mesh	Shutdown signal received, waiting for all workers to finish	{"reconciler group": "consul.hashicorp.com", "reconciler kind": "Mesh"}
2022-05-09T23:19:38.729Z	INFO	controller.ingressgateway	Shutdown signal received, waiting for all workers to finish	{"reconciler group": "consul.hashicorp.com", "reconciler kind": "IngressGateway"}
2022-05-09T23:19:38.729Z	INFO	controller.proxydefaults	Shutdown signal received, waiting for all workers to finish	{"reconciler group": "consul.hashicorp.com", "reconciler kind": "ProxyDefaults"}
2022-05-09T23:19:38.729Z	INFO	controller.serviceintentions	All workers finished	{"reconciler group": "consul.hashicorp.com", "reconciler kind": "ServiceIntentions"}
2022-05-09T23:19:38.729Z	INFO	controller.ingressgateway	All workers finished	{"reconciler group": "consul.hashicorp.com", "reconciler kind": "IngressGateway"}
2022-05-09T23:19:38.729Z	INFO	controller.servicesplitter	All workers finished	{"reconciler group": "consul.hashicorp.com", "reconciler kind": "ServiceSplitter"}
2022-05-09T23:19:38.729Z	INFO	controller.servicerouter	All workers finished	{"reconciler group": "consul.hashicorp.com", "reconciler kind": "ServiceRouter"}
2022-05-09T23:19:38.729Z	INFO	controller.mesh	All workers finished	{"reconciler group": "consul.hashicorp.com", "reconciler kind": "Mesh"}
2022-05-09T23:19:38.729Z	INFO	controller.proxydefaults	All workers finished	{"reconciler group": "consul.hashicorp.com", "reconciler kind": "ProxyDefaults"}
2022-05-09T23:19:38.729Z	INFO	controller.terminatinggateway	All workers finished	{"reconciler group": "consul.hashicorp.com", "reconciler kind": "TerminatingGateway"}
2022-05-09T23:19:38.729Z	INFO	controller.servicedefaults	All workers finished	{"reconciler group": "consul.hashicorp.com", "reconciler kind": "ServiceDefaults"}
2022-05-09T23:19:38.729Z	INFO	controller.serviceresolver	All workers finished	{"reconciler group": "consul.hashicorp.com", "reconciler kind": "ServiceResolver"}
2022-05-09T23:19:38.729Z	INFO	controller-runtime.webhook	shutting down webhook server
2022-05-09T23:19:38.730Z	ERROR	error received after stop sequence was engaged	{"error": "context canceled"}
2022-05-09T23:19:38.729Z	ERROR	setup	problem running manager	{"error": "leader election lost"}
github.com/hashicorp/consul-k8s/control-plane/subcommand/controller.(*Command).Run
	/home/circleci/project/project/control-plane/subcommand/controller/command.go:326
github.com/mitchellh/cli.(*CLI).Run
	/go/pkg/mod/github.com/mitchellh/cli@v1.1.0/cli.go:260
main.main
	/home/circleci/project/project/control-plane/main.go:17
runtime.main
	/usr/local/go/src/runtime/proc.go:225

And the last state of this pod:

    State:          Running
      Started:      Tue, 10 May 2022 01:19:39 +0200
    Last State:     Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Sun, 08 May 2022 09:11:04 +0200
      Finished:     Tue, 10 May 2022 01:19:38 +0200
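
(The state above is from kubectl describe; the error output earlier was taken from the terminated container instance, roughly like this, with placeholders for our pod name:)

    # Placeholders for our namespace/pod name.
    kubectl describe pod -n consul <consul-controller-pod>
    kubectl logs -n consul <consul-controller-pod> --previous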

Do you know what causes these errors? Perhaps they are linked to our Envoy issues.

Thanks

I don't think the controller errors are related, because the controller just manages custom resources; it doesn't affect the Envoy proxies directly.

Are there any logs from consul clients?
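
Something along these lines should work to grab them, assuming the default Helm chart labels and a consul namespace (adjust for your install):

    # Assumes the official Helm chart's default labels on the client daemonset pods;
    # change the namespace, release name, and selectors to match your environment.
    kubectl logs -n consul -l app=consul,component=client --tail=200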