[K8s] consul-controller pod leader election lost

Hello,

We use Consul federation and the service mesh on Kubernetes, and we are noticing random connection failures with Envoy.

Versions:
Consul 1.10.0
Envoy 1.17.4
Helm Chart 0.33

Before the connection failures start, we can see issues with cluster updates on some pods via the envoy_cluster_update_empty metric.
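
For reference, here is roughly how we inspect those counters on an affected pod (this assumes the Envoy admin API is bound on localhost:19000, which we believe is the consul-k8s default; namespace and pod name are placeholders):

    # Forward the Envoy admin port of an affected pod and grep the cluster update stats.
    # Port 19000 is assumed to be the admin bind address used by the injected sidecar.
    kubectl port-forward -n <app-namespace> <pod-name> 19000:19000 &
    curl -s localhost:19000/stats | grep cluster_update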

The consul-server and agent pods don't seem to have any trouble; we don't see any restarts or specific errors.
However, we sometimes detect restarts of the consul-controller pod, with these errors beforehand:

k8s.io/apimachinery/pkg/util/wait.Until
	/go/pkg/mod/k8s.io/apimachinery@v0.21.1/pkg/util/wait/wait.go:90
k8s.io/client-go/tools/leaderelection.(*LeaderElector).renew
	/go/pkg/mod/k8s.io/client-go@v0.21.1/tools/leaderelection/leaderelection.go:263
k8s.io/client-go/tools/leaderelection.(*LeaderElector).Run
	/go/pkg/mod/k8s.io/client-go@v0.21.1/tools/leaderelection/leaderelection.go:208
sigs.k8s.io/controller-runtime/pkg/manager.(*controllerManager).startLeaderElection.func3
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.9.0/pkg/manager/internal.go:682
2022-05-09T23:19:38.729Z	INFO	failed to renew lease consul/consul.hashicorp.com: timed out waiting for the condition

2022-05-09T23:19:38.729Z	INFO	controller.terminatinggateway	Shutdown signal received, waiting for all workers to finish	{"reconciler group": "consul.hashicorp.com", "reconciler kind": "TerminatingGateway"}
2022-05-09T23:19:38.729Z	INFO	controller.servicedefaults	Shutdown signal received, waiting for all workers to finish	{"reconciler group": "consul.hashicorp.com", "reconciler kind": "ServiceDefaults"}
2022-05-09T23:19:38.729Z	INFO	controller.servicesplitter	Shutdown signal received, waiting for all workers to finish	{"reconciler group": "consul.hashicorp.com", "reconciler kind": "ServiceSplitter"}
2022-05-09T23:19:38.729Z	INFO	controller.servicerouter	Shutdown signal received, waiting for all workers to finish	{"reconciler group": "consul.hashicorp.com", "reconciler kind": "ServiceRouter"}
2022-05-09T23:19:38.729Z	INFO	controller.serviceintentions	Shutdown signal received, waiting for all workers to finish	{"reconciler group": "consul.hashicorp.com", "reconciler kind": "ServiceIntentions"}
2022-05-09T23:19:38.729Z	INFO	controller.serviceresolver	Shutdown signal received, waiting for all workers to finish	{"reconciler group": "consul.hashicorp.com", "reconciler kind": "ServiceResolver"}
2022-05-09T23:19:38.729Z	INFO	controller.mesh	Shutdown signal received, waiting for all workers to finish	{"reconciler group": "consul.hashicorp.com", "reconciler kind": "Mesh"}
2022-05-09T23:19:38.729Z	INFO	controller.ingressgateway	Shutdown signal received, waiting for all workers to finish	{"reconciler group": "consul.hashicorp.com", "reconciler kind": "IngressGateway"}
2022-05-09T23:19:38.729Z	INFO	controller.proxydefaults	Shutdown signal received, waiting for all workers to finish	{"reconciler group": "consul.hashicorp.com", "reconciler kind": "ProxyDefaults"}
2022-05-09T23:19:38.729Z	INFO	controller.serviceintentions	All workers finished	{"reconciler group": "consul.hashicorp.com", "reconciler kind": "ServiceIntentions"}
2022-05-09T23:19:38.729Z	INFO	controller.ingressgateway	All workers finished	{"reconciler group": "consul.hashicorp.com", "reconciler kind": "IngressGateway"}
2022-05-09T23:19:38.729Z	INFO	controller.servicesplitter	All workers finished	{"reconciler group": "consul.hashicorp.com", "reconciler kind": "ServiceSplitter"}
2022-05-09T23:19:38.729Z	INFO	controller.servicerouter	All workers finished	{"reconciler group": "consul.hashicorp.com", "reconciler kind": "ServiceRouter"}
2022-05-09T23:19:38.729Z	INFO	controller.mesh	All workers finished	{"reconciler group": "consul.hashicorp.com", "reconciler kind": "Mesh"}
2022-05-09T23:19:38.729Z	INFO	controller.proxydefaults	All workers finished	{"reconciler group": "consul.hashicorp.com", "reconciler kind": "ProxyDefaults"}
2022-05-09T23:19:38.729Z	INFO	controller.terminatinggateway	All workers finished	{"reconciler group": "consul.hashicorp.com", "reconciler kind": "TerminatingGateway"}
2022-05-09T23:19:38.729Z	INFO	controller.servicedefaults	All workers finished	{"reconciler group": "consul.hashicorp.com", "reconciler kind": "ServiceDefaults"}
2022-05-09T23:19:38.729Z	INFO	controller.serviceresolver	All workers finished	{"reconciler group": "consul.hashicorp.com", "reconciler kind": "ServiceResolver"}
2022-05-09T23:19:38.729Z	INFO	controller-runtime.webhook	shutting down webhook server
2022-05-09T23:19:38.730Z	ERROR	error received after stop sequence was engaged	{"error": "context canceled"}
2022-05-09T23:19:38.729Z	ERROR	setup	problem running manager	{"error": "leader election lost"}
github.com/hashicorp/consul-k8s/control-plane/subcommand/controller.(*Command).Run
	/home/circleci/project/project/control-plane/subcommand/controller/command.go:326
github.com/mitchellh/cli.(*CLI).Run
	/go/pkg/mod/github.com/mitchellh/cli@v1.1.0/cli.go:260
main.main
	/home/circleci/project/project/control-plane/main.go:17
runtime.main
	/usr/local/go/src/runtime/proc.go:225

And the last state of this pod:

    State:          Running
      Started:      Tue, 10 May 2022 01:19:39 +0200
    Last State:     Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Sun, 08 May 2022 09:11:04 +0200
      Finished:     Tue, 10 May 2022 01:19:38 +0200
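
(The state above is from kubectl describe; the error output earlier was taken from the terminated container instance, roughly like this, with placeholders for our pod name:)

    # Placeholders for our namespace/pod name.
    kubectl describe pod -n consul <consul-controller-pod>
    kubectl logs -n consul <consul-controller-pod> --previous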

Do you know what causes these errors? Perhaps they are linked to our Envoy issues.

Thanks

I don't think the controller errors are related, because the controller just manages custom resources; it doesn't affect the Envoy proxies directly.

Are there any logs from consul clients?
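
Something along these lines should work to grab them, assuming the default Helm chart labels and a consul namespace (adjust for your install):

    # Assumes the official Helm chart's default labels on the client daemonset pods;
    # change the namespace, release name, and selectors to match your environment.
    kubectl logs -n consul -l app=consul,component=client --tail=200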