Hello,
We use Consul federation and the service mesh on Kubernetes, and we notice random connection failures with Envoy.
Versions:
Consul 1.10.0
Envoy 1.17.4
Helm Chart 0.33
Before the connection failures start, we can see issues with cluster updates on some pods via the envoy_cluster_update_empty metric.
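In case it helps, this is roughly how we check that stat on a sidecar (a sketch: the pod and container names are placeholders for our deployment, and 19000 is the default Envoy admin port for Connect sidecars):

```shell
# List Envoy cluster-update stats from a sidecar's admin endpoint.
# <pod> and the container name are placeholders; adjust for your release.
kubectl exec <pod> -c envoy-sidecar -- \
  curl -s http://localhost:19000/stats | grep update_empty
```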
The consul-server and agent pods don't seem to have any trouble; we don't see restarts or specific errors there.
However, we sometimes detect restarts of the consul-controller pod, with these errors beforehand:
k8s.io/apimachinery/pkg/util/wait.Until
/go/pkg/mod/k8s.io/apimachinery@v0.21.1/pkg/util/wait/wait.go:90
k8s.io/client-go/tools/leaderelection.(*LeaderElector).renew
/go/pkg/mod/k8s.io/client-go@v0.21.1/tools/leaderelection/leaderelection.go:263
k8s.io/client-go/tools/leaderelection.(*LeaderElector).Run
/go/pkg/mod/k8s.io/client-go@v0.21.1/tools/leaderelection/leaderelection.go:208
sigs.k8s.io/controller-runtime/pkg/manager.(*controllerManager).startLeaderElection.func3
/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.9.0/pkg/manager/internal.go:682
2022-05-09T23:19:38.729Z INFO failed to renew lease consul/consul.hashicorp.com: timed out waiting for the condition
2022-05-09T23:19:38.729Z INFO controller.terminatinggateway Shutdown signal received, waiting for all workers to finish {"reconciler group": "consul.hashicorp.com", "reconciler kind": "TerminatingGateway"}
2022-05-09T23:19:38.729Z INFO controller.servicedefaults Shutdown signal received, waiting for all workers to finish {"reconciler group": "consul.hashicorp.com", "reconciler kind": "ServiceDefaults"}
2022-05-09T23:19:38.729Z INFO controller.servicesplitter Shutdown signal received, waiting for all workers to finish {"reconciler group": "consul.hashicorp.com", "reconciler kind": "ServiceSplitter"}
2022-05-09T23:19:38.729Z INFO controller.servicerouter Shutdown signal received, waiting for all workers to finish {"reconciler group": "consul.hashicorp.com", "reconciler kind": "ServiceRouter"}
2022-05-09T23:19:38.729Z INFO controller.serviceintentions Shutdown signal received, waiting for all workers to finish {"reconciler group": "consul.hashicorp.com", "reconciler kind": "ServiceIntentions"}
2022-05-09T23:19:38.729Z INFO controller.serviceresolver Shutdown signal received, waiting for all workers to finish {"reconciler group": "consul.hashicorp.com", "reconciler kind": "ServiceResolver"}
2022-05-09T23:19:38.729Z INFO controller.mesh Shutdown signal received, waiting for all workers to finish {"reconciler group": "consul.hashicorp.com", "reconciler kind": "Mesh"}
2022-05-09T23:19:38.729Z INFO controller.ingressgateway Shutdown signal received, waiting for all workers to finish {"reconciler group": "consul.hashicorp.com", "reconciler kind": "IngressGateway"}
2022-05-09T23:19:38.729Z INFO controller.proxydefaults Shutdown signal received, waiting for all workers to finish {"reconciler group": "consul.hashicorp.com", "reconciler kind": "ProxyDefaults"}
2022-05-09T23:19:38.729Z INFO controller.serviceintentions All workers finished {"reconciler group": "consul.hashicorp.com", "reconciler kind": "ServiceIntentions"}
2022-05-09T23:19:38.729Z INFO controller.ingressgateway All workers finished {"reconciler group": "consul.hashicorp.com", "reconciler kind": "IngressGateway"}
2022-05-09T23:19:38.729Z INFO controller.servicesplitter All workers finished {"reconciler group": "consul.hashicorp.com", "reconciler kind": "ServiceSplitter"}
2022-05-09T23:19:38.729Z INFO controller.servicerouter All workers finished {"reconciler group": "consul.hashicorp.com", "reconciler kind": "ServiceRouter"}
2022-05-09T23:19:38.729Z INFO controller.mesh All workers finished {"reconciler group": "consul.hashicorp.com", "reconciler kind": "Mesh"}
2022-05-09T23:19:38.729Z INFO controller.proxydefaults All workers finished {"reconciler group": "consul.hashicorp.com", "reconciler kind": "ProxyDefaults"}
2022-05-09T23:19:38.729Z INFO controller.terminatinggateway All workers finished {"reconciler group": "consul.hashicorp.com", "reconciler kind": "TerminatingGateway"}
2022-05-09T23:19:38.729Z INFO controller.servicedefaults All workers finished {"reconciler group": "consul.hashicorp.com", "reconciler kind": "ServiceDefaults"}
2022-05-09T23:19:38.729Z INFO controller.serviceresolver All workers finished {"reconciler group": "consul.hashicorp.com", "reconciler kind": "ServiceResolver"}
2022-05-09T23:19:38.729Z INFO controller-runtime.webhook shutting down webhook server
2022-05-09T23:19:38.730Z ERROR error received after stop sequence was engaged {"error": "context canceled"}
2022-05-09T23:19:38.729Z ERROR setup problem running manager {"error": "leader election lost"}
github.com/hashicorp/consul-k8s/control-plane/subcommand/controller.(*Command).Run
/home/circleci/project/project/control-plane/subcommand/controller/command.go:326
github.com/mitchellh/cli.(*CLI).Run
/go/pkg/mod/github.com/mitchellh/cli@v1.1.0/cli.go:260
main.main
/home/circleci/project/project/control-plane/main.go:17
runtime.main
/usr/local/go/src/runtime/proc.go:225
And the last states of this pod:
State: Running
Started: Tue, 10 May 2022 01:19:39 +0200
Last State: Terminated
Reason: Error
Exit Code: 1
Started: Sun, 08 May 2022 09:11:04 +0200
Finished: Tue, 10 May 2022 01:19:38 +0200
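The state above comes from inspecting the pod with kubectl; a sketch of what we run (the pod name and namespace are placeholders):

```shell
# Show the current and last terminated container states for the controller pod.
kubectl describe pod <consul-controller-pod> -n consul | grep -A 6 'Last State'
# Pull the logs from the previous (crashed) container instance.
kubectl logs <consul-controller-pod> -n consul --previous
```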
Do you know what causes these errors? Maybe they are linked to our Envoy issues.
Thanks