I am following the "Getting Started with Consul Service Mesh for Kubernetes" tutorial on HashiCorp Learn and am running into an issue where some of the services fail to register.
My setup details:
- 3 node cluster running RH OpenShift 4
- Deploying Consul with Helm
To deploy Consul with Helm on OpenShift, I had to modify the tutorial's config.yaml. I also switched the container images to the enterprise builds and attached our enterprise license.
File: learn-consul-kubernetes/service-mesh/deploy/config.yaml
global:
  name: consul
  datacenter: dc1
  image: registry.connect.redhat.com/hashicorp/consul-enterprise:1.11.1-ent-ubi
  imageEnvoy: envoyproxy/envoy:v1.18.4
  imageK8s: registry.connect.redhat.com/hashicorp/consul-k8s-control-plane:0.39.0-ubi
  openshift:
    enabled: true
  metrics:
    enabled: true
    enableAgentMetrics: true
  enterpriseLicense:
    secretName: 'consul-ent-license'
    secretKey: 'license'
server:
  replicas: 1
ui:
  enabled: true
connectInject:
  enabled: true
  default: true
controller:
  enabled: true
prometheus:
  enabled: true
grafana:
  enabled: true
helm install -f config.yaml consul hashicorp/consul --create-namespace -n test-servicemesh --version "0.39.0"
After deploying this config with Helm, Consul comes up fully and appears to be healthy. The web UI works as expected, and all of the inspection commands listed in the tutorial return the expected results.
I then deployed the HashiCups example from the tutorial (kubectl -n test-servicemesh apply -f hashicups/). All 4 services are provisioned, but only some of them register successfully with Consul. What separates the pods that succeed from the ones that fail seems to be the host they are scheduled on. If a pod lands on the same host that is running "consul-server-0" (or maybe "consul-connect-injector-webhook-deployment"), everything goes smoothly and the pod initializes as expected. Otherwise, the pod gets stuck in initialization, with the consul-connect-inject-init container waiting for the service to be registered.
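A rough way to check the placement pattern described above is to group the pods by node and phase. This is only a sketch: pods_by_node is a hypothetical helper name, and it assumes its input is "POD NODE PHASE" lines such as kubectl produces with custom columns. A pod stuck in an init container should show up with phase Pending.

```shell
# Hypothetical helper: reads "POD NODE PHASE" lines on stdin and prints one
# line per node listing the pods scheduled there together with their phase.
pods_by_node() {
  awk '{ pods[$2] = pods[$2] " " $1 "(" $3 ")" }
       END { for (n in pods) print n ":" pods[n] }' | sort
}

# Usage against the cluster (namespace taken from the post):
#   kubectl -n test-servicemesh get pods --no-headers \
#     -o custom-columns=POD:.metadata.name,NODE:.spec.nodeName,PHASE:.status.phase \
#     | pods_by_node
```

If every Pending pod sits on a node other than the one hosting consul-server-0, that would support the host-placement theory.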
I’ve listed below what I’ve traced through so far, but I am unsure how to resolve it.
Failed Pod (sidecar log for consul-connect-inject-init):
2022-01-21T18:49:05.324Z [INFO] Unable to find registered services; retrying
2022-01-21T18:49:06.325Z [INFO] Unable to find registered services; retrying
2022-01-21T18:49:07.326Z [INFO] Unable to find registered services; retrying
2022-01-21T18:49:08.327Z [INFO] Unable to find registered services; retrying
2022-01-21T18:49:09.328Z [INFO] Unable to find registered services; retrying
2022-01-21T18:49:10.330Z [INFO] Unable to find registered services; retrying
2022-01-21T18:49:11.330Z [INFO] Unable to find registered services; retrying
2022-01-21T18:49:12.332Z [INFO] Unable to find registered services; retrying
2022-01-21T18:49:13.332Z [INFO] Unable to find registered services; retrying
2022-01-21T18:49:14.333Z [INFO] Unable to find registered services; retrying
2022-01-21T18:49:14.333Z [INFO] Check to ensure a Kubernetes service has been created for this application. If your pod is not starting also check the connect-inject deployment logs.
consul-connect-injector-webhook-deployment log:
2022-01-21T18:34:59.610Z INFO controller.endpoints retrieved {"name": "product-api", "ns": "test-servicemesh"}
2022-01-21T18:34:59.610Z INFO controller.endpoints registering service with Consul {"name": "product-api", "id": "product-api-6798bc4b4d-xqzsn-product-api", "agentIP": "10.11.10.73"}
2022-01-21T18:35:29.611Z ERROR controller.endpoints failed to register service {"name": "product-api", "error": "Put \"http://10.11.10.73:8500/v1/agent/service/register\": dial tcp 10.11.10.73:8500: i/o timeout"}
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile
/home/circleci/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.10.2/pkg/internal/controller/controller.go:114
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
/home/circleci/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.10.2/pkg/internal/controller/controller.go:311
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
/home/circleci/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.10.2/pkg/internal/controller/controller.go:266
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
/home/circleci/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.10.2/pkg/internal/controller/controller.go:227
2022-01-21T18:35:29.611Z ERROR controller.endpoints failed to register services or health check {"name": "product-api", "ns": "test-servicemesh", "error": "Put \"http://10.11.10.73:8500/v1/agent/service/register\": dial tcp 10.11.10.73:8500: i/o timeout"}
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
/home/circleci/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.10.2/pkg/internal/controller/controller.go:311
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
/home/circleci/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.10.2/pkg/internal/controller/controller.go:266
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
/home/circleci/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.10.2/pkg/internal/controller/controller.go:227
2022-01-21T18:35:29.614Z ERROR controller.endpoints Reconciler error {"reconciler group": "", "reconciler kind": "Endpoints", "name": "product-api", "namespace": "test-servicemesh", "error": "1 error occurred:\n\t* Put \"http://10.11.10.73:8500/v1/agent/service/register\": dial tcp 10.11.10.73:8500: i/o timeout\n\n"}
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
/home/circleci/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.10.2/pkg/internal/controller/controller.go:266
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
/home/circleci/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.10.2/pkg/internal/controller/controller.go:227
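Since these errors repeat in a loop, it may help to boil the injector log down to the distinct agent addresses that time out. The following is a minimal sketch, not an official tool: failed_agents is a hypothetical helper name, and it assumes the error text keeps the "dial tcp IP:PORT: i/o timeout" shape shown above.

```shell
# Hypothetical helper: reads injector log text on stdin and prints each
# distinct agent address (IP:PORT) that produced a dial timeout.
failed_agents() {
  grep -o 'dial tcp [0-9.]*:[0-9]*: i/o timeout' \
    | awk '{ print $3 }' \
    | sed 's/:$//' \
    | sort -u
}

# Usage against the cluster (deployment name taken from the log heading above):
#   kubectl -n test-servicemesh logs deploy/consul-connect-injector-webhook-deployment \
#     | failed_agents
```

Comparing the printed addresses with the host IPs of the nodes shows which nodes' agents are unreachable from the injector.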
consul-controller logs:
2022-01-21T17:34:01.125Z INFO webhooks.servicedefaults validate create {"name": "product-api"}
2022-01-21T17:34:01.221Z INFO controller.servicedefaults config entry not found in consul {"request": "test-servicemesh/product-api"}
2022-01-21T17:34:01.224Z INFO controller.servicedefaults config entry created {"request": "test-servicemesh/product-api", "request-time": "2.835998ms"}
I apologize if the timestamps aren't in order. This general pattern of messages loops over and over, so it was hard for me to capture a time-consistent snapshot of the logs.
One of the logs indicates that the Kubernetes service might not have been created. I confirmed that a service for each of the expected Hashicups services appears (frontend, public-api, product-api, and postgres).
As I mentioned earlier, my issue seems to be related to pod placement: only pods that happen to land on a particular host work as expected, while the rest hit the i/o timeouts shown in the logs above. My current suspicion is that it has something to do with hostPort/host networking and possibly privileges or security constraints, but I am not sure how to pursue this further.
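One way to narrow this down might be a plain TCP probe of the agent address from the injector's log, run once from the node where things work and once from another node. This is a sketch under assumptions: probe_tcp is a hypothetical helper name, and it needs bash (for /dev/tcp) and coreutils timeout, which the Consul images may not include, so a separate debug pod or a node shell may be required. A probe from the host network and one from the pod network can also give different answers.

```shell
# Hypothetical helper: tries to open a TCP connection to HOST:PORT and
# gives up after 3 seconds. Returns non-zero when the connection fails.
probe_tcp() {
  host=$1
  port=$2
  if timeout 3 bash -c "exec 3<>/dev/tcp/$host/$port" 2>/dev/null; then
    echo "$host:$port reachable"
  else
    echo "$host:$port unreachable"
    return 1
  fi
}

# Usage, with the agent address from the injector log above:
#   probe_tcp 10.11.10.73 8500
```

If the probe succeeds only when run on the node that owns 10.11.10.73, the timeouts are more likely a cross-node reachability problem with port 8500 than a Consul registration problem.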
Thanks for any suggestions you may have!