Consul Connect failing to register service

ryan.cobb · January 21, 2022, 6:56pm

I am following the Consul Service Mesh tutorial (Getting Started with Consul Service Mesh for Kubernetes | Consul - HashiCorp Learn) and am running into an issue with some of the services failing to register.

My setup details:

3 node cluster running RH OpenShift 4
Deploying Consul with Helm

To deploy Consul with Helm on OpenShift, I needed to modify the tutorial’s config.yaml. I also switched the container images to the enterprise builds and attached our enterprise license.

File: learn-consul-kubernetes/service-mesh/deploy/config.yaml

global:
  name: consul
  datacenter: dc1
  image: registry.connect.redhat.com/hashicorp/consul-enterprise:1.11.1-ent-ubi
  imageEnvoy: envoyproxy/envoy:v1.18.4
  imageK8s: registry.connect.redhat.com/hashicorp/consul-k8s-control-plane:0.39.0-ubi
  openshift:
    enabled: true
  metrics:
    enabled: true
    enableAgentMetrics: true
  enterpriseLicense:
    secretName: 'consul-ent-license'
    secretKey: 'license'
server:
  replicas: 1
ui:
  enabled: true
connectInject:
  enabled: true
  default: true
controller:
  enabled: true
prometheus:
  enabled: true
grafana:
  enabled: true

helm install -f config.yaml consul hashicorp/consul --create-namespace -n test-servicemesh --version "0.39.0"

Deploying this config using Helm, Consul comes up fully and appears to be in a healthy state. The webui works as expected and all inspection commands listed in the aforementioned tutorial return expected results.

I then proceeded to deploy the Hashicups example from the tutorial (kubectl -n test-servicemesh apply -f hashicups/). All 4 of the services provision, but only some of them are able to register with Consul. The difference between the pods that succeed and those that fail seems to be which host they are scheduled on. If a pod happens to be scheduled on the same host that is running “consul-server-0” (or maybe “consul-connect-injector-webhook-deployment”), everything goes smoothly and the pod initializes as expected. Otherwise, the pod gets stuck in initialization, with the consul-connect-inject-init container waiting for the service to be registered.

I’ve listed below what I’ve traced through so far, but I am unsure how to resolve it.

Failed Pod (sidecar log for consul-connect-inject-init):

2022-01-21T18:49:05.324Z [INFO]  Unable to find registered services; retrying
2022-01-21T18:49:06.325Z [INFO]  Unable to find registered services; retrying
2022-01-21T18:49:07.326Z [INFO]  Unable to find registered services; retrying
2022-01-21T18:49:08.327Z [INFO]  Unable to find registered services; retrying
2022-01-21T18:49:09.328Z [INFO]  Unable to find registered services; retrying
2022-01-21T18:49:10.330Z [INFO]  Unable to find registered services; retrying
2022-01-21T18:49:11.330Z [INFO]  Unable to find registered services; retrying
2022-01-21T18:49:12.332Z [INFO]  Unable to find registered services; retrying
2022-01-21T18:49:13.332Z [INFO]  Unable to find registered services; retrying
2022-01-21T18:49:14.333Z [INFO]  Unable to find registered services; retrying
2022-01-21T18:49:14.333Z [INFO]  Check to ensure a Kubernetes service has been created for this application. If your pod is not starting also check the connect-inject deployment logs.

consul-connect-injector-webhook-deployment log:

2022-01-21T18:34:59.610Z	INFO	controller.endpoints	retrieved	{"name": "product-api", "ns": "test-servicemesh"}
2022-01-21T18:34:59.610Z	INFO	controller.endpoints	registering service with Consul	{"name": "product-api", "id": "product-api-6798bc4b4d-xqzsn-product-api", "agentIP": "10.11.10.73"}
2022-01-21T18:35:29.611Z	ERROR	controller.endpoints	failed to register service	{"name": "product-api", "error": "Put \"http://10.11.10.73:8500/v1/agent/service/register\": dial tcp 10.11.10.73:8500: i/o timeout"}
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile
	/home/circleci/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.10.2/pkg/internal/controller/controller.go:114
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
	/home/circleci/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.10.2/pkg/internal/controller/controller.go:311
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
	/home/circleci/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.10.2/pkg/internal/controller/controller.go:266
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
	/home/circleci/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.10.2/pkg/internal/controller/controller.go:227
2022-01-21T18:35:29.611Z	ERROR	controller.endpoints	failed to register services or health check	{"name": "product-api", "ns": "test-servicemesh", "error": "Put \"http://10.11.10.73:8500/v1/agent/service/register\": dial tcp 10.11.10.73:8500: i/o timeout"}
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
	/home/circleci/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.10.2/pkg/internal/controller/controller.go:311
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
	/home/circleci/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.10.2/pkg/internal/controller/controller.go:266
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
	/home/circleci/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.10.2/pkg/internal/controller/controller.go:227
2022-01-21T18:35:29.614Z	ERROR	controller.endpoints	Reconciler error	{"reconciler group": "", "reconciler kind": "Endpoints", "name": "product-api", "namespace": "test-servicemesh", "error": "1 error occurred:\n\t* Put \"http://10.11.10.73:8500/v1/agent/service/register\": dial tcp 10.11.10.73:8500: i/o timeout\n\n"}
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
	/home/circleci/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.10.2/pkg/internal/controller/controller.go:266
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
	/home/circleci/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.10.2/pkg/internal/controller/controller.go:227

consul-controller logs:

2022-01-21T17:34:01.125Z	INFO	webhooks.servicedefaults	validate create	{"name": "product-api"}
2022-01-21T17:34:01.221Z	INFO	controller.servicedefaults	config entry not found in consul	{"request": "test-servicemesh/product-api"}
2022-01-21T17:34:01.224Z	INFO	controller.servicedefaults	config entry created	{"request": "test-servicemesh/product-api", "request-time": "2.835998ms"}

I apologize if the timestamps aren’t in order. This general pattern of messages loops over and over, so it was hard for me to capture a time-consistent snapshot of the logs.

One of the logs indicates that the Kubernetes service might not have been created. I confirmed that a service for each of the expected Hashicups services appears (frontend, public-api, product-api, and postgres).

As I mentioned earlier, my issue seems to be related to pod placement: only pods that land on a particular host work as expected, while the rest hit the i/o timeouts shown in the logs above. My current suspicion is that it has something to do with hostPort/hostNetwork and possibly privileges or security constraints, but I am not sure how to pursue this thread further.
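One way to test this suspicion is to probe the agent's host port from a different node than the one the agent runs on. The following is a minimal sketch, not a verified procedure: the helper name probe_port is hypothetical, it assumes bash and the coreutils timeout command are available wherever it runs (for example in a node debug shell), and the address 10.11.10.73:8500 is simply the agentIP and port taken from the injector log above.

```shell
# Hypothetical helper: probe_port HOST PORT [TIMEOUT_SECONDS]
# Tries to open a TCP connection and reports whether it succeeded.
# Relies on bash's /dev/tcp pseudo-device, so bash must be installed.
probe_port() {
  host="$1"
  port="$2"
  secs="${3:-3}"
  if timeout "$secs" bash -c "exec 3<>/dev/tcp/$host/$port" 2>/dev/null; then
    echo "reachable $host:$port"
  else
    echo "unreachable $host:$port"
    return 1
  fi
}

# Example (run from a node that does NOT host the agent being probed):
#   probe_port 10.11.10.73 8500
```

If the probe succeeds from the agent's own node but fails from the other nodes, that points at something between the hosts (a firewall or security group) rather than at the Consul agent itself.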

Thanks for any suggestions you may have!

kschoche · January 21, 2022, 7:04pm

Hi @ryan.cobb -
It sounds to me like the pods that are not working are being scheduled on nodes that are not running a Consul agent. Can you confirm that the Consul client DaemonSet also got deployed and that the clients came online on every node?

ryan.cobb · January 21, 2022, 7:06pm

Hi, yes I had also checked that. Each host is running a consul agent. I see three consul agents running in the “consul” daemonset.

oc -n test-servicemesh get ds
NAME     DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
consul   3         3         3       3            3           <none>          2m28s

oc -n test-servicemesh get pod
NAME                                                          READY   STATUS     RESTARTS       AGE   IP             NODE                            NOMINATED NODE   READINESS GATES
consul-7tvnl                                                  1/1     Running    0              14m   10.128.3.14    ip-10-11-10-73.compute.internal   <none>           <none>
consul-connect-injector-webhook-deployment-77d5c4bf7c-dwtjm   1/1     Running    0              14m   10.129.3.130   ip-10-11-10-69.compute.internal   <none>           <none>
consul-connect-injector-webhook-deployment-77d5c4bf7c-llmx5   1/1     Running    0              14m   10.128.3.15    ip-10-11-10-73.compute.internal   <none>           <none>
consul-controller-6cd768bc76-ldttc                            1/1     Running    0              14m   10.129.3.129   ip-10-11-10-69.compute.internal   <none>           <none>
consul-hs2zc                                                  1/1     Running    0              14m   10.131.0.84    ip-10-11-10-67.compute.internal   <none>           <none>
consul-kctf7                                                  1/1     Running    0              14m   10.129.3.128   ip-10-11-10-69.compute.internal   <none>           <none>
consul-server-0                                               1/1     Running    0              14m   10.129.3.131   ip-10-11-10-69.compute.internal   <none>           <none>
consul-webhook-cert-manager-fcdf47f9b-mfndt                   1/1     Running    0              14m   10.129.3.127   ip-10-11-10-69.compute.internal   <none>           <none>
frontend-98cb6859b-wws7x                                      0/2     Init:1/2   5 (100s ago)   13m   10.128.3.16    ip-10-11-10-73.compute.internal   <none>           <none>
postgres-7cbb8d4cc-6qnff                                      2/2     Running    0              13m   10.129.3.132   ip-10-11-10-69.compute.internal   <none>           <none>
product-api-6798bc4b4d-dq47p                                  0/2     Init:1/2   5 (111s ago)   13m   10.128.3.17    ip-10-11-10-73.compute.internal   <none>           <none>
prometheus-server-5cbddcc44b-sq6gm                            2/2     Running    0              14m   10.131.0.85    ip-10-11-10-67.compute.internal   <none>           <none>
public-api-5bdf986897-sx2xv                                   0/2     Init:1/2   5 (108s ago)   13m   10.131.0.86    ip-10-11-10-67.compute.internal   <none>           <none>

Inspecting the consul-* client pods shows that each one runs on a distinct host with no overlap. In the listing above you can see that the “postgres” service was able to start up, but the other services are stuck in Init, blocked by the consul-connect-inject-init container failing to find the registration.
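To make the placement pattern easier to see, the pod listing can be reduced to just the stuck pods and their nodes. This is a rough sketch: the function name stuck_pods_by_node is hypothetical, and it assumes the column layout of the wide pod listing shown above, where the node name is the third field from the end of each row.

```shell
# Hypothetical helper: reads a wide pod listing on stdin and prints
# "<node> <pod>" for every pod whose STATUS starts with "Init".
# The node is taken as the third field from the end, because the
# RESTARTS column can contain spaces (e.g. "5 (100s ago)").
stuck_pods_by_node() {
  awk 'NR > 1 && $3 ~ /^Init/ { print $(NF-2), $1 }' | sort
}

# Example:
#   oc -n test-servicemesh get pod -o wide | stuck_pods_by_node
```

Applied to the listing above, this would print the frontend, product-api, and public-api pods, none of which are on the node that runs consul-server-0.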

ryan.cobb · January 21, 2022, 10:53pm

Hi @kschoche,

I was able to find a temporary fix that allows service registration to work correctly. The 3-node OpenShift cluster that I am running is deployed on AWS, and all of its network policies were set up during OpenShift’s standard install process. One of the SecurityGroups that defines inbound/outbound rules for the worker nodes was allowing only the ports OpenShift needed. When I added port 8500 to this SecurityGroup, all of the previous i/o timeout errors in the logs went away and the services registered correctly, allowing the pods to come up regardless of which host they were scheduled onto.
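For reference, the manual change described above corresponds roughly to the AWS CLI call below. This is a sketch of the temporary workaround only: the helper name allow_agent_port, the DRY_RUN switch, and the security group ID in the example are hypothetical, and it assumes the worker nodes share a single security group so that a self-referencing inbound rule is enough.

```shell
# Hypothetical helper: allow_agent_port SECURITY_GROUP_ID [PORT]
# Adds an inbound TCP rule to the security group that allows traffic
# from members of the same group. With DRY_RUN=1 (the default) it only
# prints the command instead of calling AWS.
allow_agent_port() {
  sg_id="$1"
  port="${2:-8500}"
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "aws ec2 authorize-security-group-ingress --group-id $sg_id --protocol tcp --port $port --source-group $sg_id"
  else
    aws ec2 authorize-security-group-ingress \
      --group-id "$sg_id" --protocol tcp --port "$port" \
      --source-group "$sg_id"
  fi
}

# Example (prints the command only):
#   allow_agent_port sg-0123456789abcdef0
```

Keeping the dry-run default makes it possible to review the exact rule before touching a group that the installer manages.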

My issue now, partly due to a lack of familiarity with Consul and OpenShift, is how to apply this inbound rule for 8500 correctly. The SecurityGroup that I changed is annotated as being “Created by OpenShift Installer” and is provisioned by Terraform. Manually changing this SecurityGroup doesn’t seem correct, and there is likely a more OpenShift/Kubernetes-native way to make the desired change. I noticed the Consul agents specify hostPort: 8500 in their YAML, but this setting appears to be ignored outright by OpenShift. Adding hostNetwork: true didn’t seem to change anything.

Is there another way to allow the Consul Agents to expose 8500 on the host under an OpenShift environment? I had already set the consul-client ServiceAccount as “privileged” as a test to see if security privileges were getting in the way, but that didn’t seem to resolve the issue.

Additionally, I noticed in the Consul Agent’s yaml specification that 8502 is specified as a hostPort. What is this used for and do I also need to perform a similar configuration for this port?
