Consul Connect failing to register service

I am following the Consul Service Mesh tutorial (Getting Started with Consul Service Mesh for Kubernetes | Consul - HashiCorp Learn) and am running into an issue where some of the services fail to register.

My setup details:

  • 3 node cluster running RH OpenShift 4
  • Deploying Consul with Helm

To deploy Consul with Helm, I needed to modify the tutorial’s config.yaml to work with OpenShift. I also updated the container images to the enterprise versions and attached our enterprise license.

File: learn-consul-kubernetes/service-mesh/deploy/config.yaml

  name: consul
  datacenter: dc1
  imageEnvoy: envoyproxy/envoy:v1.18.4
    enabled: true
    enabled: true
    enableAgentMetrics: true
    secretName: 'consul-ent-license'
    secretKey: 'license'
  replicas: 1
  enabled: true
  enabled: true
  default: true
  enabled: true
  enabled: true
  enabled: true

helm install -f config.yaml consul hashicorp/consul --create-namespace -n test-servicemesh --version "0.39.0"

After deploying this config with Helm, Consul comes up fully and appears to be healthy. The web UI works as expected, and all inspection commands listed in the tutorial return the expected results.

I then deployed the Hashicups example from the tutorial (kubectl -n test-servicemesh apply -f hashicups/). All 4 of the services are provisioned, but only some of them successfully register with Consul. What separates the pods that succeed from those that fail seems to be the host they are scheduled on. If a pod happens to be scheduled on the same host that is running “consul-server-0” (or maybe “consul-connect-injector-webhook-deployment”), everything goes smoothly and the pod initializes as expected. Otherwise, the pod gets stuck in initialization, with the consul-connect-inject-init init container waiting for the service to be registered.
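One way to check the scheduling pattern described above is to tally, per node, how many pods are not running. The following is a minimal sketch rather than a definitive diagnostic: the function name `pods_not_running_by_node` is hypothetical, and it assumes input lines of the form `NODE NAME PHASE`, such as those produced by the `oc get pod` custom-columns invocation shown in the comment. Pods stuck in an init container normally report the `Pending` phase.

```shell
# Hypothetical helper: reads "NODE NAME PHASE" lines on stdin and prints,
# for each node, how many of its pods are not in the Running phase.
# Assumed input source (requires cluster access, not run here):
#   oc -n test-servicemesh get pod --no-headers \
#     -o custom-columns=NODE:.spec.nodeName,NAME:.metadata.name,PHASE:.status.phase
pods_not_running_by_node() {
  awk '
    { total[$1]++ }                    # count every pod seen on this node
    $3 != "Running" { stuck[$1]++ }    # count pods in any other phase
    END {
      for (node in total)
        printf "%s %d/%d not running\n", node, stuck[node], total[node]
    }
  ' | sort
}
```

If the failures follow the node rather than the workload, nodes other than the one hosting consul-server-0 should show a non-zero count.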

I’ve listed below what I’ve traced through so far, but I am unsure how to resolve it.

Failed pod (log from the consul-connect-inject-init init container):

2022-01-21T18:49:05.324Z [INFO]  Unable to find registered services; retrying
2022-01-21T18:49:06.325Z [INFO]  Unable to find registered services; retrying
2022-01-21T18:49:07.326Z [INFO]  Unable to find registered services; retrying
2022-01-21T18:49:08.327Z [INFO]  Unable to find registered services; retrying
2022-01-21T18:49:09.328Z [INFO]  Unable to find registered services; retrying
2022-01-21T18:49:10.330Z [INFO]  Unable to find registered services; retrying
2022-01-21T18:49:11.330Z [INFO]  Unable to find registered services; retrying
2022-01-21T18:49:12.332Z [INFO]  Unable to find registered services; retrying
2022-01-21T18:49:13.332Z [INFO]  Unable to find registered services; retrying
2022-01-21T18:49:14.333Z [INFO]  Unable to find registered services; retrying
2022-01-21T18:49:14.333Z [INFO]  Check to ensure a Kubernetes service has been created for this application. If your pod is not starting also check the connect-inject deployment logs.

consul-connect-injector-webhook-deployment log:

2022-01-21T18:34:59.610Z	INFO	controller.endpoints	retrieved	{"name": "product-api", "ns": "test-servicemesh"}
2022-01-21T18:34:59.610Z	INFO	controller.endpoints	registering service with Consul	{"name": "product-api", "id": "product-api-6798bc4b4d-xqzsn-product-api", "agentIP": ""}
2022-01-21T18:35:29.611Z	ERROR	controller.endpoints	failed to register service	{"name": "product-api", "error": "Put \"\": dial tcp i/o timeout"}*Controller).Reconcile
2022-01-21T18:35:29.611Z	ERROR	controller.endpoints	failed to register services or health check	{"name": "product-api", "ns": "test-servicemesh", "error": "Put \"\": dial tcp i/o timeout"}*Controller).reconcileHandler
2022-01-21T18:35:29.614Z	ERROR	controller.endpoints	Reconciler error	{"reconciler group": "", "reconciler kind": "Endpoints", "name": "product-api", "namespace": "test-servicemesh", "error": "1 error occurred:\n\t* Put \"\": dial tcp i/o timeout\n\n"}*Controller).processNextWorkItem

consul-controller logs:

2022-01-21T17:34:01.125Z	INFO	webhooks.servicedefaults	validate create	{"name": "product-api"}
2022-01-21T17:34:01.221Z	INFO	controller.servicedefaults	config entry not found in consul	{"request": "test-servicemesh/product-api"}
2022-01-21T17:34:01.224Z	INFO	controller.servicedefaults	config entry created	{"request": "test-servicemesh/product-api", "request-time": "2.835998ms"}

I apologize if the timestamps aren’t in order; this general pattern of messages loops over and over, so it was hard to capture a time-consistent snapshot of the logs.

One of the logs indicates that the Kubernetes service might not have been created. I confirmed that a service for each of the expected Hashicups services appears (frontend, public-api, product-api, and postgres).

As I mentioned earlier, my issue seems to be related to pod placement: only pods that happen to land on a particular host work as expected, while the rest hit the connectivity issues shown by the i/o timeouts in the logs above. My current suspicion is that it has something to do with hostPort/host networking and possibly privileges or security constraints, but I am not sure how to pursue this thread further.
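One way to test the connectivity suspicion directly is to probe a Consul client agent's port from a pod or node on a different host. The following is a minimal sketch, assuming `bash` and the coreutils `timeout` command are available wherever it runs; `probe_port` is a hypothetical helper name and `<node-ip>` is a placeholder for the address of the node running the agent.

```shell
# Hypothetical helper: succeeds if a TCP connection to host:port can be
# opened within 3 seconds. Relies on bash's /dev/tcp redirection and on
# the coreutils timeout command (both are assumptions about the environment).
probe_port() {
  host=$1
  port=$2
  timeout 3 bash -c ": </dev/tcp/$host/$port" 2>/dev/null
}

# Example usage (not run here); 8500 is the Consul agent's default HTTP API port:
#   probe_port <node-ip> 8500 && echo reachable || echo "blocked or timed out"
```

A probe that succeeds from the agent's own node but times out from the other nodes would point at something between the hosts, rather than at Consul itself.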

Thanks for any suggestions you may have!

Hi @ryan.cobb -
It sounds to me like the pods that are not working are being scheduled on nodes that are not running a Consul agent. Can you confirm that the Consul client DaemonSet also got deployed and that the clients came online on every node?

Hi, yes I had also checked that. Each host is running a consul agent. I see three consul agents running in the “consul” daemonset.

oc -n test-servicemesh get ds
NAME     DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
consul   3         3         3       3            3           <none>          2m28s

oc -n test-servicemesh get pod
NAME                                                          READY   STATUS     RESTARTS       AGE   IP             NODE                            NOMINATED NODE   READINESS GATES
consul-7tvnl                                                  1/1     Running    0              14m    ip-10-11-10-73.compute.internal   <none>           <none>
consul-connect-injector-webhook-deployment-77d5c4bf7c-dwtjm   1/1     Running    0              14m   ip-10-11-10-69.compute.internal   <none>           <none>
consul-connect-injector-webhook-deployment-77d5c4bf7c-llmx5   1/1     Running    0              14m    ip-10-11-10-73.compute.internal   <none>           <none>
consul-controller-6cd768bc76-ldttc                            1/1     Running    0              14m   ip-10-11-10-69.compute.internal   <none>           <none>
consul-hs2zc                                                  1/1     Running    0              14m    ip-10-11-10-67.compute.internal   <none>           <none>
consul-kctf7                                                  1/1     Running    0              14m   ip-10-11-10-69.compute.internal   <none>           <none>
consul-server-0                                               1/1     Running    0              14m   ip-10-11-10-69.compute.internal   <none>           <none>
consul-webhook-cert-manager-fcdf47f9b-mfndt                   1/1     Running    0              14m   ip-10-11-10-69.compute.internal   <none>           <none>
frontend-98cb6859b-wws7x                                      0/2     Init:1/2   5 (100s ago)   13m    ip-10-11-10-73.compute.internal   <none>           <none>
postgres-7cbb8d4cc-6qnff                                      2/2     Running    0              13m   ip-10-11-10-69.compute.internal   <none>           <none>
product-api-6798bc4b4d-dq47p                                  0/2     Init:1/2   5 (111s ago)   13m    ip-10-11-10-73.compute.internal   <none>           <none>
prometheus-server-5cbddcc44b-sq6gm                            2/2     Running    0              14m    ip-10-11-10-67.compute.internal   <none>           <none>
public-api-5bdf986897-sx2xv                                   0/2     Init:1/2   5 (108s ago)   13m    ip-10-11-10-67.compute.internal   <none>           <none>

Inspecting each consul-* pod showed that each one is running on a distinct host with no overlap. In the listing above you can see that the “postgres” service was able to start up successfully, but the other services are stuck in Init, blocked by the connect init container failing to find the registration.

Hi @kschoche,

I was able to get a temporary fix that allows service resolution to work correctly. The 3-node OpenShift cluster that I am running is deployed on AWS, and all of its network policies were set up during OpenShift’s standard install process. One of the security groups that defines inbound/outbound rules for the worker nodes allowed only the ports OpenShift needed. When I added port 8500 to this security group, all of the previous i/o errors in the logs went away and the services registered correctly, allowing the pods to come up successfully regardless of which host they were scheduled onto.
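For reference, the temporary fix described above corresponds roughly to an AWS CLI call like the one below. This is only a sketch: the function name `allow_consul_http` and the security group ID are placeholders, and it assumes the rule should admit traffic from members of the same worker security group, which is not something stated above.

```shell
# Hypothetical wrapper around the AWS CLI; requires configured credentials.
# Adds an inbound TCP rule for the given port (default 8500), using the same
# security group as the allowed traffic source. The source is an assumption;
# change or narrow it to fit the actual network layout.
allow_consul_http() {
  sg_id=$1
  port=${2:-8500}
  aws ec2 authorize-security-group-ingress \
    --group-id "$sg_id" \
    --protocol tcp \
    --port "$port" \
    --source-group "$sg_id"
}

# Example usage (placeholder ID, not run here):
#   allow_consul_http sg-0123456789abcdef0 8500
```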

My issue now, partly due to my lack of familiarity with Consul and OpenShift, is how to enact this inbound rule for 8500 correctly. The security group that I changed is annotated as “Created by OpenShift Installer” and is provisioned by Terraform. Manually changing this security group doesn’t seem correct, and there is likely a more OpenShift/Kubernetes-native way to make the desired change. I noticed that the Consul agents specify hostPort: 8500 in their YAML, but this setting appears to be ignored outright by OpenShift. Adding hostNetwork: true didn’t seem to change anything.

Is there another way to allow the Consul Agents to expose 8500 on the host under an OpenShift environment? I had already set the consul-client ServiceAccount as “privileged” as a test to see if security privileges were getting in the way, but that didn’t seem to resolve the issue.

Additionally, I noticed that the Consul agent’s YAML specification also lists 8502 as a hostPort. What is this port used for, and do I also need to perform a similar configuration for it?