Consul sidecar injection not working

Hi all,
I'm facing an issue with sidecar injection. It fails in the init container, at the consul-connect-inject-init step. I have tried various debugging approaches but have not been able to find the root cause.
The error is:

[ERROR] Timed out waiting for service registration: error="did not find correct number of services, found: 0, services: map[]"

Following are the connect injector logs:

{"level":"info","ts":1653060188.246912,"logger":"controller.endpoints","msg":"ignoring because endpoints pods have not been injected","name":"test","ns":"search"}

We get the below in the consul client logs:

[ERROR] agent.http: Request error: method=GET url=/v1/acl/token/self?stale= from=10.140.132.67:33496 error="ACL not found"

However, if we exec into the init container of the failing pod and call /v1/acl/token/self?stale= from there, the request succeeds and we get the response below:

{"AccessorID":"3574d53c-c932-7cf1-6dfc-cbab8bfdd832","SecretID":"88a61d4b-c5d9-7168-35d5-825b7ba45550","Description":"token created via login: {\"pod\":\"test\"}","ServiceIdentities":[{"ServiceName":"test"}],"Local":true,"AuthMethod":"consul-consul-k8s-auth-method","CreateTime":"2022-05-24T13:20:39.150970114Z","Hash":"ZV9+Kxykl3bcsYdVxFRMaUfmz534rfgSWs1rqsn4m3g4=","CreateIndex":718941,"ModifyIndex":718941}
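When comparing what the init container sees with what the Consul client rejects, it can help to pull out only the fields of the token response that matter for the login. The following is a minimal Python sketch, not part of the original setup: the function name `summarize_token` is hypothetical, and it only parses JSON of the shape shown above; it does not call Consul.

```python
import json


def summarize_token(raw_json: str, expected_service: str) -> dict:
    """Extract the login-relevant fields from a /v1/acl/token/self response."""
    token = json.loads(raw_json)
    # ServiceIdentities may be absent or null on tokens without service identities.
    services = [si["ServiceName"] for si in token.get("ServiceIdentities") or []]
    return {
        "accessor_id": token.get("AccessorID"),
        "auth_method": token.get("AuthMethod"),
        "local": bool(token.get("Local", False)),
        "services": services,
        "matches_service": expected_service in services,
    }


if __name__ == "__main__":
    # Trimmed copy of the response above (SecretID deliberately left out).
    sample = (
        '{"AccessorID":"3574d53c-c932-7cf1-6dfc-cbab8bfdd832",'
        '"ServiceIdentities":[{"ServiceName":"test"}],'
        '"Local":true,"AuthMethod":"consul-consul-k8s-auth-method"}'
    )
    print(summarize_token(sample, expected_service="test"))
```

A token with "Local": true exists only in the datacenter that created it, so this summary is mainly useful for confirming which auth method issued the token and whether its service identity matches the service being registered.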

We can also see the annotation below in the pod definition:

consul.hashicorp.com/connect-inject-status: injected

In the Consul UI, we can see that the ACL token has also been created for the pod.

One odd thing is that sidecar injection does work sometimes, but even then it takes a long time, approximately 5-10 minutes. Most of the time it does not work. However, once a service has been injected, it keeps working fine across new deployments, pod recreation, restarts, etc. The issue occurs only when enabling sidecars for the first time.

We have added the annotations below to the pod definition:

{{- if $.Values.service_mesh.enabled }}
        'consul.hashicorp.com/connect-service': "{{ $.Release.Name }}"
        'consul.hashicorp.com/connect-inject': 'true'
        'consul.hashicorp.com/transparent-proxy': 'false'
        'consul.hashicorp.com/connect-service-upstreams': {{ $.Values.service_mesh.upstreams | quote }}
{{- end }}

Also, the service account has the same name as the service.

Versions:
Consul : consul:1.11.1
Envoy : envoy-alpine:v1.18.2
Helm Chart : consul-0.41.1

Any leads would be greatly appreciated!

After some digging, we found that the fix for our issue seems to be in this GH issue. It was released in the 0.44.0 chart.
We are applying this in our dev environment, along with upgrading Consul to 1.12.1 and Envoy to 1.20.3, but we are seeing some issues with it.
About our setup: we are not using mesh gateways but are connecting two DCs directly, a VM DC (primary) <-> a K8s DC (secondary). They are connected via the following config in values.yaml:

server:
  ...
  extraConfig: |
    {
      "primary_datacenter": "primary",
      "retry_join_wan":["primary01", "primary02", "primary03"]
    }

When we try to upgrade the chart from 0.41.1 to 0.44.0, the consul client and controller pods get stuck in the init container phase.
We get the log below in the controller init-acl container:

{"@level":"error","@message":"Consul login failed","@timestamp":"2022-05-31T15:06:29.635092Z","error":"error logging in: Unexpected response code: 403 (rpc error making call: Permission denied)"}

We find the below logs in the consul server:

2022-05-31T15:12:35.429Z [ERROR] agent.http: Request error: method=PUT url=/v1/acl/binding-rule?dc=qa-us-gcp from=10.141.128.53:55674 error="rpc error making call: cannot find auth method with name "consul-consul-k8s-component-auth-method-dev-gcp-k8s""

We are not sure why it goes to the primary VM DC to create the auth method.
To work around this, we need to set the config below:

federation:
    k8sAuthMethodHost: https://<ip-of-API-Server-of-k8s-cluster>

This works, even though we have not enabled federation (since it supports only mesh gateways for now). The auth method consul-consul-k8s-component-auth-method-dev-gcp-k8s gets created in the VM DC. This lets the ACL init Job run to completion, and the consul client pods come up normally.
For the controller pod, we found the binding rule to be missing.

/ # curl -k https://localhost:8501/v1/acl/binding-rules?token=<acl-token> | jq
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   918  100   918    0     0   263k      0 --:--:-- --:--:-- --:--:--  298k
[
  {
    "ID": "1e6c6232-8ec5-de3f-bb30-9db3bd7b655c",
    "Description": "Kubernetes binding rule",
    "AuthMethod": "consul-consul-k8s-auth-method",
    "Selector": "serviceaccount.name!=default",
    "BindType": "service",
    "BindName": "${serviceaccount.name}",
    "CreateIndex": 47,
    "ModifyIndex": 670
  },
  {
    "ID": "59d0e5a3-2de6-d9ef-985c-836576baf9fd",
    "Description": "Binding Rule for consul-consul-client",
    "AuthMethod": "consul-consul-k8s-component-auth-method",
    "Selector": "serviceaccount.name==\"consul-consul-client\"",
    "BindType": "role",
    "BindName": "consul-consul-client-acl-role",
    "CreateIndex": 379,
    "ModifyIndex": 668
  },
  {
    "ID": "e9528a26-c4e1-33ad-8eac-e8ba8aedcb48",
    "Description": "Binding Rule for consul-consul-connect-injector",
    "AuthMethod": "consul-consul-k8s-component-auth-method",
    "Selector": "serviceaccount.name==\"consul-consul-connect-injector\"",
    "BindType": "role",
    "BindName": "consul-consul-connect-injector-acl-role",
    "CreateIndex": 382,
    "ModifyIndex": 671
  }
]
/ # 
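To spot which component lacks a binding rule without reading the JSON by eye, the listing above can be checked programmatically. This is a rough Python sketch under assumptions: the helper names are hypothetical, and the controller's service account name `consul-consul-controller` is inferred from the naming pattern of the other two rules rather than taken from the output.

```python
import json
import re


def accounts_with_rules(rules_json: str, auth_method: str) -> set:
    """Return service account names that have a binding rule for auth_method."""
    accounts = set()
    for rule in json.loads(rules_json):
        if rule.get("AuthMethod") != auth_method:
            continue
        # Component selectors in the listing look like: serviceaccount.name=="<name>"
        match = re.search(r'serviceaccount\.name=="([^"]+)"', rule.get("Selector", ""))
        if match:
            accounts.add(match.group(1))
    return accounts


def missing_accounts(rules_json: str, auth_method: str, expected: list) -> list:
    """List expected service accounts that have no binding rule yet."""
    return sorted(set(expected) - accounts_with_rules(rules_json, auth_method))


if __name__ == "__main__":
    # Trimmed version of the binding rules returned by the curl call above.
    rules = json.dumps([
        {"AuthMethod": "consul-consul-k8s-auth-method",
         "Selector": "serviceaccount.name!=default"},
        {"AuthMethod": "consul-consul-k8s-component-auth-method",
         "Selector": 'serviceaccount.name=="consul-consul-client"'},
        {"AuthMethod": "consul-consul-k8s-component-auth-method",
         "Selector": 'serviceaccount.name=="consul-consul-connect-injector"'},
    ])
    expected = [
        "consul-consul-client",
        "consul-consul-connect-injector",
        "consul-consul-controller",  # assumed name, inferred from the pattern
    ]
    print(missing_accounts(rules, "consul-consul-k8s-component-auth-method", expected))
```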

Then we manually added the binding rule for consul-consul-k8s-component-auth-method, and the controller pod changed to the Running state.
We wanted to check whether something is wrong or missing in our config, and how we can do away with the manual workarounds.
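For reference, the manual step can be expressed as a request against Consul's PUT /v1/acl/binding-rule endpoint. The Python sketch below only builds the request and does not send it. The service account `consul-consul-controller` and role `consul-consul-controller-acl-role` are assumptions that follow the naming pattern of the existing rules, and the function name is hypothetical; adjust all of them to what exists in your cluster.

```python
import json
import urllib.request


def build_binding_rule_request(consul_addr, token, auth_method, service_account, role_name):
    """Build (but do not send) a PUT request that creates a role binding rule."""
    payload = {
        "Description": f"Binding Rule for {service_account}",
        "AuthMethod": auth_method,
        "Selector": f'serviceaccount.name=="{service_account}"',
        "BindType": "role",
        "BindName": role_name,
    }
    return urllib.request.Request(
        url=f"{consul_addr}/v1/acl/binding-rule",
        data=json.dumps(payload).encode("utf-8"),
        method="PUT",
        headers={"X-Consul-Token": token, "Content-Type": "application/json"},
    )


if __name__ == "__main__":
    req = build_binding_rule_request(
        consul_addr="https://localhost:8501",
        token="<acl-token>",  # placeholder, as in the curl call above
        auth_method="consul-consul-k8s-component-auth-method",
        service_account="consul-consul-controller",     # assumed name
        role_name="consul-consul-controller-acl-role",  # assumed name
    )
    print(req.get_method(), req.full_url)
    print(req.data.decode("utf-8"))
```

Sending the request is left out on purpose: it would need `urllib.request.urlopen` with a TLS context suited to the cluster's certificates, which depends on the environment.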

The missing binding rule was created in the primary VM DC.
This seems to be due to this.
But since we are not using federation via mesh gateways, the controller pod tries to use the local auth method. This seems to be due to this.
As a result, the controller pod got stuck in the init phase. To unblock it, we had to create the binding rule manually in the secondary K8s DC.
Is it mandatory to use mesh gateways for federation? Are there any alternatives, or plans to support other ways of federating?

Hi Narendra,
I’ve responded here: "Is Mesh Gateway necessary for federation to work" (Issue #1253, hashicorp/consul-k8s), so let’s use that issue to discuss further.