Vault-agent-injector race condition when application pods using the vault sidecar come up faster than Vault

We are seeing a race condition with vault-agent-injector when we scale down our EKS worker nodes at night and scale them back up in the morning (for cost savings in non-prod environments).

What happens is that some of our application pods that require ENV secrets from the vault sidecar boot faster than the vault-agent-injector and vault pods. The application pod comes online before the vault-agent-injector pod and never receives its vault-agent-init or vault-agent containers. The application pod then immediately goes into an unrecoverable CrashLoopBackOff with:

sh: 1: .: cannot open /vault/secrets/config: No such file

That file, which contains the exported ENV secrets, is sourced before the pod runs its entrypoint script. If we kill the pod and let it restart, it then comes up healthy.
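For illustration, the container entrypoint follows roughly this pattern (a simplified sketch, not our actual chart; the image name and entrypoint path are placeholders). If vault-agent-init never ran, /vault/secrets/config does not exist and the `.` (source) command fails immediately:

containers:
  - name: app
    image: example/our-app:latest   # placeholder image
    command: ["/bin/sh", "-c"]
    args:
      # Source the ENV secrets rendered by vault-agent-init, then exec the
      # real entrypoint. Without the injected file the shell exits with the
      # "cannot open /vault/secrets/config" error shown above.
      - ". /vault/secrets/config && exec /app/entrypoint.sh"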

I know one solution would be to add an init container to our application Helm charts that waits on/pings vault-agent-injector-svc.vault.svc until that service resolves (roughly like the sketch below). However, I'm curious whether this is a known issue and/or whether there is a more "HashiCorp-expected" way of solving this problem.
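Something along these lines (an untested sketch; the image tag and timing are arbitrary, the service name is from our install):

initContainers:
  - name: wait-for-vault-injector
    image: busybox:1.36
    command: ["sh", "-c"]
    args:
      # Block the application container until the injector service resolves
      # in cluster DNS, then continue with the normal startup order.
      - |
        until nslookup vault-agent-injector-svc.vault.svc; do
          echo "waiting for vault-agent-injector-svc"; sleep 2
        done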

TIA!

You should be using just the init container if you only need access to your secrets at startup. The sidecar is more for proxying + caching (and for keeping the rendered secret output up to date, on top of what the init container writes).
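If the secrets are only needed at startup, the injector can be told to skip the long-running sidecar and only run the init container, roughly like this (a sketch of the documented annotations; verify the exact annotation names against the injector docs for your version, and the role name here is just an example):

annotations:
  vault.hashicorp.com/agent-inject: "true"
  vault.hashicorp.com/role: "my-app-role"              # example role
  # Only inject vault-agent-init (render secrets before the app starts);
  # no vault-agent sidecar is added to the pod.
  vault.hashicorp.com/agent-pre-populate-only: "true"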

I'm not sure what you mean by "You should be using the init container."
We are using the Vault Agent Kubernetes injector, which is what the documentation suggests (Agent Sidecar Injector Overview | Vault by HashiCorp). The vault-agent-injector pod creates the vault-agent-init container and the vault-agent container inside the annotated deployment of our service. We use Helm templating to configure this deployment.

I guess I'm not seeing the documentation where it mentions using only the init container. Could you please give a reference? TIA!

Our Helm code example

vault.hashicorp.com/agent-inject: {{ .Values.vault_agent_inject | quote }}
vault.hashicorp.com/role: "{{ .Values.vault_authentication_role }}"
# https://www.vaultproject.io/docs/platform/k8s/injector
vault.hashicorp.com/agent-inject-secret-config: "{{ .Values.vault_secrets_config_path }}"
# Environment variable export template
vault.hashicorp.com/agent-inject-template-config: |
  {{ printf "{{- with secret"}} {{ .Values.vault_secrets_config_path | quote }} {{`-}}
  ...
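For reference, after Helm renders the chart this annotation ends up as a plain Vault Agent template along these lines (the secret path and key names below are placeholders, not our real values):

vault.hashicorp.com/agent-inject-template-config: |
  {{- with secret "secret/data/my-app/config" -}}
  export DB_USERNAME="{{ .Data.data.db_username }}"
  export DB_PASSWORD="{{ .Data.data.db_password }}"
  {{- end }}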

Thanks for the response, but that doesn't appear to be related to the issue I am currently facing. The issue I am hitting is actually documented here:

If you use the init container (pre-deployment launch) then you won't have a race condition. It's certainly possible that you have some unusual Kubernetes implementation that causes this sort of oddity, but that's the answer in general.
If you think you can wait for that bug to be accepted and implemented as an enhancement, then you're welcome to.

It was included as an enhancement recently: Configuration | Vault by HashiCorp.

We just tried enabling it in our Vault Helm chart, and it did seem to prevent the race condition for our Helm-deployed application pods when we tested scaling the nodes down and back up.

I don't see how the extraInitContainers link you provided is related at all to secret injection by the Vault Agent Injector.

The issue is with application pods waiting for secrets to be injected by vault-agent-init, not with the Vault pods themselves, which is what you appear to be suggesting.

If anyone else experiences this issue: the solution was to enable failurePolicy: Fail for the injector in the Vault Helm chart.

injector:
  failurePolicy: Fail
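For anyone curious what that value does: the chart renders it as the failurePolicy of the injector's mutating webhook, roughly like the sketch below (the object and webhook names are from a default install and may differ in yours):

apiVersion: admissionregistration.k8s.io/v1
kind: MutatingWebhookConfiguration
metadata:
  name: vault-agent-injector-cfg
webhooks:
  - name: vault.hashicorp.com
    # Fail = the API server rejects pod creation while the injector webhook
    # is unavailable, instead of silently admitting pods without injection.
    failurePolicy: Fail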

This blocks pods from starting until the vault-agent-injector pod comes online. However, due to this bug you must use an injection selector to prevent pods that don't need injection from being blocked. Otherwise the API server will block all pods, even ones not related to or requesting Vault services.

In our case adding:

injector:
  namespaceSelector:
    matchExpressions:
    - key: kubernetes.io/metadata.name
      operator: NotIn
      values: ["vault","kube-system","kube-public","kube-node-lease"]

Did the trick.