Vault-agent-injector race condition when application pods using the vault sidecar come up faster than Vault

We are seeing a race condition with vault-agent-injector when we scale down our EKS worker nodes at night and scale them back up in the morning (for cost savings in non-prod environments).

What happens is that some of our application pods that require ENV secrets from the vault sidecar boot faster than the vault-agent-injector and vault pods. The application pod comes online before the vault-agent-injector pod and never receives its vault-agent-init or vault-agent containers. The application pod then immediately goes into an unrecoverable CrashLoopBackOff with:

sh: 1: .: cannot open /vault/secrets/config: No such file

That file, which contains the exported ENV secrets, is sourced before the pod runs its entrypoint script. If we kill the pod and let it restart, it then comes up healthy.
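For illustration, the container entrypoint follows roughly this pattern (a simplified sketch, not our actual chart; the image name and entrypoint path are placeholders). If vault-agent-init never ran, /vault/secrets/config does not exist and the `.` (source) command fails immediately:

containers:
  - name: app
    image: example/our-app:latest   # placeholder image
    command: ["/bin/sh", "-c"]
    args:
      # Source the ENV secrets rendered by vault-agent-init, then exec the
      # real entrypoint. Without the injected file the shell exits with the
      # "cannot open /vault/secrets/config" error shown above.
      - ". /vault/secrets/config && exec /app/entrypoint.sh"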

I know one solution would be to add an init container to our application Helm charts that waits on/pings vault-agent-injector-svc.vault.svc until that service resolves (roughly like the sketch below). However, I'm curious whether this is a known issue and/or whether there is a more "HashiCorp-expected" way of solving this problem.
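Something along these lines (an untested sketch; the image tag and timing are arbitrary, the service name is from our install):

initContainers:
  - name: wait-for-vault-injector
    image: busybox:1.36
    command: ["sh", "-c"]
    args:
      # Block the application container until the injector service resolves
      # in cluster DNS, then continue with the normal startup order.
      - |
        until nslookup vault-agent-injector-svc.vault.svc; do
          echo "waiting for vault-agent-injector-svc"; sleep 2
        done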

TIA!

You should be using just the init container if you only need access to your secrets at startup. The sidecar is more for proxying + caching (and for keeping the rendered secret output up to date, on top of what the init container writes).
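If the secrets are only needed at startup, the injector can be told to skip the long-running sidecar and only run the init container, roughly like this (a sketch of the documented annotations; verify the exact annotation names against the injector docs for your version, and the role name here is just an example):

annotations:
  vault.hashicorp.com/agent-inject: "true"
  vault.hashicorp.com/role: "my-app-role"              # example role
  # Only inject vault-agent-init (render secrets before the app starts);
  # no vault-agent sidecar is added to the pod.
  vault.hashicorp.com/agent-pre-populate-only: "true"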

I'm not sure what you mean by "You should be using the init container."
We are using the Vault Agent Kubernetes injector, which is what the documentation suggests (Agent Sidecar Injector Overview | Vault by HashiCorp). The vault-agent-injector pod creates the vault-agent-init container and the vault-agent container inside the annotated deployment of our service. We use Helm templating to configure this deployment.

I guess I'm not seeing the documentation where it mentions using only the init container. Could you please give a reference? TIA!

Our Helm code example

vault.hashicorp.com/agent-inject: {{ .Values.vault_agent_inject | quote }}
vault.hashicorp.com/role: "{{ .Values.vault_authentication_role }}"
# https://www.vaultproject.io/docs/platform/k8s/injector
vault.hashicorp.com/agent-inject-secret-config: "{{ .Values.vault_secrets_config_path }}"
# Environment variable export template
vault.hashicorp.com/agent-inject-template-config: |
  {{ printf "{{- with secret"}} {{ .Values.vault_secrets_config_path | quote }} {{`-}}
  ...
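For reference, after Helm renders the chart this annotation ends up as a plain Vault Agent template along these lines (the secret path and key names below are placeholders, not our real values):

vault.hashicorp.com/agent-inject-template-config: |
  {{- with secret "secret/data/my-app/config" -}}
  export DB_USERNAME="{{ .Data.data.db_username }}"
  export DB_PASSWORD="{{ .Data.data.db_password }}"
  {{- end }}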

Thanks for the response, but that doesn't appear to be related to the issue I am currently facing. The issue I am hitting is actually documented here:

If you use the init container (pre-deployment launch) then you won't have a race condition. It's certainly possible that you have some unusual Kubernetes implementation that causes this sort of oddity, but that's the answer in general.
If you think you can wait for that bug to be accepted and implemented as an enhancement, then you're welcome to.

It was included as an enhancement recently: Configuration | Vault by HashiCorp.

We just tried enabling it in our Vault Helm chart, and it did seem to prevent the race condition for our Helm-deployed application pods when we tested scaling the nodes down and back up.

I don't see how the extraInitContainers link you provided is related at all to secret injection by the Vault Agent Injector.

The issue is with application pods waiting for secrets to be injected by vault-agent-init, not with the Vault pods themselves, which is what you appear to be suggesting.

If anyone else experiences this issue: the solution was to enable failurePolicy: Fail for the injector in the Vault Helm chart.

injector:
  failurePolicy: Fail
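For anyone curious what that value does: the chart renders it as the failurePolicy of the injector's mutating webhook, roughly like the sketch below (the object and webhook names are from a default install and may differ in yours):

apiVersion: admissionregistration.k8s.io/v1
kind: MutatingWebhookConfiguration
metadata:
  name: vault-agent-injector-cfg
webhooks:
  - name: vault.hashicorp.com
    # Fail = the API server rejects pod creation while the injector webhook
    # is unavailable, instead of silently admitting pods without injection.
    failurePolicy: Fail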

This blocks pods from starting until the vault-agent-injector pod comes online. However, due to this bug you must use an injection selector to prevent pods that don't need injection from being blocked. Otherwise the API server will block all pods, even ones not related to or requesting Vault services.

In our case adding:

injector:
  namespaceSelector:
    matchExpressions:
    - key: kubernetes.io/metadata.name
      operator: NotIn
      values: ["vault","kube-system","kube-public","kube-node-lease"]

Did the trick.