Vault Agent Injector Not Being Triggered

Hello, everyone. I am trying to get the vault-agent-injector working in my K8s cluster, and I am seeing an issue where the mutating webhook does not seem to get triggered. Can anyone give feedback on the best next steps? I have included all the steps I have taken below.

Vault Version: 1.13.2

  1. In Vault, create a secrets engine named test-kv. In the test-kv secrets engine, create one secret named secret1:
 {
  "key1": "value1"
}
  2. Enable the injector and debug logging in the Helm chart and apply it. Verify the chart values are correctly assigned by running helm get values:
injector:
  enabled: true
  logLevel: debug
  3. In Vault, create a policy named service-policy with the following:
path "test-kv/data/*" {
  capabilities = ["read"]
}

Verify the path is valid by running read test-kv/data/secret1 from the Vault Web CLI.

  4. Create a K8s service account in the K8s cluster:
    kubectl create serviceaccount session-service-account

  5. Create a Vault role to bind the policy to the K8s service account:

#enables Kubernetes authentication
write sys/auth/kubernetes type=kubernetes 

write auth/kubernetes/role/session-service-role \
    bound_service_account_names=session-service-account \
    bound_service_account_namespaces=test \
    policies=service-policy \
    ttl=24h
  6. Verify it was set correctly by running this in the Vault web CLI:
read auth/kubernetes/role/session-service-role
  7. Review the vault-agent-injector webhook configuration:

kubectl get mutatingwebhookconfigurations vault-agent-injector-cfg -o yaml

In the output we can see it is enabled to run for all namespaces:

namespaceSelector: {}
  8. Create a simple pod to see if it will trigger the webhook:
apiVersion: v1
kind: Pod
metadata:
  name: test-pod-for-vault
  namespace: test
  annotations:
    vault.hashicorp.com/agent-inject: "true"
    vault.hashicorp.com/role: "session-service-role"
    vault.hashicorp.com/agent-inject-secret-secret1: "test-kv/data/secret1"
spec:
  serviceAccountName: session-service-account
  containers:
  - name: ubuntu
    image: ubuntu:latest

The pod comes up successfully, but nothing is added to it to show that the vault-agent-injector did anything. I tried the following troubleshooting steps to see what is causing this:

Injector Logs

Even though debug logging is enabled, there is nothing in the vault-agent-injector pod and container logs, so it seems the webhook does not get triggered for some reason. The logs only contain entries like the following:

[INFO]  handler.certwatcher: Webhooks changed. Updating certs...
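
For reference, this is roughly how I pull those logs (a sketch; it assumes the chart's default app.kubernetes.io/name=vault-agent-injector label and the vault release namespace):

kubectl logs -n vault -l app.kubernetes.io/name=vault-agent-injector --tail=200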

Network Policy

Reviewed the network policy and ensured my Vault namespace accepts ingress traffic from my test namespace.

As a test, I tried to exec into my test pod above and make a curl request with the CA cert to my K8s cluster:
curl --header "X-Vault-Token: $TOKEN" $VAULT_ADDR/v1/sys/health

This is successful. For $VAULT_ADDR, I am using the hostname and not the internal K8s DNS path (vault.vault.svc.cluster.local).

Check Kubernetes Auth

a. Perform a GET request to https://<my_host>/v1/auth/kubernetes/role/session-service-role. In the output, I can see the following:

bound_service_account_namespaces: "test"
policies: "service-policy"

b. Perform a GET request to https://<my_host>/v1/auth/kubernetes/config. In the output I can see the settings are there. To test whether the kubernetes_ca_cert is valid, I tried running kubectl get pod while manually passing that certificate. It works, so the cert seems valid.

For kubernetes_host I am using https://kubernetes.default.svc.cluster.local.
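
For reference, the same config and role can also be read with the vault CLI, assuming the CLI is available and authenticated (a sketch of the equivalent checks):

vault read auth/kubernetes/config
vault read auth/kubernetes/role/session-service-role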

Admission Controller Configuration

At this point I wondered if mutating webhooks were enabled at all for the cluster. On the master node, I reviewed the kube-apiserver YAML and confirmed that --enable-admission-plugins includes both MutatingAdmissionWebhook and ValidatingAdmissionWebhook.
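
For reference, on a kubeadm-style cluster this can be checked on the control plane node with something like the following (the manifest path is an assumption about how the API server is deployed):

grep enable-admission-plugins /etc/kubernetes/manifests/kube-apiserver.yaml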

Thank you in advance.

Hi @nat-ray,

First of all, kudos for sharing such a detailed description of your tests :heart: It really helps us get a feeling for what the issue might be.

Though I haven’t figured out a solution for your problem, there are a few things that stand out:

Are you using the Helm chart from https://helm.releases.hashicorp.com? As far as I can see, both the Chart.yaml and the values.yaml of the latest chart refer to version 1.13.1. Not that it would be an issue to update the image version… more just to be on the same page.

Perhaps, sharing the complete output without the sensitive parts would be more helpful.

I’m assuming this means no initContainer or sidecar.

That’s the most noteworthy part as far as I can tell.

That’s a bit confusing since the curl command actually shows a request to Vault using the X-Vault-Token header.

This is also a bit confusing. When you say hostname, do you mean the Pod name only or the Ingress host (assuming you’re exposing Vault via Ingress)? That makes a big difference when talking about Network Policies.

I wouldn’t worry too much about the Kubernetes auth for now. You can fix that later if needed. Once the Vault agent gets injected you’ll be able to see auth errors in the Pod logs.

So, back to Pod admission control: I would try to find out why Pod CREATE requests are not reaching the /mutate path on the Vault injector service. Assuming you’re using the official chart, there’s little room for something to be wrong in that configuration, so I would probably go back to the Network Policies and try to understand whether your assumptions about them are correct.


Hi Marco, thanks for your reply! I’ll try to clarify:

Yes, we are using the Helm chart from https://helm.releases.hashicorp.com. I just checked again and confirmed the version is set to 1.13.2.

Sure!

kubectl get mutatingwebhookconfigurations vault-agent-injector-cfg -o yaml
apiVersion: admissionregistration.k8s.io/v1
kind: MutatingWebhookConfiguration
metadata:
  annotations:
    meta.helm.sh/release-name: vault
    meta.helm.sh/release-namespace: vault
  creationTimestamp: "2021-01-26T15:12:07Z"
  generation: 1061
  labels:
    app.kubernetes.io/instance: vault
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: vault-agent-injector
  name: vault-agent-injector-cfg
  resourceVersion: <removed>
  uid: <removed>
webhooks:
- admissionReviewVersions:
  - v1
  - v1beta1
  clientConfig:
    caBundle: <removed>
    service:
      name: vault-agent-injector-svc
      namespace: vault
      path: /mutate
      port: 443
  failurePolicy: Ignore
  matchPolicy: Exact
  name: vault.hashicorp.com
  namespaceSelector: {}
  objectSelector:
    matchExpressions:
    - key: app.kubernetes.io/name
      operator: NotIn
      values:
      - vault-agent-injector
  reinvocationPolicy: Never
  rules:
  - apiGroups:
    - ""
    apiVersions:
    - v1
    operations:
    - CREATE
    - UPDATE
    resources:
    - pods
    scope: '*'
  sideEffects: None
  timeoutSeconds: 30

Yes, that’s right! The pod comes up but only the ubuntu container is there. It doesn’t seem to have done anything with Vault as far as I can tell.
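
For reference, this is roughly how I check that (a sketch):

kubectl get pod test-pod-for-vault -n test -o jsonpath='{.spec.initContainers[*].name} {.spec.containers[*].name}'
kubectl get pod test-pod-for-vault -n test -o jsonpath='{.metadata.annotations}'

From what I understand, a successfully injected pod should show a vault-agent-init init container, a vault-agent sidecar, and a vault.hashicorp.com/agent-inject-status annotation, and I see none of those here.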

I wanted to see if I could connect from my test pod to Vault, to verify if network connectivity was possible. I hope I am explaining this well.

I am exposing Vault in my K8s cluster using an nginx ingress on my internal network, and it is assigned a hostname of vault.region.orgname.com. So here are the commands I run:

kubectl exec -it test-pod-for-vault -- /bin/bash
TOKEN=$(cat /var/run/secrets/kubernetes.io/serviceaccount/token)
VAULT_ADDR='https://vault.region.orgname.com'

curl --header "X-Vault-Token: $TOKEN" $VAULT_ADDR/v1/sys/health

I can connect and get valid output from the Vault API. But if I set VAULT_ADDR='https://vault.vault.svc.cluster.local' then I receive a timeout error.

Good to know!

Is there some other way to test this to verify?

I think this is where the problem lies. First, just a comment: the $TOKEN you’re sending as a header doesn’t make much of a difference, since that endpoint does not require authentication.

But the fact that you can access it via Ingress and not via Service does say a lot. When that curl request is made to https://vault.region.orgname.com, the Network Policies that are taken into consideration are the Egress rules for test-pod-for-vault and the Ingress rules for your Ingress controllers which are probably both allowing 0.0.0.0/0 (plus the relevant rules for the inter-namespace communication). At this point, I would just take a better look at the Network Policies you have in place.

It’s not unheard of that, when Network Policies with a default deny are created for every namespace, certain cluster functionalities that are assumed to just always work start failing.

Possible scenarios include:

  • Loss of connectivity to CoreDNS.
  • No communication between Pods in the same namespace.
  • Kubernetes API not able to reach Webhook Pods (see this reference)
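
To make the first scenario concrete: an allowance for DNS egress in a namespace with a default deny usually ends up looking something like this (just a sketch; the policy name and the test namespace are placeholders, and it relies on the kubernetes.io/metadata.name label that Kubernetes sets on namespaces from 1.22 on):

kubectl apply -f - <<'EOF'
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-dns-egress        # placeholder name
  namespace: test               # placeholder: whichever namespace carries the default deny
spec:
  podSelector: {}
  policyTypes:
  - Egress
  egress:
  - to:
    - namespaceSelector:
        matchLabels:
          kubernetes.io/metadata.name: kube-system
    ports:
    - protocol: UDP
      port: 53
    - protocol: TCP
      port: 53
EOF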

A good tutorial can be found here.


Thank you for pointing that out. Good to know!

We have a default deny configured in our Vault namespace, and then have a second policy configured to allow incoming connections from a few select namespaces. We don’t restrict the outgoing connections.

As a test, I updated the policy to allow incoming connections from my test pod’s namespace, but I still did not see anything. I also noticed that we do not have any network policy specific to the Vault webhook, and I am wondering if that may be related.

Assuming that the Vault Webhook is running in the same namespace as Vault, the Network Policy needs to allow not only the test namespace but also the Kubernetes API itself to call the Webhook endpoint. See my last point in the post above.

I think the webhook is not bound to a namespace, unless there is something I am not following. As a test, I tried the following today:

  1. In my vault namespace, delete the deny-all network policy.

  2. Get the API server IP:
    kubectl get svc kubernetes -n default -o=jsonpath='{.spec.clusterIP}'

  3. In the vault namespace, add a network policy for the webhook and API server, following the example from the Elastic article you shared:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-webhook-access-from-apiserver
  namespace: vault
spec:
  podSelector:
    matchLabels:
      app.kubernetes.io/name: vault-agent-injector
  policyTypes:
  - Ingress
  ingress:
  - from:
    - ipBlock:
        cidr: <IP here>/32
    ports:
    - port: 443
  4. Delete and recreate my test pod:
apiVersion: v1
kind: Pod
metadata:
  name: test-pod-for-vault
  namespace: test
  annotations:
    vault.hashicorp.com/agent-inject: "true"
    vault.hashicorp.com/role: "session-service-role"
    vault.hashicorp.com/agent-inject-secret-secret1: "test-kv/data/secret1"
spec:
  serviceAccountName: session-service-account
  containers:
  - name: ubuntu
    image: ubuntu:latest
    command: ["bash"]
    args: ["-c", "sleep infinity"]

Result:

  • The pod comes up, but I do not see any Vault container injected.
  • We do not see anything added to the vault-agent-injector debug log.
  • I also noticed that I still cannot curl from my test pod to the Vault service:
curl https://vault.vault.svc.cluster.local/v1/sys/health
curl: (28) Failed to connect to vault.vault.svc.cluster.local port 443 after 130003 ms: Connection timed out
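
As an extra data point, I may also try probing the injector's /mutate endpoint from inside the vault namespace (a rough sketch; the curl-test pod name and the curlimages/curl image are just my own choices here):

kubectl run -n vault curl-test --rm -it --restart=Never --image=curlimages/curl --command -- \
  curl -sk -m 5 https://vault-agent-injector-svc.vault.svc:443/mutate

Any HTTP response at all, even an error, would at least show the service is reachable; a hang would point back at the network policy.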

If you look in the YAML definition of your mutating webhook, you’ll see:

  clientConfig:
    caBundle: <removed>
    service:
      name: vault-agent-injector-svc
      namespace: vault
      path: /mutate
      port: 443

Which means the service listening for the webhook is namespaced and subject to network policies just as any other. Unfortunately (or fortunately), not even the Kube API is exempt from those policies.

Note that once you create that policy, you automatically create a deny all for everything else that’s not explicitly allowed. That explains why the test pod wasn’t able to connect to Vault.

Still not sure why this is happening. Can you check the Kubernetes API audit (and container) logs during the same timeframe?

That means requests are not getting there.

Because of the default deny I mentioned above.

Just to confirm the hypothesis, can you try the same without any Network Policies in the namespace, assuming that Egress from other namespaces is always allowed?


Hello, I don’t think this issue is version based. I experienced the issue too with v1.13.1, and I retested with v1.12.1.

The cause of this error could be that your mounted TLS certs are not correct; in my case, that was the cause.


This makes sense. I tried temporarily removing all the network policies in my vault namespace, and the webhook started to get triggered. Now I need to figure out a policy that will allow the webhook to get triggered.
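
As a first experiment, I am thinking of allowing all ingress to just the injector pods (a sketch, not a hardened policy; the policy name is a placeholder, and it assumes the chart’s app.kubernetes.io/name=vault-agent-injector label), since pinning down the API server’s source address in an ipBlock seems tricky:

kubectl apply -f - <<'EOF'
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-injector-webhook    # placeholder name
  namespace: vault
spec:
  podSelector:
    matchLabels:
      app.kubernetes.io/name: vault-agent-injector
  policyTypes:
  - Ingress
  ingress:
  - {}                            # empty rule: allow all ingress to the selected pods
EOF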


I think this is exactly where I am at now. My vault-agent-init container log shows the following:

2023-05-26T18:39:36.753Z [ERROR] agent.auth.handler: error authenticating:
  error=
  | Error making API request.
  |
  | URL: PUT https://vault.vault.svc:8200/v1/auth/kubernetes/login
  | Code: 403. Errors:
  |
  | * permission denied

I wondered if my service account did not have permission to log in to Vault, so I made a POST request to http://<vault_host>/v1/auth/kubernetes/login and passed this in the request body via Postman:

{"jwt": "<service-account-token>", "role": "session-service-role"}

I also get a permission denied error from that.

I then went back to my Vault role to double-check that it was set up correctly:

read auth/kubernetes/role/session-service-role

Key                              Value                         
alias_name_source                serviceaccount_uid            
bound_service_account_names      ["session-service-account"]
bound_service_account_namespaces ["test"]                      
policies                         ["session-service-policy"] 
token_bound_cidrs                []                            
token_explicit_max_ttl           0                             
token_max_ttl                    0                             
token_no_default_policy          false                         
token_num_uses                   0                             
token_period                     0                             
token_policies                   ["session-service-policy"] 
token_ttl                        86400                         
token_type                       default                       
ttl                              86400    

I can also see my policy here in Vault, and the name matches. I was reviewing this article, which goes through the steps to set this up, but it’s not clear to me how what I’m doing differs from the steps they shared.
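
In case it helps, here is the same login test run from inside the test pod itself, using its projected service account token rather than a token pasted into Postman (a sketch, reusing the Ingress hostname from earlier):

kubectl exec -it test-pod-for-vault -n test -- /bin/bash
TOKEN=$(cat /var/run/secrets/kubernetes.io/serviceaccount/token)
curl --request POST \
  --data "{\"jwt\": \"$TOKEN\", \"role\": \"session-service-role\"}" \
  https://vault.region.orgname.com/v1/auth/kubernetes/login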

Actually the Vault server logs would be more helpful in this case.

I also didn’t see the Kubernetes auth plugin configuration in your initial post (only the Vault role for the Kubernetes service account). These are the relevant sections of the documentation:

  1. Configuration
  2. Kubernetes 1.21
  3. Configuring Kubernetes

There was a problem with my Vault Kubernetes auth config. I went through the process again and the vault agent injector is now able to inject the secrets into my test pod! Thank you so much for the help.

I am trying to troubleshoot the network policy issue further now. This is my current policy:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: base-allow
  namespace: vault
spec:
  ingress:
  - from:
    - namespaceSelector:
        matchLabels:
          name: vault
  - from:
    - namespaceSelector:
        matchLabels:
          name: istio-system
  - from:
    - namespaceSelector:
        matchLabels:
          name: monitoring
  - from:
    - namespaceSelector:
        matchLabels:
          name: kube-system
  - from:
    - namespaceSelector:
        matchLabels:
          name: test
  podSelector: {}
  policyTypes:
  - Ingress

One thing I noticed is that my kube-system namespace did not have the name=kube-system label, and I thought that might be why the webhook is unreachable. I added the label to the kube-system namespace, then tried recreating my test pod. The issue persists: the vault-agent-injector webhook does not get triggered while this network policy is in place.

I also reviewed the K8s audit log like you mentioned, and am using jq to see only the entries related to Vault where an error occurred:

jq 'select(.responseStatus.code >= 400 and (.requestURI? | if . then test("vault") else false end))' kube-apiserver-audit.json

I see some forbidden errors here:

{
  "kind": "Event",
  "apiVersion": "audit.k8s.io/v1",
  "level": "Metadata",
  "auditID": "<removed>",
  "stage": "ResponseComplete",
  "requestURI": "/api/v1/namespaces/vault/configmaps/ingress-controller-leader",
  "verb": "update",
  "user": {
    "username": "<removed>-nginx-ingress-internal",
    "uid": "<removed>",
    "groups": [
      "system:serviceaccounts",
      "system:serviceaccounts:vault",
      "system:authenticated"
    ],
    "extra": {
      "authentication.kubernetes.io/pod-name": [
        "<removed>"
      ],
      "authentication.kubernetes.io/pod-uid": [
        "<removed>"
      ]
    }
  },
  "sourceIPs": [
    "<removed>"
  ],
  "userAgent": "nginx-ingress-controller/v1.0.0 (linux/amd64) ingress-nginx/<removed>",
  "objectRef": {
    "resource": "configmaps",
    "namespace": "vault",
    "name": "ingress-controller-leader",
    "apiVersion": "v1"
  },
  "responseStatus": {
    "metadata": {},
    "status": "Failure",
    "reason": "Forbidden",
    "code": 403
  },
  "requestReceivedTimestamp": "2023-05-30T17:06:14.057600Z",
  "stageTimestamp": "2023-05-30T17:06:14.057883Z",
  "annotations": {
    "authorization.k8s.io/decision": "forbid",
    "authorization.k8s.io/reason": ""
  }
}

I’m not sure if this is very helpful. I also tried filtering on just the entries for webhooks:

jq 'select(.objectRef.resource == "validatingwebhookconfigurations")' kube-apiserver-audit.json

And saw many entries like the following:

{
  "kind": "Event",
  "apiVersion": "audit.k8s.io/v1",
  "level": "Metadata",
  "auditID": "<removed>",
  "stage": "ResponseStarted",
  "requestURI": "/apis/admissionregistration.k8s.io/v1/validatingwebhookconfigurations?allowWatchBookmarks=true&resourceVersion=removed&timeout=6m23s&timeoutSeconds=383&watch=true",
  "verb": "watch",
  "user": {
    "username": "system:apiserver",
    "uid": "<removed>",
    "groups": [
      "system:masters"
    ]
  },
  "sourceIPs": [
    "::1"
  ],
  "userAgent": "kube-apiserver/v1.23.13 (linux/amd64) kubernetes/592eca0",
  "objectRef": {
    "resource": "validatingwebhookconfigurations",
    "apiGroup": "admissionregistration.k8s.io",
    "apiVersion": "v1"
  },
  "responseStatus": {
    "metadata": {},
    "code": 200
  },
  "requestReceivedTimestamp": "2023-05-30T17:01:37.779145Z",
  "stageTimestamp": "2023-05-30T17:01:37.779534Z",
  "annotations": {
    "authorization.k8s.io/decision": "allow",
    "authorization.k8s.io/reason": ""
  }
}

Everything is status code 200.

Does anything here stand out to you? Is there anything that you would filter on to get a better idea of why it’s still failing?
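
One more filter I could try (a sketch built from the same audit event fields as above) is to look at the CREATE events for the test pod itself, to see whether anything about the admission call shows up on those events:

jq 'select(.objectRef.resource == "pods" and .verb == "create" and .objectRef.namespace == "test")' kube-apiserver-audit.json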

The recommendation, starting from Kubernetes 1.22, is to use the label kubernetes.io/metadata.name (just FYI).

These can be ignored for the purpose of this topic.

You should probably search for mutatingwebhookconfigurations instead of validatingwebhookconfigurations.


Thank you, I updated my network policy in the vault namespace to use kubernetes.io/metadata.name and confirmed those labels are already on all of my namespaces (I’m on K8s 1.23).
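For reference, I checked the labels with this (a sketch):

kubectl get namespaces -L kubernetes.io/metadata.name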

I then re-created my test pod and verified that the webhook is still not reached.

I downloaded the latest audit log and tried running the following:

jq 'select(.objectRef.resource == "mutatingwebhookconfigurations")' kube-apiserver-audit.log

Everything returned shows the decision as allowed, and it is all for the same service account, the istio-operator. It is just the same entry repeating in the log:

{
  "kind": "Event",
  "apiVersion": "audit.k8s.io/v1",
  "level": "Metadata",
  "auditID": "<removed>",
  "stage": "ResponseStarted",
  "requestURI": "/apis/admissionregistration.k8s.io/v1/mutatingwebhookconfigurations?allowWatchBookmarks=true&resourceVersion=<removed>&timeoutSeconds=470&watch=true",
  "verb": "watch",
  "user": {
    "username": "system:serviceaccount:istio-operator:istio-operator-<removed>",
    "uid": "<removed>",
    "groups": [
      "system:serviceaccounts",
      "system:serviceaccounts:istio-operator",
      "system:authenticated"
    ],
    "extra": {
      "authentication.kubernetes.io/pod-name": [
        "istio-operator-<removed>"
      ],
      "authentication.kubernetes.io/pod-uid": [
        "<removed>"
      ]
    }
  },
  "sourceIPs": [
    "<removed>"
  ],
  "userAgent": "operator/v0.0.0 (linux/amd64) kubernetes/$Format",
  "objectRef": {
    "resource": "mutatingwebhookconfigurations",
    "apiGroup": "admissionregistration.k8s.io",
    "apiVersion": "v1"
  },
  "responseStatus": {
    "metadata": {},
    "code": 200
  },
  "requestReceivedTimestamp": "2023-05-31T13:52:55.391686Z",
  "stageTimestamp": "2023-05-31T13:52:55.393466Z",
  "annotations": {
    "authorization.k8s.io/decision": "allow",
    "authorization.k8s.io/reason": "RBAC: allowed by ClusterRoleBinding \"istio-operator-<removed>\" of ClusterRole \"istio-operator-<removed>\" to ServiceAccount \"istio-operator-<removed>\""
  }
}

If I run jq 'select(.annotations."authorization.k8s.io/decision" == "deny")' kube-apiserver-audit.log there is nothing returned.

I was reviewing this clusterrole for istio-operator; it has very wide-ranging permissions within the cluster. Compared to that, my vault-agent-injector clusterrole and clusterrolebinding only have the following, and only in the vault namespace:

PolicyRule:
  Resources                                                   Non-Resource URLs  Resource Names  Verbs
  ---------                                                   -----------------  --------------  -----
  mutatingwebhookconfigurations.admissionregistration.k8s.io  []                 []              [get list watch patch]

It’s not too clear to me if this might be the cause.

I guess the API wouldn’t receive a “deny” response because it never reaches the webhook.

Mine looks the same and works :wink:

I’m kinda running out of ideas here :pensive:. Perhaps actual Kube API container logs would help. It seems to me you’re managing the cluster yourself so it shouldn’t be hard to get those.


That makes sense.

I am seeing entries like this in my kube-apiserver pods:

I0530 15:55:06.430166      11 trace.go:205] Trace[642276701]: "Call mutating webhook" configuration:vault-agent-injector-cfg,webhook:vault.hashicorp.com,resource:/v1, Resource=pods,subresource:,operation:CREATE,UID:<removed> (30-May-2023 15:54:36.429) (total time: 30000ms):
Trace[642276701]: [30.000413526s] [30.000413526s] END
W0530 15:55:06.430211      11 dispatcher.go:180] Failed calling webhook, failing open vault.hashicorp.com: failed calling webhook "vault.hashicorp.com": failed to call webhook: Post "https://vault-agent-injector-svc.vault.svc:443/mutate?timeout=30s": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
E0530 15:55:06.430241      11 dispatcher.go:184] failed calling webhook "vault.hashicorp.com": failed to call webhook: Post "https://vault-agent-injector-svc.vault.svc:443/mutate?timeout=30s": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
I0530 15:55:06.442992      11 trace.go:205] Trace[1808959561]: "Create" url:/api/v1/namespaces/test/pods,user-agent:kubectl/v1.25.4 (darwin/arm64) kubernetes/<removed>,audit-id:<removed>,client:M<removed>,accept:application/json,protocol:HTTP/2.0 (30-May-2023 15:54:36.411) (total time: 30031ms):
Trace[1808959561]: ---"Object stored in database" 30031ms (15:55:06.442)
Trace[1808959561]: [30.031645204s] [30.031645204s] END

It seems like this is the same issue shown before where I could not curl from my test pod to the vault service.
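
To get a bit closer to the path the API server takes, I may also try this from a control-plane node (a sketch; it assumes kubectl and curl are available on the node and that ClusterIPs are reachable from the node via kube-proxy):

SVC_IP=$(kubectl -n vault get svc vault-agent-injector-svc -o jsonpath='{.spec.clusterIP}')
curl -sk -m 5 "https://${SVC_IP}:443/mutate"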

Yeah, it just confirms our suspicion.

What CNI do you use?

Next thing I would do is try to understand why the traffic is being blocked when the namespace is allowed as a source. Perhaps the CNI pod logs will give you a hint.

I am using Calico. The thing is, though, that I have zero Calico network policies enabled, and only one global network policy, which blocks access to the EC2 metadata endpoint:

apiVersion: projectcalico.org/v3
kind: GlobalNetworkPolicy
metadata:
  name: allow-all-egress-except-ec2-metadata
  resourceVersion: <removed>
  uid: <removed>
spec:
  egress:
  - action: Deny
    destination:
      nets:
      - 169.254.169.254/32
    protocol: TCP
    source: {}
  - action: Allow
    destination:
      nets:
      - 0.0.0.0/0
    source: {}
  selector: all()
  types:
  - Egress

In my Calico pod logs, I only see this one entry appearing each time I try to create my test pod:

calico-node-<removed> calico-node 2023-05-31 16:52:06.272 [INFO][75] felix/status_combiner.go 98: Reporting combined status. id=proto.WorkloadEndpointID{OrchestratorId:"k8s", WorkloadId:"vault/vault-agent-injector-<removed>", EndpointId:"<removed>"} status="up"

Perhaps kube-proxy logs then, or even getting a shell on a node and inspecting the iptables rules? I’m not really sure what direction to take now.

What if you use Calico Network Policies instead of using Kubernetes native Network Policies and try to debug them using calicoctl?
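
For example (just a sketch, assuming calicoctl is installed and pointed at the cluster's datastore):

calicoctl get globalnetworkpolicy -o wide
calicoctl get networkpolicy -n vault -o wide

If I remember correctly, Kubernetes-native policies show up there with a knp.default. prefix, which makes it easier to see exactly what Calico is enforcing in the vault namespace.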