[Vault agent injector] Retrieving large number of secrets

Hello :smiley:

I am working on a blockchain project where we are using the OSS framework Fabric combined with the Vault PKI engine.
Configuring a Hyperledger Fabric network requires a fair number of certificates, keys and configuration files.
Once we've enabled our certificate authorities and issued all the material we need, we store this material in the KV secrets engine.

When deploying our solution into k8s, we use the agent injector to retrieve that material from our Vault cluster.
Everything works like a charm (as usual with HashiCorp products :heart:).

However, I can see a warning in the vault agent init container about a potential DDoS due to the number of secrets that we are fetching.

I do understand the intent of this warning, and we have set up the agent injector to only pre-populate those secrets, meaning there is no sidecar container running and keeping them in sync.
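For reference, the annotations on our pods look roughly like this (the role name and KV paths below are illustrative, not our real ones):

```yaml
annotations:
  vault.hashicorp.com/agent-inject: "true"
  # Init container only; no long-lived sidecar keeping the secrets in sync.
  vault.hashicorp.com/agent-pre-populate-only: "true"
  vault.hashicorp.com/role: "fabric-peer"
  # One agent-inject-secret-<name> annotation per secret, which is where the
  # count adds up to 100+.
  vault.hashicorp.com/agent-inject-secret-tls-cert: "kv/data/fabric/org1/peer0/tls-cert"
  vault.hashicorp.com/agent-inject-secret-tls-key: "kv/data/fabric/org1/peer0/tls-key"
```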

Instead of retrieving our secrets in one go, we tried to split the process up as much as possible. Unfortunately we still get this warning, so I wonder whether our approach is the right one.

My question is the following: what do you recommend when we have to retrieve a large number of secrets (100+) using the vault agent injector?

Hello!

I think I might be able to help you with this. But before jumping to any kind of advice, a couple of questions.

  1. Do you use any type of caching with the agent? If not, is this by design, or just because you haven't used it yet? (See the annotation sketch after this list.)
  2. Do all secrets need to be pre-populated?
  3. How often are static secrets updated?
  4. How many (%) of your secrets are dynamic?
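For context, agent-side caching with the injector is normally switched on through annotations along these lines (a minimal sketch, assuming a reasonably recent vault-k8s release):

```yaml
# Enables the Vault Agent's in-memory cache so that repeated reads of the same
# secret are served by the agent instead of hitting Vault every time.
annotations:
  vault.hashicorp.com/agent-cache-enable: "true"
  # Let cached/proxied requests reuse the agent's auto-auth token.
  vault.hashicorp.com/agent-cache-use-auto-auth-token: "true"
```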

Hello :slight_smile:

  1. No, it is by design.
  2. Yes. We have a job fetching all secrets and mounting them into different PVCs; those PVCs are then mounted into specific components, so each component only has access to the part of the PKI material that it requires (see the sketch after this list).
  3. Very, very rarely: they are the certificates and keys used in our PKI.
  4. As of now, none.
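To give an idea, the job looks roughly like this (role, claim names, image and paths are illustrative, not our real ones):

```yaml
# Sketch of the kind of Job we run: the injector's init container renders the
# secrets to /vault/secrets, and the Job copies them onto a PVC that the
# corresponding Fabric component later mounts.
apiVersion: batch/v1
kind: Job
metadata:
  name: fetch-peer0-msp
spec:
  template:
    metadata:
      annotations:
        vault.hashicorp.com/agent-inject: "true"
        vault.hashicorp.com/agent-pre-populate-only: "true"
        vault.hashicorp.com/role: "fabric-peer"
        vault.hashicorp.com/agent-inject-secret-signcert: "kv/data/fabric/org1/peer0/signcert"
        # ...plus one agent-inject-secret-<name> annotation per piece of material
    spec:
      serviceAccountName: fabric-peer
      restartPolicy: Never
      containers:
        - name: copy-secrets
          image: busybox:1.36
          command: ["sh", "-c", "cp /vault/secrets/* /msp/"]
          volumeMounts:
            - name: peer0-msp
              mountPath: /msp
      volumes:
        - name: peer0-msp
          persistentVolumeClaim:
            claimName: peer0-msp
```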

If it isn’t causing an outage on your cluster, you can basically ignore this. This is a hard waterline number that triggers the warning and does not take into account your cluster’s configuration or availability.

If you are having an impact on your cluster:

  • The cheap option, if latency allows, is to add a non-voting performance standby node to your existing cluster, dedicated to serving your Kubernetes cluster.
  • The expensive solution is to run a performance replication secondary cluster inside your Kubernetes cluster, which does the sync for you and is available locally.