We have installed Vault with file storage on a Kubernetes cluster. When running 1000 login and create requests, the latency is more than 15 seconds. This is highly unacceptable in a production environment. The number of threads running inside Vault is 24. Is there a way to increase the number of parallel threads?
Hi @rekha - I certainly don’t want (or expect) you to have a 15-second login time, but there are many factors.
What version of Vault?
When you say file storage do you mean integrated storage?
How many nodes in the Vault cluster?
What resources are available to the nodes in the Kubernetes cluster (CPU, disk type, network type, memory, networking, etc)?
How many nodes in the Kubernetes cluster?
What is the resource availability of the nodes in the Kubernetes cluster?
How much resources have been assigned to the Vault pod?
How have resources been isolated (Vault pod vs others)?
What is the Vault configuration and auth method being used for these login requests?
Any errors in the logs (and where are the logs stored, if Vault can’t write to the logs that could cause a problem)?
Are there any odd/anomalous metrics from the Vault pod when these 1000 login attempts are coming in?
Where are the login attempts coming from? Are they in the same kubernetes cluster (and region) or are they separate?
This guide may also be helpful:
As well as this to benchmark your configuration:
Thank you @jonathanfrappier for reviewing my query. Please refer to my responses below.
What version of Vault?
vault 1.15.5
When you say file storage do you mean integrated storage?
"storage": {"file": {"path": "/vault/data"}}. This is the parameter in vault.hcl. The folder is mounted on EFS and the storage backend used is file.
How many nodes in the Vault cluster?
single
What resources are available to the nodes in the Kubernetes cluster (CPU, disk type, network type, memory, networking, etc)?
t3 medium (2vcpu, 4GB RAM) & 4 t3 xlarge(each 4vCPU, 16GB RAM),
How many nodes in the Kubernetes cluster?
2 nodes
How much resources have been assigned to the Vault pod?
2 vcpu, 8GB RAM
How have resources been isolated (Vault pod vs others)?
Vault requests are routed through a separate ELB ingress.
What is the Vault configuration and auth method being used for these login requests?
Logins are done using approle role_id and secret_id with TTL of 20m
Any errors in the logs (and where are the logs stored, if Vault can’t write to the logs that could cause a problem)?
No errors, getting response after 12s
Are there any odd/anomalous metrics from the Vault pod when these 1000 login attempts are coming in?
Not able to set up the metrics.
Where are the login attempts coming from? Are they in the same kubernetes cluster (and region) or are they separate?
From a JMeter setup within the organization in the same region
We tried increasing the replicas to 12 and reducing the resources for each replica (0.25 CPU, 2GB RAM). The latency dropped to about 3s but is not consistent; sometimes it took about 15s. Please let me know if there is a way to debug this. What other processes could be increasing the latency?
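One way to put numbers on that inconsistency is sketched below in Python. It is only a sketch under assumptions: AppRole is assumed to be mounted at the default `auth/approle` path, `VAULT_ADDR` is a placeholder for the ELB address, and `timed_login`, `summarize`, and `run_load` are hypothetical helper names. It fires concurrent logins and reports min, median, p95, and max, so a 3s median with a 15s tail becomes visible.

```python
import json
import statistics
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor

# Assumption: placeholder address; replace with the ELB/ingress endpoint.
VAULT_ADDR = "http://127.0.0.1:8200"


def timed_login(role_id: str, secret_id: str) -> float:
    """Return the wall-clock seconds taken by one AppRole login."""
    body = json.dumps({"role_id": role_id, "secret_id": secret_id}).encode()
    req = urllib.request.Request(
        # Assumption: AppRole is enabled at the default mount path "approle".
        f"{VAULT_ADDR}/v1/auth/approle/login",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    start = time.perf_counter()
    with urllib.request.urlopen(req, timeout=60) as resp:
        resp.read()
    return time.perf_counter() - start


def summarize(latencies: list[float]) -> dict[str, float]:
    """Summarize latencies (seconds) as min, median, p95, and max."""
    ordered = sorted(latencies)
    # Nearest-rank p95, computed with integer arithmetic: ceil(0.95 * n).
    rank = (95 * len(ordered) + 99) // 100
    return {
        "min": ordered[0],
        "median": statistics.median(ordered),
        "p95": ordered[rank - 1],
        "max": ordered[-1],
    }


def run_load(role_id: str, secret_id: str, total: int = 100, workers: int = 20) -> dict[str, float]:
    """Issue `total` logins using `workers` threads and summarize the latencies."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(timed_login, role_id, secret_id) for _ in range(total)]
        return summarize([f.result() for f in futures])
```

Calling `run_load(role_id, secret_id)` from a host in the same region, once against the ELB and once directly against the pod, would help separate Vault latency from load balancer latency.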
Thanks for providing that additional detail. Is it a safe assumption that your kubernetes cluster is EKS/Amazon based (made that jump from your node sizing response)?
One thing that initially jumps out, you mention your kubernetes nodes are sized as:
t3 medium (2vcpu, 4GB RAM) & 4 t3 xlarge(each 4vCPU, 16GB RAM)
A general recommendation for “small” Vault nodes is to have ~ 2-4 cores & 8-16 GB RAM each, so there could be some sizing issues at play, but hard to say without metrics.
If my assumption about EKS/AWS is correct, are you able to set up metrics for the EC2 instances/Kubernetes nodes?
And possibly something like Container Insights - Amazon CloudWatch for the Vault pod?
Thanks a lot @jonathanfrappier -
We are using an AWS/EKS cluster. We increased the pod size to 4 vCPU & 16GB RAM and tested. The latency is now ~12s, which is still high.
The request flow is:
1. Login using approle role_id and secret_id
2. Write to a KV path (the path has about 5M objects)
3. Login again
4. Read the path
Additionally, when using multiple pods with a single EFS file storage, although the latency was lower, there is an issue with data duplication.
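To see which step of this flow is slow, each request can be timed separately. The following Python sketch makes several assumptions that are not confirmed in this thread: a KV version 2 engine mounted at `secret`, AppRole at the default `auth/approle` path, and hypothetical helper names (`build_login`, `build_kv_write`, `build_kv_read`, `send`, `run_flow`). For a KV version 1 mount the `/data/` path segment and the `{"data": ...}` wrapper would not apply.

```python
import json
import time
import urllib.request


def build_login(addr: str, role_id: str, secret_id: str) -> urllib.request.Request:
    """AppRole login request (assumes the default "approle" mount)."""
    body = json.dumps({"role_id": role_id, "secret_id": secret_id}).encode()
    return urllib.request.Request(
        f"{addr}/v1/auth/approle/login", data=body,
        headers={"Content-Type": "application/json"}, method="POST")


def build_kv_write(addr: str, token: str, path: str, data: dict) -> urllib.request.Request:
    """KV v2 write request (assumes a KV v2 engine mounted at "secret")."""
    body = json.dumps({"data": data}).encode()
    return urllib.request.Request(
        f"{addr}/v1/secret/data/{path}", data=body,
        headers={"X-Vault-Token": token, "Content-Type": "application/json"},
        method="POST")


def build_kv_read(addr: str, token: str, path: str) -> urllib.request.Request:
    """KV v2 read request for the same path."""
    return urllib.request.Request(
        f"{addr}/v1/secret/data/{path}",
        headers={"X-Vault-Token": token}, method="GET")


def send(req: urllib.request.Request) -> tuple[dict, float]:
    """Send a request; return the decoded JSON body and elapsed seconds."""
    start = time.perf_counter()
    with urllib.request.urlopen(req, timeout=60) as resp:
        raw = resp.read()
    elapsed = time.perf_counter() - start
    return (json.loads(raw) if raw else {}), elapsed


def run_flow(addr: str, role_id: str, secret_id: str, path: str, data: dict) -> dict[str, float]:
    """Run login, write, login, read and return the seconds spent in each step."""
    timings: dict[str, float] = {}
    auth, timings["login_1"] = send(build_login(addr, role_id, secret_id))
    token = auth["auth"]["client_token"]
    _, timings["write"] = send(build_kv_write(addr, token, path, data))
    auth, timings["login_2"] = send(build_login(addr, role_id, secret_id))
    token = auth["auth"]["client_token"]
    _, timings["read"] = send(build_kv_read(addr, token, path))
    return timings
```

If the write or read step dominates, the storage backend is the more likely suspect; if the logins dominate, the auth path or token storage is worth a closer look.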
Did you also increase the node size in the cluster? If nodes are only sized with 16GB RAM each, then the pod would in theory be exhausting the node's resources. That could potentially introduce disk swap, which is why I was asking if you could set up metrics. I am just guessing without metrics from both Vault and Kubernetes.
Another question, when these nodes are taking 12s to log in, have you tried to simultaneously log in on your own? Curious if other sources see the same login behavior.
@jonathanfrappier I could not upload the CPU and memory utilization screenshot. The pod utilization when we run 100 login and create requests is 0.1 CPU and 130MB. With pod utilization this low, I fail to understand the need to keep the resources at 4 CPU & 16GB RAM.
The load balancer’s target response time is showing 12s. Only the vault traffic is directed to this load balancer.
There is a marked improvement if I increase the number of pods, but I am facing an issue with duplication of data. A write to a KV mount path does not overwrite the existing data but creates a new entry. Is this an issue where reads are served from the cache of the individual pods? Can we disable the cache reads?