Vault - PKI Engine performance degradation

thilinak · January 12, 2021, 9:32pm

Hi,

I noticed an abnormal performance degradation in the HashiCorp Vault PKI engine when certificate requests were made in parallel. I need to know:
i. Whether this is expected? (There is no publicly available SLA stat for HashiCorp Vault.)

If not,
i. what is causing the severe slowdown of concurrent certificate requests
ii. what are the suggestions to improve the performance of Vault

I used the following script in 16 parallel sessions to request certificates from the PKI engine.

The following are the response times in milliseconds. P1 to P16 are different sessions (threads).

Please note that when the certificate requests were made from a single session, performance was significantly better.

Please refer to the following graph (response time in milliseconds) for requesting certificates from a single session.
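
The script and graphs are not shown here. As a rough illustration only, a comparison of this kind (one session versus 16 parallel sessions, per-request latency in milliseconds) could be sketched as below. This is not the script used in the test: `fake_issue` is a hypothetical placeholder that only sleeps, and it would need to be swapped for a real request to the PKI issue endpoint to reproduce anything.

```python
import statistics
import time
from concurrent.futures import ThreadPoolExecutor


def timed_calls(issue, n_requests):
    """Call issue() n_requests times in sequence; return latencies in ms."""
    latencies = []
    for _ in range(n_requests):
        start = time.perf_counter()
        issue()
        latencies.append((time.perf_counter() - start) * 1000.0)
    return latencies


def run_sessions(issue, sessions, n_requests):
    """Run `sessions` parallel workers, each making n_requests calls."""
    with ThreadPoolExecutor(max_workers=sessions) as pool:
        futures = [
            pool.submit(timed_calls, issue, n_requests) for _ in range(sessions)
        ]
        return [f.result() for f in futures]


def fake_issue():
    # Hypothetical stand-in for the real call (for example an HTTP POST to
    # <VAULT_ADDR>/v1/<mount>/issue/<role>); here it only sleeps for 10 ms.
    time.sleep(0.01)


if __name__ == "__main__":
    for sessions in (1, 16):
        results = run_sessions(fake_issue, sessions, n_requests=5)
        flat = [ms for worker in results for ms in worker]
        print(f"{sessions:>2} session(s): median {statistics.median(flat):.1f} ms")
```

With a real request in place of `fake_issue`, a median that grows with the session count would point to contention on the server side rather than in the client.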

To set up the PKI engine, I followed the guideline in PKI - Secrets Engines - HTTP API | Vault by HashiCorp (I can provide the script I used to set up the PKI engine if you think this could be a configuration issue). I used Raft as the storage.

Vault version: Vault v1.4.2

mikegreen · January 12, 2021, 9:45pm

What architecture? ELB? Node size? Auditing?
Outputs from dstat/iostat/etc - bottleneck should be visible in those.

thilinak · January 13, 2021, 6:22am

Used C5 instances. Red Hat 7.7. Yes, ALB with 3 nodes (HA setup with Raft).

To narrow down the issue, I also tried a physical machine (Intel Xeon E312xx, 24G memory, Red Hat 7.4; no HA setup, just a single node). The performance degradation was the same.

I analyzed the Vault audit logs. It looks like the response itself takes time. There is no indication of why it takes that long to respond when multiple requests are received at once.

mikegreen · January 13, 2021, 10:27pm

What does the server show is under stress?

You might test with no_store set to true, in case it is IO contention…
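
A minimal sketch of what that test could look like, assuming the engine is mounted at `pki` and the role is named `example-dot-com` (both hypothetical, as are the address, token, and the other role settings). `no_store` is a PKI role parameter; the snippet only builds the HTTP request and does not send it.

```python
import json
import urllib.request


def build_role_request(vault_addr, token, mount, role, params):
    """Build (but do not send) a POST to <mount>/roles/<role>."""
    url = f"{vault_addr}/v1/{mount}/roles/{role}"
    body = json.dumps(params).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={"X-Vault-Token": token, "Content-Type": "application/json"},
    )


# Hypothetical role settings; only no_store comes from the suggestion above.
role_params = {
    "allowed_domains": ["example.com"],
    "allow_subdomains": True,
    "max_ttl": "72h",
    "no_store": True,
}

req = build_role_request(
    "http://127.0.0.1:8200",   # assumed local Vault address
    "s.hypothetical-token",    # placeholder token
    "pki",                     # assumed mount path
    "example-dot-com",         # assumed role name
    role_params,
)
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
# To apply it against a real server: urllib.request.urlopen(req)
```

Keep in mind that certificates issued from a role with `no_store` set are not written to the storage backend, so they cannot be listed or revoked later. It is probably best treated as a diagnostic switch first.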

thilinak · January 15, 2021, 9:00am

Thanks for the info. Yes, it could be IO contention. I will update on this.
Found this post (Vault - High Availability and Scalability) where they test 1000 concurrent requests against a non-enterprise Vault instance in less than 3 seconds using t2.small instances. Since my requirement is lower than that, I should be able to fix this performance issue without resorting to “Performance standbys” (Performance Standby Nodes | Vault - HashiCorp Learn) in a Vault Enterprise instance and/or any enterprise features of Raft integrated storage. There were many insights in HashiCorp Vault Performance Benchmark | by Stenio Ferreira | HashiCorp Solutions Engineering Blog | Medium as well.

mikegreen · January 15, 2021, 4:56pm

This isn’t a valid comparison to your test. It appears they are doing KV reads? PKI is a much more CPU-intensive and non-caching action.
A perf standby can service KV reads without going to the active node.
If you’re storing the certificates you’re issuing, having an infinite number of perf standbys won’t help, as the request still has to be forwarded to the active node for the write to occur.

thilinak · January 15, 2021, 9:17pm

Thanks for pointing out the KV reads. I missed that. Yes, based on my observations, 1000 concurrent certificate requests within 3 seconds on a t2.small would have been a surprise as well. My load is far lower than that. Issuing around 20 concurrent requests within 3 seconds is enough.

I hope right-sizing and tweaking the backend configuration can achieve it. As you’ve suggested, I’ll also check the possibility of not storing certificates in the backend.

mikegreen · January 15, 2021, 9:54pm

Check out this repo and pki test:

github.com: mikegreen/vault-benchmarking/blob/master/write-pki.lua

-- Script that writes secrets to pki engine in Vault
-- Indicate number of secrets to write to pki/example_pki path with "-- <N>"

local counter = 1
local threads = {}

function setup(thread)
   thread:set("id", counter)
   table.insert(threads, thread)
   counter = counter + 1
end

function os.capture(cmd, raw)
  local f = assert(io.popen(cmd, 'r'))
  local s = assert(f:read('*a'))
  f:close()
  if raw then return s end
  s = string.gsub(s, '^%s+', '')
  s = string.gsub(s, '%s+$', '')
  s = string.gsub(s, '[\n\r]+', ' ')

This file has been truncated.

Let me know your results.

On a 10-year-old single-CPU box this is what I get (note: some values are in microseconds, e.g. 352942 μs ≈ 353 ms):

    Latency   390.89ms  189.45ms   1.21s    72.94%
    Req/Sec     2.80      1.89    10.00     93.18%
  440 requests in 1.41m, 1.94MB read
Requests/sec:      5.20
Transfer/sec:     23.51KB
Audit is enabled. Eventually this should tell you which audit method as well.

JSON Output:
{
	"requests": 440,
	"duration_in_microseconds": 84570768.00,
	"bytes": 2036357,
	"requests_per_sec": 5.20,
	"bytes_transfer_per_sec": 24078.73,
	"latency_distribution": [
		{
			"percentile": 50,
			"latency_in_microseconds": 352942
		},
		{
			"percentile": 75,
			"latency_in_microseconds": 482117
		},
		{
			"percentile": 90,
			"latency_in_microseconds": 632266
		},
		{
			"percentile": 99,
			"latency_in_microseconds": 1030345
		},
		{
			"percentile": 99.9,
			"latency_in_microseconds": 1211287
		},
		{
			"percentile": 99.99,
			"latency_in_microseconds": 1211287
		},
		{
			"percentile": 99.999,
			"latency_in_microseconds": 1211287
		},
		{
			"percentile": 100,
			"latency_in_microseconds": 0
		}
	]
}
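
For reading the numbers above, a small sketch that converts the reported microsecond latencies to milliseconds and re-derives the throughput from the request count and duration. The values are copied from the JSON output (a subset of the percentiles); the `summarize` helper is just an illustrative name.

```python
import json

# Subset of the JSON output above, values copied as-is.
raw = """
{
  "requests": 440,
  "duration_in_microseconds": 84570768.00,
  "latency_distribution": [
    {"percentile": 50, "latency_in_microseconds": 352942},
    {"percentile": 75, "latency_in_microseconds": 482117},
    {"percentile": 90, "latency_in_microseconds": 632266},
    {"percentile": 99, "latency_in_microseconds": 1030345}
  ]
}
"""


def summarize(report):
    """Return (requests per second, {percentile: latency in ms})."""
    duration_s = report["duration_in_microseconds"] / 1_000_000
    rps = report["requests"] / duration_s
    latencies_ms = {
        entry["percentile"]: entry["latency_in_microseconds"] / 1000
        for entry in report["latency_distribution"]
    }
    return rps, latencies_ms


rps, latencies_ms = summarize(json.loads(raw))
print(f"throughput: {rps:.2f} req/s")
for percentile, ms in latencies_ms.items():
    print(f"p{percentile}: {ms:.1f} ms")
```

At roughly 5.2 requests per second, 20 certificates would take about 3.8 seconds on that particular box. That is only a rough reference point for the "20 requests within 3 seconds" target mentioned earlier, not a prediction for other hardware.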
