Vault - PKI Engine performance degradation


I noticed an abnormal performance degradation in the HashiCorp Vault PKI engine when certificate requests were made in parallel. I need to know:
i. Whether this is expected (there is no publicly available SLA figure for HashiCorp Vault).

If not,
i. what is causing the severe slowdown of concurrent certificate requests, and
ii. what is suggested to improve Vault's performance.

I used the following script in 16 parallel sessions to request certificates from the PKI engine.
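For readers who want to reproduce a comparable load, here is a minimal sketch of a 16-session test. It is not the script used in this post. It assumes the engine is mounted at `pki`, that a role named `example-dot-com` (hypothetical) exists and allows `test.example.com`, and that `VAULT_ADDR` and `VAULT_TOKEN` are set in the environment.

```python
import json
import os
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor


def issue_certificate(addr, token, role, common_name):
    """POST to pki/issue/<role> and return the elapsed time in milliseconds."""
    req = urllib.request.Request(
        f"{addr}/v1/pki/issue/{role}",  # assumes the engine is mounted at "pki"
        data=json.dumps({"common_name": common_name}).encode(),
        headers={"X-Vault-Token": token, "Content-Type": "application/json"},
        method="POST",
    )
    start = time.perf_counter()
    with urllib.request.urlopen(req) as resp:
        resp.read()
    return (time.perf_counter() - start) * 1000.0


def run_sessions(issue, sessions=16, requests_per_session=10):
    """Start `sessions` threads; each calls `issue()` sequentially.

    Returns one list of latencies per session (P1..Pn).
    """
    def worker(_):
        return [issue() for _ in range(requests_per_session)]

    with ThreadPoolExecutor(max_workers=sessions) as pool:
        return list(pool.map(worker, range(sessions)))


def main():
    addr = os.environ.get("VAULT_ADDR", "http://127.0.0.1:8200")
    token = os.environ["VAULT_TOKEN"]
    # Role and common name are hypothetical placeholders.
    results = run_sessions(
        lambda: issue_certificate(addr, token, "example-dot-com", "test.example.com")
    )
    for i, latencies in enumerate(results, start=1):
        print(f"P{i}: " + ", ".join(f"{ms:.0f}" for ms in latencies))
```

Calling `main()` prints one line of response times per session. Passing `sessions=1` to `run_sessions` gives the single-session baseline for comparison.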

The following are the response times in milliseconds. P1 to P16 are the different sessions (threads).

Please note that when the certificate requests were made from a single session, performance was significantly better.

Please refer to the following graph (response time in milliseconds) for certificate requests made from a single session.

To set up the PKI engine, I followed the guideline in "PKI - Secrets Engines - HTTP API | Vault by HashiCorp" (I can provide the script I used to set up the PKI engine if you think this could be a configuration issue). I used Raft as the storage backend.

Vault version: Vault v1.4.2

What architecture? ELB? Node size? Auditing?
Outputs from dstat/iostat/etc - bottleneck should be visible in those.

Used C5 instances. Redhat 7.7. Yes, ALB with 3 nodes (HA setup with raft).

To narrow down the issue, I also tried a physical machine (Intel Xeon E312xx, 24 GB memory, Red Hat 7.4; no HA setup, just a single node). The performance degradation was the same.

I analyzed the Vault audit logs. It looks like the response itself takes time; there is no indication of why it takes so long to respond when multiple requests are received at once.

What on the server shows as being under stress?

You might test with no_store set to true in case it is I/O contention…
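As a rough sketch of that test, `no_store` is a parameter on the PKI role, so one option is a separate benchmark role with it enabled. The role name `bench-no-store`, the `example.com` domain, the `72h` TTL, and the `pki` mount path below are assumptions for illustration. Note that certificates issued with `no_store` cannot be enumerated or revoked, so this suits short-lived certificates or load tests only.

```python
import json
import urllib.request


def build_role_request(addr, token, role, allowed_domains, no_store=True):
    """Build (but do not send) a POST to pki/roles/<role> with no_store set."""
    payload = {
        "allowed_domains": allowed_domains,
        "allow_subdomains": True,
        "max_ttl": "72h",      # assumed value for the test role
        "no_store": no_store,  # skip writing issued certificates to storage
    }
    return urllib.request.Request(
        f"{addr}/v1/pki/roles/{role}",  # assumes the engine is mounted at "pki"
        data=json.dumps(payload).encode(),
        headers={"X-Vault-Token": token, "Content-Type": "application/json"},
        method="POST",
    )
```

Sending the request with `urllib.request.urlopen(...)` creates or updates the role. Rerunning the same parallel test against this role and comparing latencies should indicate whether storage writes are the bottleneck.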

Thanks for the info. Yes, it could be I/O contention. I will update on this.
I found this post (Vault - High Availability and Scalability) where they test 1000 concurrent requests against a non-enterprise Vault instance in less than 3 seconds using t2.small instances. Since my requirement is lower than that, I should be able to fix this performance issue without resorting to "Performance standbys" (Performance Standby Nodes | Vault - HashiCorp Learn) in a Vault Enterprise instance and/or any enterprise features of Raft integrated storage. There were also many insights in "HashiCorp Vault Performance Benchmark | by Stenio Ferreira | HashiCorp Solutions Engineering Blog | Medium".

This isn’t a valid comparison to your test. It appears they are doing KV reads? PKI issuance is a much more CPU-intensive operation, and its results cannot be cached.
A perf standby can service KV reads without going to the active node.
If you’re storing the certificates you’re issuing, having an infinite number of perf standbys won’t help, as each request still has to be forwarded to the active node for the write to occur.


Thanks for pointing out the KV reads. I missed that. Yes, based on my observations, 1000 concurrent certificate requests within 3 seconds on a t2.small would have been a surprise. My load is far lower than that: issuing around 20 concurrent requests within 3 seconds is enough.

I hope right-sizing and tuning the backend configuration can achieve that. As you’ve highlighted, I’ll also check the possibility of not storing certificates in the backend.

Check out this repo and pki test:

Let me know your results.

On a 10-year-old single-CPU box this is what I get (note: some values are in microseconds; 352942 μs ≈ 353 ms)

    Latency   390.89ms  189.45ms   1.21s    72.94%
    Req/Sec     2.80      1.89    10.00     93.18%
  440 requests in 1.41m, 1.94MB read
Requests/sec:      5.20
Transfer/sec:     23.51KB
Audit is enabled. Eventually this should tell you which audit method as well.

JSON Output:
{
	"requests": 440,
	"duration_in_microseconds": 84570768.00,
	"bytes": 2036357,
	"requests_per_sec": 5.20,
	"bytes_transfer_per_sec": 24078.73,
	"latency_distribution": [
		{
			"percentile": 50,
			"latency_in_microseconds": 352942
		},
		{
			"percentile": 75,
			"latency_in_microseconds": 482117
		},
		{
			"percentile": 90,
			"latency_in_microseconds": 632266
		},
		{
			"percentile": 99,
			"latency_in_microseconds": 1030345
		},
		{
			"percentile": 99.9,
			"latency_in_microseconds": 1211287
		},
		{
			"percentile": 99.99,
			"latency_in_microseconds": 1211287
		},
		{
			"percentile": 99.999,
			"latency_in_microseconds": 1211287
		},
		{
			"percentile": 100,
			"latency_in_microseconds": 0
		}
	]
}
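
The microsecond figures above can be converted to milliseconds, and the reported throughput cross-checked, with a few lines of arithmetic. This is only a sketch using the numbers from the output; the variable and function names are illustrative.

```python
# Figures copied from the JSON output above.
requests = 440
duration_us = 84570768.00
latency_us = {50: 352942, 75: 482117, 90: 632266, 99: 1030345, 99.9: 1211287}


def us_to_ms(us):
    """Convert microseconds to milliseconds."""
    return us / 1000.0


# Throughput = requests / duration in seconds.
throughput = requests / (duration_us / 1_000_000)
print(f"throughput: {throughput:.2f} req/s")

for percentile, us in latency_us.items():
    print(f"p{percentile}: {us_to_ms(us):.1f} ms")
```

The computed throughput matches the reported 5.20 requests per second, and the median works out to about 353 ms.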