We are using the Vault transit engine to store transit encryption keys.
Memory usage increases slowly but monotonically until it reaches the EKS memory limit and the pod is recycled.
We ran a script that adds new transit keys to Vault at a rate of 100,000 keys per hour, up to 900,000 keys per day. While the script runs, memory usage grows much faster and the memory limit is breached within a few hours.
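For reference, a load script of this kind could look roughly like the sketch below. It is illustrative only, not the exact script we ran: it assumes the transit engine is mounted at `transit/`, uses the default `aes256-gcm96` key type, and the Vault address, token, and `loadtest-key` prefix are hypothetical placeholders.

```python
import json
import time
import urllib.request

KEYS_PER_HOUR = 100_000  # target rate quoted above


def seconds_between_keys(keys_per_hour: int = KEYS_PER_HOUR) -> float:
    """Delay between requests needed to hit the target hourly rate."""
    return 3600.0 / keys_per_hour


def build_create_key_request(vault_addr: str, token: str, key_name: str) -> urllib.request.Request:
    """Build (but do not send) a POST to the transit create-key endpoint.

    Assumes the transit engine is mounted at the default path `transit/`.
    """
    url = f"{vault_addr.rstrip('/')}/v1/transit/keys/{key_name}"
    body = json.dumps({"type": "aes256-gcm96"}).encode()
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={"X-Vault-Token": token, "Content-Type": "application/json"},
    )


def create_keys(vault_addr: str, token: str, count: int, prefix: str = "loadtest-key") -> None:
    """Create `count` transit keys, paced to roughly KEYS_PER_HOUR."""
    delay = seconds_between_keys()
    for i in range(count):
        req = build_create_key_request(vault_addr, token, f"{prefix}-{i}")
        with urllib.request.urlopen(req, timeout=10) as resp:
            resp.read()  # drain the response; errors raise urllib.error.HTTPError
        time.sleep(delay)
```

A call such as `create_keys("http://127.0.0.1:8200", "<token>", 1000)` would create 1,000 keys at about one key every 36 ms, not counting request latency.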
Initially we thought it was a cache issue, so we set the cache size to 1. The issue persisted.
After that, we disabled the global cache with disable_cache = true. The issue still persisted.
After running the script, we saw a mismatch in vault.db size between the integrated storage replicas. On the pod that was leader when we ran the script, the Raft DB grew to 16 GB, whereas the other replicas are at 12 GB. Is it expected that vault.db sizes differ between replicas?
We are tracking container_memory_working_set_bytes to measure memory.
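To put a number on the growth, the slope of container_memory_working_set_bytes can be fitted and extrapolated to the 15Gi limit. The sketch below is one way to do that with a plain least-squares fit. It assumes the samples have already been exported as (timestamp in seconds, bytes) pairs, and the function names are hypothetical.

```python
from typing import Sequence

GIB = 1024 ** 3
MEMORY_LIMIT_BYTES = 15 * GIB  # limits.memory from the resource limits below


def growth_rate_bytes_per_sec(timestamps: Sequence[float], values: Sequence[float]) -> float:
    """Ordinary least-squares slope of memory (bytes) against time (seconds)."""
    n = len(timestamps)
    if n < 2 or n != len(values):
        raise ValueError("need at least two (timestamp, value) pairs")
    mean_t = sum(timestamps) / n
    mean_v = sum(values) / n
    var_t = sum((t - mean_t) ** 2 for t in timestamps)
    if var_t == 0:
        raise ValueError("timestamps must not all be identical")
    cov = sum((t - mean_t) * (v - mean_v) for t, v in zip(timestamps, values))
    return cov / var_t


def seconds_until_limit(current_bytes: float, rate: float,
                        limit: float = MEMORY_LIMIT_BYTES) -> float:
    """Linear extrapolation of the time left before the memory limit is hit."""
    if rate <= 0:
        return float("inf")  # flat or shrinking usage never reaches the limit
    return max(0.0, (limit - current_bytes) / rate)
```

With made-up samples of 10, 11, and 12 GiB taken one hour apart, the fitted rate is 1 GiB per hour, and the extrapolation gives about three hours until the 15Gi limit. These figures are only an illustration, not measurements from our cluster.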
Total number of keys: 3 million.
Setup information:
We are using the free version of Vault in HA mode with 3 nodes: 1 active (leader) and 2 standby.
Vault version:-
1.12.0
Resource Limits:-
resources:
  requests:
    memory: 12Gi
    cpu: 4000m
  limits:
    memory: 15Gi
    cpu: 8000m
HA config:-
ha:
  enabled: true
  replicas: 3
  raft:
    enabled: true
    setNodeId: true
    config: |
      ui = true
      disable_cache = true
      listener "tcp" {
        tls_disable = 1
        address = "[::]:8200"
        cluster_address = "[::]:8201"
        telemetry {
          unauthenticated_metrics_access = "true"
        }
      }
      storage "raft" {
        path = "/vault/data"
        retry_join {
          leader_api_addr = "http://vault-prod-0.vault-prod-internal:8200"
        }
        retry_join {
          leader_api_addr = "http://vault-prod-1.vault-prod-internal:8200"
        }
        retry_join {
          leader_api_addr = "http://vault-prod-2.vault-prod-internal:8200"
        }
        autopilot {
          cleanup_dead_servers = "true"
          last_contact_threshold = "200ms"
          last_contact_failure_threshold = "10m"
          max_trailing_logs = 250000
          min_quorum = 2
          server_stabilization_time = "10s"
        }
      }
      seal "awskms" {
        region = <region>,
        kms_key_id = <KMS key>
      }
      telemetry {
        prometheus_retention_time = "30s",
        disable_hostname = true
      }
      service_registration "kubernetes" {}