Vault memory increases monotonically over time

We are using the Vault Transit secrets engine for storing encryption keys.
Memory keeps increasing slowly and monotonically until it reaches the EKS memory limit and the pod is recycled.

We ran a script that adds new transit keys to Vault at a rate of 100,000 keys per hour, up to 900,000 keys per day. Because of this, memory usage grows at a drastic rate and the memory limit is breached within a few hours.
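For reference, the load is generated roughly like the sketch below (illustrative only, not our exact script; the key-name pattern, the "transit/" mount path, and the VAULT_ADDR/VAULT_TOKEN environment variables are assumptions):

    import os
    import requests

    VAULT_ADDR = os.environ["VAULT_ADDR"]            # e.g. http://vault-prod-internal:8200
    HEADERS = {"X-Vault-Token": os.environ["VAULT_TOKEN"]}

    def create_transit_key(name, key_type="aes256-gcm96"):
        # POST /v1/transit/keys/:name creates a new named encryption key
        resp = requests.post(f"{VAULT_ADDR}/v1/transit/keys/{name}",
                             headers=HEADERS, json={"type": key_type})
        resp.raise_for_status()

    if __name__ == "__main__":
        for i in range(100_000):                     # roughly one hour of load at the rate above
            create_transit_key(f"load-test-key-{i:06d}")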

Initially we thought it was a cache issue, so we set the cache size to 1. The issue still persisted.
After that we disabled the global cache (disable_cache = true). The issue still persisted.

After running the script, we saw a mismatch in vault.db size between the Integrated Storage replicas. The raft db on the pod that was leader while the script ran grew to 16 GB, whereas the other raft replicas are at 12 GB. Is it expected for the Vault db sizes to differ between replicas?
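(Side note for anyone comparing replicas: the Raft peer set and Autopilot server state can be read over the API, which at least shows whether a follower is lagging behind on its log index. A rough sketch, assuming VAULT_ADDR and VAULT_TOKEN are set:)

    import os
    import requests

    VAULT_ADDR = os.environ["VAULT_ADDR"]
    HEADERS = {"X-Vault-Token": os.environ["VAULT_TOKEN"]}

    # GET /v1/sys/storage/raft/configuration lists the peers and which one is leader.
    config = requests.get(f"{VAULT_ADDR}/v1/sys/storage/raft/configuration",
                          headers=HEADERS).json()
    for server in config["data"]["config"]["servers"]:
        print(server["node_id"], "leader" if server["leader"] else "follower")

    # GET /v1/sys/storage/raft/autopilot/state reports per-server health and last applied index.
    state = requests.get(f"{VAULT_ADDR}/v1/sys/storage/raft/autopilot/state",
                         headers=HEADERS).json()
    for name, server in state["data"]["servers"].items():
        print(name, "healthy:", server["healthy"], "last_index:", server["last_index"])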

We are tracking container_memory_working_set_bytes for memory.

Total number of keys: 3 million.

Setup information:

We are using the free (open-source) version of Vault in HA mode with 3 nodes: 1 active and 2 standby.

Vault version:-
1.12.0

Resource Limits:-
resources:
  requests:
    memory: 12Gi
    cpu: 4000m
  limits:
    memory: 15Gi
    cpu: 8000m

HA config:-
ha:
  enabled: true
  replicas: 3
  raft:
    enabled: true
    setNodeId: true

    config: |
        ui = true
        disable_cache = true
        listener "tcp" {
          tls_disable = 1
          address = "[::]:8200"
          cluster_address = "[::]:8201"
          telemetry {
            unauthenticated_metrics_access = "true"
          }
        }

        storage "raft" {
          path = "/vault/data"
          retry_join {
            leader_api_addr = "http://vault-prod-0.vault-prod-internal:8200"
          }
          retry_join {
            leader_api_addr = "http://vault-prod-1.vault-prod-internal:8200"
          }
          retry_join {
            leader_api_addr = "http://vault-prod-2.vault-prod-internal:8200"
          }
          autopilot {
            cleanup_dead_servers = "true"
            last_contact_threshold = "200ms"
            last_contact_failure_threshold = "10m"
            max_trailing_logs = 250000
            min_quorum = 2
            server_stabilization_time = "10s"
          }
        }

        seal "awskms" {
          region = <region>
          kms_key_id = <KMS key>
        }

        telemetry {
          prometheus_retention_time = "30s"
          disable_hostname = true
        }

        service_registration "kubernetes" {}

This is the wrong cache to be disabling.

There is a cache specific to the transit secrets engine, which defaults to unlimited retention in memory. Please see this configuration API call: Transit - Secrets Engines - HTTP API | Vault | HashiCorp Developer
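For example, bounding that cache through the endpoint looks roughly like this (a sketch only; the "transit/" mount path and the size value are illustrative, and VAULT_ADDR/VAULT_TOKEN are assumed to be set):

    import os
    import requests

    VAULT_ADDR = os.environ["VAULT_ADDR"]
    HEADERS = {"X-Vault-Token": os.environ["VAULT_TOKEN"]}

    # Bound the transit cache to 10000 entries (0 means unlimited; the minimum non-zero size is 10).
    requests.post(f"{VAULT_ADDR}/v1/transit/cache-config",
                  headers=HEADERS, json={"size": 10000}).raise_for_status()

    # Read the setting back. Per the docs, the transit mount/plugin must be reloaded
    # before a new cache size takes effect.
    print(requests.get(f"{VAULT_ADDR}/v1/transit/cache-config", headers=HEADERS).json())

Bounding the cache trades some repeated storage reads for a predictable memory ceiling.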

In my personal experience, it is quite common for .0 (or even .1, .2) releases of Vault to have quality issues - you really should be upgrading at least to the latest 1.12.x release.

Hi @maxb, thanks for the reply

  1. After we set “disable_cache = true”, we hit the transit/cache-config endpoint (as mentioned in the docs), and it returns:
Error reading transit/cache-config: Error making API request.

URL: GET http://127.0.0.1:8200/v1/transit/cache-config
Code: 500. Errors:

* 1 error occurred:
	* caching is disabled for this transit mount

So it seems this cache is also disabled.
Could something still be caching, or could anything else be going on?
Also, we currently have around 800K leases in Vault, but they did not suddenly increase or anything like that.

@maxb, can you please help with further steps?

@maxb we set the memory requests and limits equal in our EKS configuration. That seems to have resolved the issue.

Please refer to this thread for more information: kubelet counts active page cache against memory.available (maybe it shouldn't?) · Issue #43916 · kubernetes/kubernetes · GitHub
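(For anyone else hitting this: container_memory_working_set_bytes includes active page cache, so comparing it with container_memory_rss makes the page-cache contribution, e.g. from the mmapped raft vault.db, visible. A rough sketch against the Prometheus HTTP API; the Prometheus URL and the pod/container label selectors are assumptions:)

    import os
    import requests

    PROM_URL = os.environ.get("PROM_URL", "http://prometheus:9090")

    def instant_query(expr):
        # GET /api/v1/query evaluates an instant PromQL query
        resp = requests.get(f"{PROM_URL}/api/v1/query", params={"query": expr})
        resp.raise_for_status()
        return resp.json()["data"]["result"]

    working_set = {r["metric"].get("pod"): float(r["value"][1])
                   for r in instant_query('container_memory_working_set_bytes{pod=~"vault-prod-.*",container="vault"}')}
    rss = {r["metric"].get("pod"): float(r["value"][1])
           for r in instant_query('container_memory_rss{pod=~"vault-prod-.*",container="vault"}')}

    for pod in sorted(working_set):
        print(pod, "working_set:", working_set[pod], "rss:", rss.get(pod))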