Question about Vault failover test in EKS

Hi all,
I am running Vault on EKS with Raft as the storage backend. During a failover test, I see 5xx errors for about 15 to 25 seconds. Is this normal?

I am using an ALB, and the active pod is operating in a healthy state. When the active pod is removed for testing, it takes about 15 to 25 seconds to return to a 200 status.
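For anyone wanting to reproduce the measurement, a minimal Python sketch of how such a failover window could be timed is shown here. It is not the script used in this test; the URL in the usage comment is a placeholder, and the function names `probe`, `poll` and `longest_outage` are made up for illustration. It polls an endpoint at a fixed interval and reports the longest stretch of non-200 responses.

```python
import time
import urllib.error
import urllib.request


def probe(url, timeout_s=2.0):
    """Return the HTTP status for one GET, or 0 if no response was received."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses are raised as HTTPError; keep the status code.
        return err.code
    except OSError:
        # Connection refused, DNS failure, timeout, and similar.
        return 0


def poll(url, duration_s, interval_s=0.5):
    """Probe `url` repeatedly and return a list of (timestamp, status)."""
    samples = []
    deadline = time.monotonic() + duration_s
    while time.monotonic() < deadline:
        samples.append((time.monotonic(), probe(url)))
        time.sleep(interval_s)
    return samples


def longest_outage(samples):
    """Longest span in seconds from a first non-200 sample to the next 200."""
    longest = 0.0
    start = None
    for t, status in samples:
        if status != 200:
            if start is None:
                start = t
        elif start is not None:
            longest = max(longest, t - start)
            start = None
    # An outage still in progress at the end is measured to the last sample.
    if start is not None and samples:
        longest = max(longest, samples[-1][0] - start)
    return longest


# Hypothetical usage: start polling, then delete the active pod.
# samples = poll("https://vault.example.internal/v1/sys/health", duration_s=120)
# print(f"longest outage: {longest_outage(samples):.1f}s")
```

The resolution of the result is limited by `interval_s` plus the per-request timeout, so a reported 16 seconds should be read as approximate.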

Hi @x980707x,

What health checks are configured on the ALB? See "Health checks for your target groups" in the Elastic Load Balancing documentation.

If changing the load balancer is an option, an NLB is recommended.

Hi @jonathanfrappier,

I set every target group setting to its minimum value, which cut the failover time from about 25 seconds to about 16 seconds. That still seems long. Would switching to an NLB reduce it further, or have I misconfigured something? My settings are below.

# Vault Helm Chart Value Overrides
vault:
  fullnameOverride: vault
  global:
    enabled: true
    tlsDisable: true
  injector:
    enabled: false
  server:
    image:
      repository: "hashicorp/vault"
      tag: "1.15.1"
      pullPolicy: IfNotPresent
    readinessProbe:
      enabled: true
      path: "/v1/sys/health?standbyok=true&sealedcode=204&uninitcode=204"
    livenessProbe:
      enabled: true
      path: "/v1/sys/health?standbyok=true"
      initialDelaySeconds: 60
    standalone:
      enabled: false
    resources:
      requests:
        memory: 256Mi
        cpu: 250m
      limits:
        memory: 256Mi
        cpu: 250m
    ingress:
      enabled: true
      annotations:
        kubernetes.io/ingress.class: alb
        alb.ingress.kubernetes.io/scheme: internal
        alb.ingress.kubernetes.io/target-type: ip
        alb.ingress.kubernetes.io/security-groups: ...
        alb.ingress.kubernetes.io/certificate-arn: ...
        alb.ingress.kubernetes.io/subnets: ...
        alb.ingress.kubernetes.io/healthcheck-path: /v1/sys/health
        alb.ingress.kubernetes.io/target-group-attributes: deregistration_delay.timeout_seconds=0
        alb.ingress.kubernetes.io/healthcheck-timeout-seconds: '5'
        alb.ingress.kubernetes.io/healthy-threshold-count: '2'
        alb.ingress.kubernetes.io/unhealthy-threshold-count: '2'
        alb.ingress.kubernetes.io/listen-ports: '[{"HTTP": 80}, {"HTTPS": 443}]'
        alb.ingress.kubernetes.io/actions.ssl-redirect: '{"Type": "redirect", "RedirectConfig": { "Protocol": "HTTPS", "Port": "443", "StatusCode": "HTTP_301"}}'
      hosts:
        - host: ...
          paths:
            - /
    ha:
      enabled: true
      replicas: 5
      raft:
        enabled: true
        config: |
          ui = true
          api_addr = "http://POD_IP:8200"
          listener "tcp" {
            tls_disable = 1
            address = "0.0.0.0:8200"
            cluster_address = "0.0.0.0:8201"
            telemetry {
              unauthenticated_metrics_access = "true"
            }
          }
          storage "raft" {
            path = "/vault/data"
            retry_join {
              auto_join = "provider=k8s label_selector=\"app.kubernetes.io/name=vault,vault-initialized=true,component=server\" namespace=\"{{ .Release.Namespace }}\""
              auto_join_scheme = "http"
            }
          }
          seal "awskms" {
            region     = "ap-northeast-2"
            kms_key_id = "..."
          }
          telemetry {
            prometheus_retention_time = "30s"
            disable_hostname = true
          }
          service_registration "kubernetes" {}

  ui:
    enabled: true

  service_monitor:
    enabled: true

Thanks!

tl;dr: 16 seconds sounds about right for the configuration you posted, judging by the individual settings.

alb.ingress.kubernetes.io/healthcheck-timeout-seconds: '5'

The amount of time, in seconds, during which no response from a target means a failed health check.

alb.ingress.kubernetes.io/unhealthy-threshold-count: '2'

The number of consecutive failed health checks required before considering a target unhealthy.

With a 5-second timeout and 2 consecutive failed checks required (the minimum you can set), plus some time between the node going down and the next health check starting, I would expect something in the 15-second range. Your configuration seems to match what you are experiencing, based on my understanding of and experience with AWS ALBs (I used them in my last job to load balance and fail over our customer-facing apps, though I have not been solely responsible for AWS for a few years).
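To make that arithmetic concrete, here is a rough Python sketch of the detection window. The model is a simplification of my own, not AWS's documented algorithm: probes fire every `interval_s`, a hung probe counts as failed after `timeout_s`, and the target is marked unhealthy after `unhealthy_threshold` consecutive failures. The posted annotations do not set a health check interval, so the 10 seconds used in the usage note is a hypothetical value, and `detection_window` is a made-up helper name.

```python
def detection_window(interval_s, timeout_s, unhealthy_threshold):
    """Estimate (best_case, worst_case) seconds until a dead target is
    marked unhealthy, under the simplified model described above."""
    # Best case: the target dies just before a probe, and every probe
    # fails immediately (for example, connection refused).
    best_case = (unhealthy_threshold - 1) * interval_s
    # Worst case: the target dies just after a successful probe, and
    # every failing probe waits for the full timeout.
    worst_case = unhealthy_threshold * interval_s + timeout_s
    return best_case, worst_case
```

With the posted timeout of 5 and threshold of 2, and an assumed 10-second interval, `detection_window(10, 5, 2)` returns `(10, 25)`. This only covers detecting the failed target; it leaves out Raft leader election and the healthy-threshold checks the new active node has to pass.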

Looking over the AWS NLB settings, the health checks seem similar, with a minimum unhealthy threshold of 2, but you can drop the health check timeout to 2 seconds versus a minimum of 5 seconds on the ALB, so you might be able to get failover down to between 5 and 10 seconds.

The question I would ask is whether changing the load balancer type, plus the extra testing needed to ensure proper operation, is worth the potential ~6 seconds saved (I would still expect some variation depending on when a node goes down relative to the next health check). That largely depends on your application's Vault workload.

For example, if you only use the KV secrets engine for periodic retrieval of secrets, 16 seconds may be okay (assuming, of course, you are not constantly taking Vault nodes up and down), but for applications using Transit at high volume the change may be worth it (though the workload would still need to tolerate the ~10-second failover).