Operator init terminated with exit code 143

Full disclosure: this is for a development cluster at my work, but since no one I’ve spoken to seems to know the answer, I’m stuck asking the community… Thanks in advance.

I’m attempting to get Vault running on a Kubernetes cluster on GKE with auto-unseal. I’ve been going through countless tutorials and docs trying to find all the pieces I need for HA with Integrated Storage (Raft) and TLS enabled on the listener and retry_join stanzas.

I wound up using cfssl to generate my certs for the listener, and I think it might be part of my issue (mostly because of a TLS handshake error).

I generated the certs for the listener and applied them as k8s secrets:

# Generate the self-signed CA.
cfssl gencert -initca ./tls/ca-csr.json | cfssljson -bare ./tls/ca

# Generate the server cert, signed by the CA, with the pod DNS names as SANs.
cfssl gencert \
  -ca=./tls/ca.pem \
  -ca-key=./tls/ca-key.pem \
  -config=./tls/ca-config.json \
  -hostname="vault-0.vault-internal,vault-1.vault-internal,vault-2.vault-internal,127.0.0.1" \
  -profile=default \
  ./tls/ca-csr.json | cfssljson -bare ./tls/vault

# Create/update the CA and server-cert TLS secrets in the vault namespace.
kubectl apply -f <(kubectl -n vault \
  create secret tls tls-ca \
  --dry-run=client --output yaml \
  --cert ./tls/ca.pem \
  --key ./tls/ca-key.pem)

kubectl apply -f <(kubectl -n vault \
  create secret tls tls-server \
  --dry-run=client --output yaml \
  --cert ./tls/vault.pem \
  --key ./tls/vault-key.pem)
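
As a sanity check (just me poking at the local files, nothing authoritative), I dumped the SANs and verified the chain before mounting anything:

# Confirm the server cert carries the expected SANs...
openssl x509 -in ./tls/vault.pem -noout -text | grep -A1 'Subject Alternative Name'
# ...and that it actually chains back to the CA.
openssl verify -CAfile ./tls/ca.pem ./tls/vault.pem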

I’ve never had a reason to generate my own CA or certificates before; I’ve always found a way to use Let’s Encrypt or similar, so this process is new to me.

My ca-config.json is:

{
  "signing": {
    "default": {
      "expiry": "175200h"
    },
    "profiles": {
      "default": {
        "usages": ["signing", "key encipherment", "server auth", "client auth"],
        "expiry": "175200h"
      }
    }
  }
}

And my ca-csr.json:

{
  "hosts": [
    "vault-0.vault-internal",
    "vault-1.vault-internal",
    "vault-2.vault-internal",
    "127.0.0.1"
  ],
  "key": {
    "algo": "rsa",
    "size": 2048
  },
  "names": [
    {
      "C": "US",
      "S": "State",
      "O": "Company",
      "OU": "SRE"
    }
  ]
}
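
One thing I’m second-guessing (a hunch, not something I’ve confirmed): cfssl’s -hostname flag overrides the hosts list in the CSR, and neither includes the pods’ full cluster DNS names. If anything addresses the pods that way, the cert would presumably need those SANs too; a hypothetical regeneration (the vault.svc.cluster.local suffix assumes my "vault" namespace and the default cluster domain):

# Hypothetical: re-issue the server cert with cluster-local FQDNs as SANs.
cfssl gencert \
  -ca=./tls/ca.pem \
  -ca-key=./tls/ca-key.pem \
  -config=./tls/ca-config.json \
  -hostname="vault-0.vault-internal,vault-1.vault-internal,vault-2.vault-internal,vault-0.vault-internal.vault.svc.cluster.local,vault-1.vault-internal.vault.svc.cluster.local,vault-2.vault-internal.vault.svc.cluster.local,127.0.0.1" \
  -profile=default \
  ./tls/ca-csr.json | cfssljson -bare ./tls/vault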

My Helm values are:

# Vault Helm Chart Value Overrides
global:
  enabled: true
  tlsDisable: false

injector:
  enabled: true
  # Use the Vault K8s Image https://github.com/hashicorp/vault-k8s/
  image:
    repository: "hashicorp/vault-k8s"
    tag: "1.1.0"

  resources:
    requests:
      memory: 256Mi
      cpu: 250m
    limits:
      memory: 256Mi
      cpu: 250m

server:
  image:
    repository: "hashicorp/vault"
    tag: "1.12.2"

  # These Resource Limits were in line with node requirements in the
  # Vault Reference Architecture for a Small Cluster
  # I lowered the CPU requests and memory limits.
  resources:
    requests:
      memory: 8Gi
      cpu: 1500m
    limits:
      # Max node size we currently have is 12; if we need more, we will need Vault-specific node pools.
      memory: 12Gi
      cpu: 2000m

  # For HA configuration, and because we need to manually init the Vault,
  # we need to define custom readiness/liveness probe settings.
  readinessProbe:
    enabled: true
    path: "/v1/sys/health?standbyok=true&sealedcode=204&uninitcode=204"
  livenessProbe:
    enabled: true
    path: "/v1/sys/health?standbyok=true"
    initialDelaySeconds: 60

  # extraEnvironmentVars is a list of extra environment variables to set with the stateful set. These could be
  # used to include variables required for auto-unseal.
  extraEnvironmentVars:
    VAULT_CACERT: /vault/userconfig/tls-ca/tls.crt

  # extraVolumes is a list of extra volumes to mount. These will be exposed
  # to Vault in the path `/vault/userconfig/<name>/`.
  extraVolumes:
    - type: secret
      name: tls-server
    - type: secret
      name: tls-ca
    - type: secret
      name: kms-credentials

  # This configures the Vault Statefulset to create a PVC for audit logs.
  # See https://www.vaultproject.io/docs/audit/index.html to know more
  auditStorage:
    enabled: true

  standalone:
    enabled: false

  service:
    type: ClusterIP

  # Run Vault in "HA" mode.
  ha:
    enabled: true
    raft:
      enabled: true
      setNodeId: true

      config: |
        ui = true
        listener "tcp" {
          address = "[::]:8200"
          cluster_address = "[::]:8201"
          tls_cert_file = "/vault/userconfig/tls-server/tls.crt"
          tls_key_file = "/vault/userconfig/tls-server/tls.key"
        }

        storage "raft" {
          path = "/vault/data"
          retry_join {
            leader_api_addr = "https://vault-0.vault-internal:8200"
            leader_ca_cert_file = "/vault/userconfig/tls-ca/tls.crt"
            leader_client_cert_file = "/vault/userconfig/tls-server/tls.crt"
            leader_client_key_file = "/vault/userconfig/tls-server/tls.key"
          }
          retry_join {
            leader_api_addr = "https://vault-1.vault-internal:8200"
            leader_ca_cert_file = "/vault/userconfig/tls-ca/tls.crt"
            leader_client_cert_file = "/vault/userconfig/tls-server/tls.crt"
            leader_client_key_file = "/vault/userconfig/tls-server/tls.key"
          }
          retry_join {
            leader_api_addr = "https://vault-2.vault-internal:8200"
            leader_ca_cert_file = "/vault/userconfig/tls-ca/tls.crt"
            leader_client_cert_file = "/vault/userconfig/tls-server/tls.crt"
            leader_client_key_file = "/vault/userconfig/tls-server/tls.key"
          }

          autopilot {
            cleanup_dead_servers = "true"
            last_contact_threshold = "200ms"
            last_contact_failure_threshold = "10m"
            max_trailing_logs = 250000
            min_quorum = 3
            server_stabilization_time = "10s"
          }
        }

        seal "gcpckms" {
          credentials = "/vault/userconfig/kms-credentials/sa.json"
          project = "<project>"
          region = "global"
          key_ring = "<key ring>"
          crypto_key = "<key>"
        }

        service_registration "kubernetes" {}

# Vault UI
ui:
  enabled: true
#  serviceType: "LoadBalancer"
  serviceNodePort: null
  externalPort: 8200

  # For Added Security, edit the below
  #loadBalancerSourceRanges:
  #   - < Your IP RANGE Ex. 10.0.0.0/16 >
  #   - < YOUR SINGLE IP Ex. 1.78.23.3/32 >
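
For completeness, the deploy itself is just the upstream chart with those overrides (release and namespace names are mine):

# Install the official chart into the vault namespace with the values above.
helm repo add hashicorp https://helm.releases.hashicorp.com
helm repo update
helm install vault hashicorp/vault --namespace vault -f values.yaml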

When I first deployed Vault using Helm and tailed the logs for vault-0, it showed a retry-join error saying the Vault was sealed, so I proceeded to initialize it:

$ kubectl exec -it vault-0 -n vault -- vault operator init
    command terminated with exit code 143

This is obviously not what I expected (I was expecting recovery keys, since I used gcpckms for auto-unseal).
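
(For context on what I expected: with an auto-unseal seal, init returns recovery keys instead of unseal keys, and you can even size them explicitly, something like the below; the share/threshold numbers are just illustrative.)

# The recovery-share flags are optional; shown here only for illustration.
kubectl exec -it vault-0 -n vault -- vault operator init \
    -recovery-shares=5 -recovery-threshold=3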

Now, if I check the logs after that, I can see where it appears to have unsealed but is getting an EOF on the TLS handshake:

==> Vault server configuration:

             Api Address: https://10.64.73.30:8200
                     Cgo: disabled
         Cluster Address: https://vault-0.vault-internal:8201
              Go Version: go1.19.3
              Listener 1: tcp (addr: "[::]:8200", cluster address: "[::]:8201", max_request_duration: "1m30s", max_request_size: "33554432", tls: "enabled")
               Log Level: info
                   Mlock: supported: true, enabled: false
           Recovery Mode: false
                 Storage: raft (HA available)
                 Version: Vault v1.12.2, built 2022-11-23T12:53:46Z
             Version Sha: 415e1fe3118eebd5df6cb60d13defdc01aa17b03

==> Vault server started! Log data will stream in below:
2023-01-18T19:29:37.673Z [INFO]  proxy environment: http_proxy="" https_proxy="" no_proxy=""
2023-01-18T19:29:37.673Z [WARN]  storage.raft.fsm: raft FSM db file has wider permissions than needed: needed=-rw------- existing=-rw-rw----
2023-01-18T19:29:37.945Z [INFO]  core: Initializing version history cache for core
2023-01-18T19:29:37.947Z [INFO]  core: raft retry join initiated
2023-01-18T19:29:37.947Z [INFO]  core: stored unseal keys supported, attempting fetch
2023-01-18T19:29:37.983Z [INFO]  core.cluster-listener.tcp: starting listener: listener_address=[::]:8201
2023-01-18T19:29:37.983Z [INFO]  core.cluster-listener: serving cluster requests: cluster_listen_address=[::]:8201
2023-01-18T19:29:37.984Z [INFO]  storage.raft: creating Raft: config="&raft.Config{ProtocolVersion:3, HeartbeatTimeout:15000000000, ElectionTimeout:15000000000, CommitTimeout:50000000, MaxAppendEntries:64, BatchApplyCh:true, ShutdownOnRemove:true, TrailingLogs:0x2800, SnapshotInterval:120000000000, SnapshotThreshold:0x2000, LeaderLeaseTimeout:2500000000, LocalID:\"vault-0\", NotifyCh:(chan<- bool)(0xc000adc1c0), LogOutput:io.Writer(nil), LogLevel:\"DEBUG\", Logger:(*hclog.interceptLogger)(0xc000990c90), NoSnapshotRestoreOnStart:true, skipStartup:false}"
2023-01-18T19:29:37.986Z [INFO]  storage.raft: initial configuration: index=1 servers="[{Suffrage:Voter ID:vault-0 Address:vault-0.vault-internal:8201}]"
2023-01-18T19:29:37.986Z [INFO]  storage.raft: entering follower state: follower="Node at vault-0.vault-internal:8201 [Follower]" leader-address= leader-id=
2023-01-18T19:29:37.987Z [INFO]  core: entering standby mode
2023-01-18T19:29:37.987Z [WARN]  storage.raft: heartbeat timeout reached, starting election: last-leader-addr= last-leader-id=
2023-01-18T19:29:37.987Z [INFO]  storage.raft: entering candidate state: node="Node at vault-0.vault-internal:8201 [Candidate]" term=12
2023-01-18T19:29:37.987Z [INFO]  core: vault is unsealed
2023-01-18T19:29:37.992Z [INFO]  storage.raft: election won: term=12 tally=1
2023-01-18T19:29:37.992Z [INFO]  storage.raft: entering leader state: leader="Node at vault-0.vault-internal:8201 [Leader]"
2023-01-18T19:29:37.999Z [INFO]  core: acquired lock, enabling active operation

2023-01-18T19:30:08.000Z [INFO]  http: TLS handshake error from 10.64.69.20:34142: EOF
2023-01-18T19:30:08.003Z [INFO]  http: TLS handshake error from 10.64.15.237:35992: EOF
2023-01-18T19:30:08.003Z [INFO]  http: TLS handshake error from 10.64.72.17:34230: EOF
2023-01-18T19:30:08.007Z [INFO]  http: TLS handshake error from 10.64.15.237:39660: EOF
2023-01-18T19:30:08.010Z [INFO]  http: TLS handshake error from 10.64.69.20:38614: EOF
2023-01-18T19:30:08.013Z [INFO]  http: TLS handshake error from 10.64.15.237:39644: EOF
2023-01-18T19:30:08.016Z [INFO]  http: TLS handshake error from 10.64.15.237:59774: EOF
2023-01-18T19:30:08.018Z [INFO]  http: TLS handshake error from 10.64.72.17:35624: EOF
2023-01-18T19:30:08.019Z [INFO]  http: TLS handshake error from 10.64.15.237:59762: EOF
==> Vault shutdown triggered
2023-01-18T19:30:52.180Z [INFO]  core: marked as sealed

Hoping someone has seen this or at least has some idea of the cause.

NVM… This has nothing to do with Vault; checking the k8s pod status should have been the first thing I looked at after the logs…
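
For anyone following along, the restart loop was plainly visible once I actually looked:

kubectl get pods -n vault                # shows the restart count climbing
kubectl -n vault describe pod vault-0    # the events below came from here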

Recent pod events:

  Warning  Unhealthy               22m (x2 over 22m)    kubelet                  Liveness probe failed: Get "https://10.64.73.30:8200/v1/sys/health?standbyok=true": net/http: request canceled (Client.Timeout exceeded while awaiting headers)
  Normal   Killing                 22m                  kubelet                  Container vault failed liveness probe, will be restarted
  Warning  Unhealthy               21m                  kubelet                  Readiness probe failed: Get "https://10.64.73.30:8200/v1/sys/health?standbyok=true&sealedcode=204&uninitcode=204": dial tcp 10.64.73.30:8200: connect: connection refused
  Normal   Pulled                  21m (x2 over 23m)    kubelet                  Container image "hashicorp/vault:1.12.2" already present on machine
  Warning  Unhealthy               18m (x34 over 22m)   kubelet                  Readiness probe failed: Get "https://10.64.73.30:8200/v1/sys/health?standbyok=true&sealedcode=204&uninitcode=204": net/http: request canceled (Client.Timeout exceeded while awaiting headers)
  Warning  BackOff                 8m (x25 over 13m)    kubelet                  Back-off restarting failed container
  Warning  Unhealthy               3m7s (x85 over 23m)  kubelet                  Readiness probe failed: Get "https://10.64.73.30:8200/v1/sys/health?standbyok=true&sealedcode=204&uninitcode=204": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)

Another forum search and I found what I somehow missed the first time… TLS handshake error - #7 by jpolania

Sorry for the relatively pointless post :joy:

Edit:
I’m not sure the linked post is my answer, because looking at the events, the health checks are being issued over https. Could this be related to the cert after all?

Also, there’s still the matter of my recovery keys… So please, any thoughts would be greatly appreciated lol

I’m convinced at this point this is definitely the liveness check, because of exit code 143: that’s 128 + 15, i.e. the container was killed with SIGTERM (based on GitHub conversations).
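
(A quick way to convince yourself of the 143-is-SIGTERM reading, since a shell reports 128 plus the signal number for a signal-killed child:)

# The child bash SIGTERMs itself; the parent shell observes exit status 143.
bash -c 'kill -s TERM $$'; echo $?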

Based on https://groups.google.com/g/vault-tool/c/E0uFODjMejE/m/QMlmf8bmBAAJ, I tried adding tls_disable_client_certs = "true" to the listener (though this doesn’t seem like a “fix” to me), but it doesn’t seem to have made a difference.
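
In other words, the listener stanza became (only the last line is new relative to my values above):

listener "tcp" {
  address = "[::]:8200"
  cluster_address = "[::]:8201"
  tls_cert_file = "/vault/userconfig/tls-server/tls.crt"
  tls_key_file = "/vault/userconfig/tls-server/tls.key"
  # Stops the server from requesting a client certificate during the handshake.
  tls_disable_client_certs = "true"
}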

You have defined a liveness probe that only gives you a very short time window (about 60-70 seconds) to bring the cluster to a fully operational state before Kubernetes will start killing pods and interfering with your initialisation.

You should not define such a liveness probe on a Vault cluster until after the cluster has been fully initialised and can reliably pass the probe.
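
With this chart that is just a values override; a minimal sketch:

server:
  livenessProbe:
    # Leave disabled until the cluster is initialised and reliably healthy.
    enabled: false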

Not helpful; if the liveness probe were failing because the Vault was still sealed, that wouldn’t be a problem (the probes have query params for exactly this reason). The issue here is TLS, not whether Vault is initialized; initializing Vault and adding the probe back won’t solve anything, because the probes will continue to fail for the same reason…

Go double-check the URL in the liveness probe in the Helm values you posted. It doesn’t have a query param relating to being sealed.
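
Compare it with your readiness probe: if being sealed should not be fatal, the liveness path needs the same query params, along the lines of:

livenessProbe:
  enabled: true
  # sealedcode/uninitcode downgrade "sealed"/"uninitialised" to a passing 204.
  path: "/v1/sys/health?standbyok=true&sealedcode=204&uninitcode=204"
  initialDelaySeconds: 60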

You’re missing my point: I could update it if it were the problem (I haven’t bothered yet because I know it’s not). As stated repeatedly, the probes are failing because of TLS issues, not because the Vault is sealed…

Edit: sorry, rereading this and the last comment, I don’t mean to come off as rude. I appreciate that you’re trying to help; the thing is, I can tell the reason the probes are failing is TLS, so the actual status code doesn’t matter (at the moment). In the event it becomes the problem, I can and will update those query params.

Edit 2: I should mention that I disabled the liveness (and readiness) probes to get past this (per some separate advice). I would love to re-enable them if possible, but that requires solving the reason they were failing.
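
If I do get the TLS issue sorted and re-enable them, I’ll probably loosen the timings too. As far as I can tell the chart exposes the usual probe knobs; the numbers here are guesses on my part, not recommendations:

server:
  livenessProbe:
    enabled: true
    path: "/v1/sys/health?standbyok=true&sealedcode=204&uninitcode=204"
    initialDelaySeconds: 60
    # More forgiving than the defaults while the cluster settles.
    timeoutSeconds: 5
    failureThreshold: 5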