Full disclosure: this is for a development cluster at my work, but since no one I’ve spoken to seems to know the answer, I’m stuck asking the community… Thanks in advance.
I’m attempting to get Vault running on a Kubernetes cluster on GKE with auto-unseal. I’ve been going through countless tutorials and docs trying to find all the pieces I need for HA with Integrated Storage (Raft) and TLS enabled on both the listener and the retry_join stanzas.
I wound up using cfssl to generate the certs for the listener, and I suspect it may be part of my issue (mostly because of a TLS handshake error).
I generated the certs for the listener and applied them as k8s secrets:
cfssl gencert -initca ./tls/ca-csr.json | cfssljson -bare ./tls/ca
cfssl gencert \
  -ca=./tls/ca.pem \
  -ca-key=./tls/ca-key.pem \
  -config=./tls/ca-config.json \
  -hostname="vault-0.vault-internal,vault-1.vault-internal,vault-2.vault-internal,127.0.0.1" \
  -profile=default \
  ./tls/ca-csr.json | cfssljson -bare ./tls/vault
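In case it matters, this is how I’ve been sanity-checking the chain and the SANs locally (the -ext flag needs OpenSSL 1.1.1 or newer):

# check the leaf cert chains back to the generated CA
openssl verify -CAfile ./tls/ca.pem ./tls/vault.pem

# inspect subject, issuer, and SANs on the leaf cert
openssl x509 -in ./tls/vault.pem -noout -subject -issuer -ext subjectAltName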
kubectl apply -f <(kubectl -n vault \
  create secret tls tls-ca \
  --dry-run=client --output yaml \
  --cert ./tls/ca.pem \
  --key ./tls/ca-key.pem)

kubectl apply -f <(kubectl -n vault \
  create secret tls tls-server \
  --dry-run=client --output yaml \
  --cert ./tls/vault.pem \
  --key ./tls/vault-key.pem)
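And to rule out a mix-up in the secrets themselves, this is how I’ve been dumping the server cert back out of the cluster to confirm it’s the one I think it is:

kubectl -n vault get secret tls-server \
  -o go-template='{{ index .data "tls.crt" }}' \
  | base64 -d \
  | openssl x509 -noout -subject -ext subjectAltName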
I’ve never had a reason to generate my own CA or certificates before; I’ve always found a way to use Let’s Encrypt or similar, so this process is new to me.
My ca-config.json and ca-csr.json are as follows:
{
  "signing": {
    "default": {
      "expiry": "175200h"
    },
    "profiles": {
      "default": {
        "usages": ["signing", "key encipherment", "server auth", "client auth"],
        "expiry": "175200h"
      }
    }
  }
}
{
  "hosts": [
    "vault-0.vault-internal",
    "vault-1.vault-internal",
    "vault-2.vault-internal",
    "127.0.0.1"
  ],
  "key": {
    "algo": "rsa",
    "size": 2048
  },
  "names": [
    {
      "C": "US",
      "ST": "State",
      "O": "Company",
      "OU": "SRE"
    }
  ]
}
My Helm values are:
# Vault Helm Chart Value Overrides
global:
  enabled: true
  tlsDisable: false

injector:
  enabled: true
  # Use the Vault K8s Image https://github.com/hashicorp/vault-k8s/
  image:
    repository: "hashicorp/vault-k8s"
    tag: "1.1.0"

  resources:
    requests:
      memory: 256Mi
      cpu: 250m
    limits:
      memory: 256Mi
      cpu: 250m

server:
  image:
    repository: "hashicorp/vault"
    tag: "1.12.2"

  # These resource limits were in line with the node requirements in the
  # Vault Reference Architecture for a small cluster;
  # I lowered the CPU requests and memory limits.
  resources:
    requests:
      memory: 8Gi
      cpu: 1500m
    limits:
      # Max node size we currently have is 12; if we need more, we will need Vault-specific node pools.
      memory: 12Gi
      cpu: 2000m

  # For the HA configuration, and because we need to manually init the Vault,
  # we need to define custom readiness/liveness probe settings.
  readinessProbe:
    enabled: true
    path: "/v1/sys/health?standbyok=true&sealedcode=204&uninitcode=204"
  livenessProbe:
    enabled: true
    path: "/v1/sys/health?standbyok=true"
    initialDelaySeconds: 60

  # extraEnvironmentVars is a list of extra environment variables to set in the
  # stateful set. These could be used to include variables required for auto-unseal.
  extraEnvironmentVars:
    VAULT_CACERT: /vault/userconfig/tls-ca/tls.crt

  # extraVolumes is a list of extra volumes to mount. These will be exposed
  # to Vault in the path `/vault/userconfig/<name>/`.
  extraVolumes:
    - type: secret
      name: tls-server
    - type: secret
      name: tls-ca
    - type: secret
      name: kms-credentials

  # This configures the Vault StatefulSet to create a PVC for audit logs.
  # See https://www.vaultproject.io/docs/audit/index.html to learn more.
  auditStorage:
    enabled: true

  standalone:
    enabled: false

  service:
    type: ClusterIP

  # Run Vault in "HA" mode.
  ha:
    enabled: true
    raft:
      enabled: true
      setNodeId: true

      config: |
        ui = true

        listener "tcp" {
          address = "[::]:8200"
          cluster_address = "[::]:8201"
          tls_cert_file = "/vault/userconfig/tls-server/tls.crt"
          tls_key_file = "/vault/userconfig/tls-server/tls.key"
        }

        storage "raft" {
          path = "/vault/data"

          retry_join {
            leader_api_addr = "https://vault-0.vault-internal:8200"
            leader_ca_cert_file = "/vault/userconfig/tls-ca/tls.crt"
            leader_client_cert_file = "/vault/userconfig/tls-server/tls.crt"
            leader_client_key_file = "/vault/userconfig/tls-server/tls.key"
          }

          retry_join {
            leader_api_addr = "https://vault-1.vault-internal:8200"
            leader_ca_cert_file = "/vault/userconfig/tls-ca/tls.crt"
            leader_client_cert_file = "/vault/userconfig/tls-server/tls.crt"
            leader_client_key_file = "/vault/userconfig/tls-server/tls.key"
          }

          retry_join {
            leader_api_addr = "https://vault-2.vault-internal:8200"
            leader_ca_cert_file = "/vault/userconfig/tls-ca/tls.crt"
            leader_client_cert_file = "/vault/userconfig/tls-server/tls.crt"
            leader_client_key_file = "/vault/userconfig/tls-server/tls.key"
          }

          autopilot {
            cleanup_dead_servers = "true"
            last_contact_threshold = "200ms"
            last_contact_failure_threshold = "10m"
            max_trailing_logs = 250000
            min_quorum = 3
            server_stabilization_time = "10s"
          }
        }

        seal "gcpckms" {
          credentials = "/vault/userconfig/kms-credentials/sa.json"
          project = "<project>"
          region = "global"
          key_ring = "<key ring>"
          crypto_key = "<key>"
        }

        service_registration "kubernetes" {}

# Vault UI
ui:
  enabled: true
  # serviceType: "LoadBalancer"
  serviceNodePort: null
  externalPort: 8200

  # For added security, edit the below:
  # loadBalancerSourceRanges:
  #   - < Your IP RANGE Ex. 10.0.0.0/16 >
  #   - < YOUR SINGLE IP Ex. 1.78.23.3/32 >
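For completeness, I’m installing with the official chart, roughly like this (the release name and the values file name are just what I happen to use):

helm repo add hashicorp https://helm.releases.hashicorp.com
helm install vault hashicorp/vault -n vault -f values.yaml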
When I first deployed Vault using Helm and tailed the logs for vault-0, it showed a raft retry-join error saying the Vault was sealed, so I proceeded to initialize it:
$ kubectl exec -it vault-0 -n vault -- vault operator init
command terminated with exit code 143
This is obviously not what I expected; since I used gcpckms auto-unseal, I was expecting the init to return recovery keys. If I’m reading it right, exit code 143 is 128 + 15 (SIGTERM), so it looks like the container was killed partway through rather than the init command failing on its own.
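Since that points at something killing the pod, this is how I’ve been checking whether the kubelet’s probes are responsible (failed probes should show up in the pod’s events):

kubectl -n vault describe pod vault-0
kubectl -n vault get events --sort-by=.lastTimestamp | grep vault-0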
If I check the logs after that attempt, I can see where it appears to have unsealed, but it’s getting EOFs on TLS handshakes:
==> Vault server configuration:
Api Address: https://10.64.73.30:8200
Cgo: disabled
Cluster Address: https://vault-0.vault-internal:8201
Go Version: go1.19.3
Listener 1: tcp (addr: "[::]:8200", cluster address: "[::]:8201", max_request_duration: "1m30s", max_request_size: "33554432", tls: "enabled")
Log Level: info
Mlock: supported: true, enabled: false
Recovery Mode: false
Storage: raft (HA available)
Version: Vault v1.12.2, built 2022-11-23T12:53:46Z
Version Sha: 415e1fe3118eebd5df6cb60d13defdc01aa17b03
==> Vault server started! Log data will stream in below:
2023-01-18T19:29:37.673Z [INFO] proxy environment: http_proxy="" https_proxy="" no_proxy=""
2023-01-18T19:29:37.673Z [WARN] storage.raft.fsm: raft FSM db file has wider permissions than needed: needed=-rw------- existing=-rw-rw----
2023-01-18T19:29:37.945Z [INFO] core: Initializing version history cache for core
2023-01-18T19:29:37.947Z [INFO] core: raft retry join initiated
2023-01-18T19:29:37.947Z [INFO] core: stored unseal keys supported, attempting fetch
2023-01-18T19:29:37.983Z [INFO] core.cluster-listener.tcp: starting listener: listener_address=[::]:8201
2023-01-18T19:29:37.983Z [INFO] core.cluster-listener: serving cluster requests: cluster_listen_address=[::]:8201
2023-01-18T19:29:37.984Z [INFO] storage.raft: creating Raft: config="&raft.Config{ProtocolVersion:3, HeartbeatTimeout:15000000000, ElectionTimeout:15000000000, CommitTimeout:50000000, MaxAppendEntries:64, BatchApplyCh:true, ShutdownOnRemove:true, TrailingLogs:0x2800, SnapshotInterval:120000000000, SnapshotThreshold:0x2000, LeaderLeaseTimeout:2500000000, LocalID:\"vault-0\", NotifyCh:(chan<- bool)(0xc000adc1c0), LogOutput:io.Writer(nil), LogLevel:\"DEBUG\", Logger:(*hclog.interceptLogger)(0xc000990c90), NoSnapshotRestoreOnStart:true, skipStartup:false}"
2023-01-18T19:29:37.986Z [INFO] storage.raft: initial configuration: index=1 servers="[{Suffrage:Voter ID:vault-0 Address:vault-0.vault-internal:8201}]"
2023-01-18T19:29:37.986Z [INFO] storage.raft: entering follower state: follower="Node at vault-0.vault-internal:8201 [Follower]" leader-address= leader-id=
2023-01-18T19:29:37.987Z [INFO] core: entering standby mode
2023-01-18T19:29:37.987Z [WARN] storage.raft: heartbeat timeout reached, starting election: last-leader-addr= last-leader-id=
2023-01-18T19:29:37.987Z [INFO] storage.raft: entering candidate state: node="Node at vault-0.vault-internal:8201 [Candidate]" term=12
2023-01-18T19:29:37.987Z [INFO] core: vault is unsealed
2023-01-18T19:29:37.992Z [INFO] storage.raft: election won: term=12 tally=1
2023-01-18T19:29:37.992Z [INFO] storage.raft: entering leader state: leader="Node at vault-0.vault-internal:8201 [Leader]"
2023-01-18T19:29:37.999Z [INFO] core: acquired lock, enabling active operation
2023-01-18T19:30:08.000Z [INFO] http: TLS handshake error from 10.64.69.20:34142: EOF
2023-01-18T19:30:08.003Z [INFO] http: TLS handshake error from 10.64.15.237:35992: EOF
2023-01-18T19:30:08.003Z [INFO] http: TLS handshake error from 10.64.72.17:34230: EOF
2023-01-18T19:30:08.007Z [INFO] http: TLS handshake error from 10.64.15.237:39660: EOF
2023-01-18T19:30:08.010Z [INFO] http: TLS handshake error from 10.64.69.20:38614: EOF
2023-01-18T19:30:08.013Z [INFO] http: TLS handshake error from 10.64.15.237:39644: EOF
2023-01-18T19:30:08.016Z [INFO] http: TLS handshake error from 10.64.15.237:59774: EOF
2023-01-18T19:30:08.018Z [INFO] http: TLS handshake error from 10.64.72.17:35624: EOF
2023-01-18T19:30:08.019Z [INFO] http: TLS handshake error from 10.64.15.237:59762: EOF
==> Vault shutdown triggered
2023-01-18T19:30:52.180Z [INFO] core: marked as sealed
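In case it helps with the handshake EOFs, this is how I’ve been trying to poke at the listener from a peer pod (busybox wget ships in the Vault image; --no-check-certificate is just to see whether the handshake itself completes):

kubectl -n vault exec vault-1 -- \
  wget -qO- --no-check-certificate \
  "https://vault-0.vault-internal:8200/v1/sys/health?standbyok=true&sealedcode=204&uninitcode=204"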
Hoping someone has seen this or at least has some idea of the cause.