Vault TLS errors preventing Vault from going into active mode

Hello,

We are encountering a strange problem with our Vault cluster: Vault does not go into active mode and throws TLS errors, and I'm at a bit of a loss as to what is going on. The cluster uses AWS DynamoDB as the storage backend, with Consul as the HA backend. The server startup banner is as follows:

==> Vault server configuration:

       AWS KMS KeyID: <KMS_ID>
      AWS KMS Region: us-east-1
          HA Storage: consul
           Seal Type: awskms
         Api Address: https://<address>:8200
                 Cgo: disabled
     Cluster Address: https://<address>:8201
          Listener 1: tcp (addr: "172.21.32.10:8200", cluster address: "172.21.32.10:8201", max_request_duration: "1m30s", max_request_size: "33554432", tls: "enabled")
           Log Level: debug
               Mlock: supported: true, enabled: false
       Recovery Mode: false
             Storage: dynamodb
             Version: Vault v1.3.3

The TLS error we are getting is as follows:

2021-04-28T15:25:26.043-0400 [INFO] proxy environment: http_proxy= https_proxy= no_proxy=
2021-04-28T15:25:26.155-0400 [DEBUG] config path set: path=vault
2021-04-28T15:25:26.155-0400 [WARN] appending trailing forward slash to path
2021-04-28T15:25:26.155-0400 [DEBUG] config disable_registration set: disable_registration=false
2021-04-28T15:25:26.155-0400 [DEBUG] config service set: service=vault
2021-04-28T15:25:26.155-0400 [DEBUG] config service_tags set: service_tags=
2021-04-28T15:25:26.155-0400 [DEBUG] config service_address set: service_address=
2021-04-28T15:25:26.155-0400 [DEBUG] config address set: address=127.0.0.1:8500
2021-04-28T15:25:26.155-0400 [DEBUG] storage.cache: creating LRU cache: size=0
2021-04-28T15:25:26.156-0400 [DEBUG] cluster listener addresses synthesized: cluster_addresses=[172.21.32.10:8201]
2021-04-28T15:25:26.162-0400 [INFO] core: stored unseal keys supported, attempting fetch
2021-04-28T15:25:26.194-0400 [DEBUG] core: unseal key supplied
2021-04-28T15:25:26.204-0400 [DEBUG] core: starting cluster listeners
2021-04-28T15:25:26.204-0400 [INFO] core.cluster-listener: starting listener: listener_address=172.21.32.10:8201
2021-04-28T15:25:26.204-0400 [INFO] core.cluster-listener: serving cluster requests: cluster_listen_address=172.21.32.10:8201
2021-04-28T15:25:26.204-0400 [INFO] core: entering standby mode
2021-04-28T15:25:26.207-0400 [INFO] core: vault is unsealed
2021-04-28T15:25:26.207-0400 [INFO] core: unsealed with stored keys: stored_keys_used=1
2021-04-28T15:25:26.740-0400 [DEBUG] core: parsing information for new active node: active_cluster_addr=https://:8201 active_redirect_addr=https://:8200
2021-04-28T15:25:26.740-0400 [DEBUG] core: refreshing forwarding connection
2021-04-28T15:25:26.740-0400 [DEBUG] core: clearing forwarding clients
2021-04-28T15:25:26.740-0400 [DEBUG] core: done clearing forwarding clients
2021-04-28T15:25:26.740-0400 [DEBUG] core: done refreshing forwarding connection
2021-04-28T15:25:26.740-0400 [DEBUG] core: creating rpc dialer: host=fw-c9349236-9c5d-5c26-13c1-1a1cce4bd848
2021-04-28T15:25:26.745-0400 [WARN] core.cluster-listener: no TLS config found for ALPN: ALPN=[req_fw_sb-act_v1]
2021-04-28T15:25:26.745-0400 [DEBUG] core.cluster-listener: error handshaking cluster connection: error="unsupported protocol"
2021-04-28T15:25:26.745-0400 [ERROR] core: error during forwarded RPC request: error="rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = \"transport: Error while dialing remote error: tls: internal error\""
2021-04-28T15:25:26.745-0400 [ERROR] core: forward request error: error="error during forwarding RPC request"
2021-04-28T15:25:26.746-0400 [DEBUG] core: forwarding: error sending echo request to active node: error="rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = \"transport: Error while dialing remote error: tls: internal error\""
2021-04-28T15:25:26.819-0400 [ERROR] core: error during forwarded RPC request: error="rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = \"transport: Error while dialing remote error: tls: internal error\""

The Consul client log shows this:

2021/04/28 14:54:59 [INFO] agent: (LAN) joined: 5 Err:
2021/04/28 14:54:59 [INFO] agent: Join LAN completed. Synced with 5 initial agents
2021/04/28 14:55:01 [INFO] agent: Synced node info
2021/04/28 14:55:20 [INFO] agent: Synced service "vault:8200"
2021/04/28 14:55:20 [INFO] agent: Synced check "vault:8200:vault-sealed-check"
2021/04/28 14:55:20 [INFO] agent: Synced check "vault:8200:vault-sealed-check"
2021/04/28 14:56:33 [INFO] agent: Deregistered service "vault:8200"
2021/04/28 14:56:34 [INFO] agent: Deregistered check "vault:8200:vault-sealed-check"
2021/04/28 14:59:34 [ERR] http: Request PUT /v1/agent/check/pass/vault:8200:vault-sealed-check?note=Vault+Unsealed, error: CheckID "vault:8200:vault-sealed-check" does not have associated TTL from=127.0.0.1:57098
2021/04/28 14:59:34 [INFO] agent: Synced service "vault:8200"
2021/04/28 14:59:34 [INFO] agent: Synced check "vault:8200:vault-sealed-check"
2021/04/28 14:59:35 [INFO] agent: Synced check "vault:8200:vault-sealed-check"

The TLS certs we use appear to be okay (we generated them with Vault).
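For reference, this is the kind of check we ran against the listener cert (against the real file that would be `openssl x509 -in /etc/vault.d/tls/vault.crt -noout -subject -dates -ext subjectAltName`; the snippet below first builds a throwaway cert with the hostname and IP from our config so it is copy-pastable on its own):

```shell
# Build a throwaway cert with the SANs we expect on the listener cert
# (hostname/IP taken from the config later in this post; requires OpenSSL 1.1.1+)
openssl req -x509 -newkey rsa:2048 -nodes -days 1 \
  -keyout /tmp/vault-example.key -out /tmp/vault-example.crt \
  -subj "/CN=vault.service.awseast.consulstage" \
  -addext "subjectAltName=DNS:vault.service.awseast.consulstage,IP:172.21.32.10"

# Inspect subject, validity window, and SANs
openssl x509 -in /tmp/vault-example.crt -noout -subject -dates -ext subjectAltName
```

The SAN list should cover whatever names/IPs the other nodes use to reach this node, including the cluster port traffic on 8201.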

I know our version, v1.3.3, is a bit old, but we run this same version in other environments with no issues.

Has anybody come across this before?

I should probably include my config, which is as follows:

cat /etc/vault.d/vault_main.hcl
cluster_name = "awseast"
max_lease_ttl = "768h"
default_lease_ttl = "768h"
api_addr = "https://vault.service.awseast.consulstage:8200"
#api_addr = "https://172.21.32.10:8200"
disable_mlock = true
ui = true

listener "tcp" {
  address = "172.21.32.10:8200"
  tls_cert_file = "/etc/vault.d/tls/vault.crt"
  tls_key_file = "/etc/vault.d/tls/vault.key"
  tls_min_version = "tls12"
  tls_cipher_suites = "TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA,TLS_ECDHE_ECDSA_WITH_AES_128_CBC_SHA,TLS_ECDHE_RSA_WITH_AES_256_CBC_SHA,TLS_ECDHE_ECDSA_WITH_AES_256_CBC_SHA,TLS_RSA_WITH_AES_128_GCM_SHA256,TLS_RSA_WITH_AES_256_GCM_SHA384,TLS_RSA_WITH_AES_128_CBC_SHA,TLS_RSA_WITH_AES_256_CBC_SHA"
  tls_prefer_server_cipher_suites = "false"
  tls_disable = "false"
}

storage "dynamodb" {
  ha_enabled = "true"
  region = "us-east-1"
  table = "<table_name>"
}

ha_storage "consul" {
  address = "127.0.0.1:8500"
  path = "vault"
}

seal "awskms" {
  region = "us-east-1"
  kms_key_id = "<KMS_ID>"
}
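One detail I notice in the standby log above is active_cluster_addr=https://:8201, i.e. the host portion is empty, which suggests the node that took the HA lock advertised an empty address. Our config sets api_addr but not cluster_addr; I believe the explicit per-node form looks like this (values here are illustrative, taken from this node's listener address, and would differ on each node):

```hcl
# Per-node addresses, set explicitly on each node (illustrative values).
# cluster_addr is the node-to-node forwarding address the other nodes dial,
# so the host portion must not be empty.
api_addr     = "https://172.21.32.10:8200"
cluster_addr = "https://172.21.32.10:8201"
```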

Is this the load-balancer URL (one that round-robins across each of your 3 nodes), or a URL unique to the node?
I'm assuming you have a 3-node cluster.