I recently upgraded my self-hosted Boundary setup, running on an EKS cluster, from version 0.12.1 to 0.14.3.
Post-upgrade, my workers have started to experience TLS handshake errors that were not present on 0.12.1.
Additionally, the UI now displays the worker type as PKI, whereas my setup is configured to use KMS for worker authentication.
I destroyed everything and did a clean deployment on 0.14.3, but I still have the same issue.
I am using an NLB in front of the workers.
Below is my configuration for both the controller and the workers:
Controller:
disable_mlock = true
log_format = "standard"
controller {
  name = "kubernetes-controller"
  description = "Boundary kubernetes-controller"
  database {
    url = "postgresqlurl"
  }
  public_cluster_addr = "boundary.boundary:9201"
}

listener "tcp" {
  address = "0.0.0.0"
  purpose = "api"
  tls_disable = true
}

listener "tcp" {
  address = "0.0.0.0"
  purpose = "cluster"
}

listener "tcp" {
  address = "0.0.0.0"
  purpose = "ops"
  tls_disable = true
}

kms "aead" {
  purpose = "root"
  key_id = "global_root"
  aead_type = "aes-gcm"
  key = "rootkey"
}

kms "aead" {
  purpose = "worker-auth"
  key_id = "global_worker-auth"
  aead_type = "aes-gcm"
  key = "workerkey"
}

kms "aead" {
  purpose = "recovery"
  key_id = "global_recovery"
  aead_type = "aes-gcm"
  key = "recoverykey"
}
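The key values above are redacted. For completeness, each aead key in the real config is a base64-encoded 32-byte value, and the worker-auth key is byte-identical between the controller and the workers; the shape is:

kms "aead" {
  purpose   = "worker-auth"
  key_id    = "global_worker-auth"
  aead_type = "aes-gcm"
  # Example format only, not my real key; this value must match the
  # worker's worker-auth key byte for byte.
  key = "OLFhJNbEg3JfOSJpjM1vOLmLbM3geAOFF4nf4kl0s1A="
}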
Worker:
disable_mlock = true
log_format = "standard"
worker {
  name = "kubernetes-worker"
  description = "Boundary kubernetes-worker"
  initial_upstreams = ["boundary.boundary:9201"]
  public_addr = "mypublicaddr.com"
}

listener "tcp" {
  address = "0.0.0.0"
  purpose = "proxy"
  tls_disable = true
}

kms "aead" {
  purpose = "worker-auth"
  key_id = "global_worker-auth"
  aead_type = "aes-gcm"
  key = "workerkey"
}
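Every replica in my Deployment gets this exact file, so all pods share the worker name "kubernetes-worker". A variant I'm considering is giving each pod a unique name from its hostname; this is only a sketch, assuming Boundary's env:// value sourcing applies to the name field (Kubernetes sets HOSTNAME to the pod name):

worker {
  # Sketch: derive a unique worker name per pod from the pod hostname
  # (e.g. boundary-worker-66445d86b6-x9gd9). Assumes env:// sourcing
  # works for this field.
  name = "env://HOSTNAME"
  description = "Boundary kubernetes-worker"
  initial_upstreams = ["boundary.boundary:9201"]
  public_addr = "mypublicaddr.com"
}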
- The workers exhibit TLS handshake errors and fail to connect to the controller; the logs show "transport: Error while dialing" and "error handshaking tls connection: remote error: tls: internal error".
- The Boundary UI now reports the worker type as PKI instead of KMS.
Error logs below:
{"id":"lQtOitqwAx","source":"https://hashicorp.com/boundary/boundary-worker-66445d86b6-x9gd9/worker","specversion":"1.0","type":"error","data":{"error":"worker.rotateWorkerAuth: unknown, unknown: error #0: rpc error: code = Unavailable desc = last connection error: connection error: desc = \"transport: Error while dialing: worker.(Worker).upstreamDialerFunc: unknown, unknown: error #0: (nodeenrollment.protocol.Dial) errors encountered attempting to create client tls connection: 1 error occurred:\\n\\t* error handshaking tls connection: remote error: tls: internal error\\n\\n\"","error_fields":{"Code":0,"Msg":"","Op":"worker.rotateWorkerAuth","Wrapped":{}},"id":"e_nxof15EDWA","version":"v0.1","op":"worker.rotateWorkerAuth"},"datacontentype":"application/cloudevents","time":"2024-02-03T01:38:58.625019283Z"}
{"id":"LK1uKWQTNI","source":"https://hashicorp.com/boundary/boundary-worker-66445d86b6-x9gd9/worker","specversion":"1.0","type":"error","data":{"error":"worker.rotateWorkerAuth: unknown, unknown: error #0: rpc error: code = Unavailable desc = last connection error: connection error: desc = \"transport: Error while dialing: worker.(Worker).upstreamDialerFunc: unknown, unknown: error #0: (nodeenrollment.protocol.Dial) errors encountered attempting to create client tls connection: 1 error occurred:\\n\\t* error handshaking tls connection: remote error: tls: internal error\\n\\n\"","error_fields":{"Code":0,"Msg":"","Op":"worker.rotateWorkerAuth","Wrapped":{}},"id":"e_7xsqFYF6zu","version":"v0.1","op":"worker.(Worker).startAuthRotationTicking"},"datacontentype":"application/cloudevents","time":"2024-02-03T01:38:58.625150281Z"}
{"id":"Ragdo0DSxX","source":"https://hashicorp.com/boundary/boundary-worker-66445d86b6-x9gd9/worker","specversion":"1.0","type":"error","data":{"error":"rpc error: code = Unavailable desc = last connection error: connection error: desc = \"transport: Error while dialing: worker.(Worker).upstreamDialerFunc: unknown, unknown: error #0: (nodeenrollment.protocol.Dial) errors encountered attempting to create client tls connection: 1 error occurred:\\n\\t* error handshaking tls connection: remote error: tls: internal error\\n\\n\"","error_fields":{},"id":"e_LJfvuON6tt","version":"v0.1","op":"worker.(Worker).sendWorkerStatus","info":{"msg":"error making status request to controller"}},"datacontentype":"application/cloudevents","time":"2024-02-03T01:38:58.879425205Z"}
{"id":"pozsmYnkwu","source":"https://hashicorp.com/boundary/boundary-worker-66445d86b6-x9gd9/worker","specversion":"1.0","type":"error","data":{"error":"status error grace period has expired, canceling all sessions on worker","error_fields":{},"id":"e_GNNuvOqBuI","version":"v0.1","op":"worker.(Worker).sendWorkerStatus","info":{"grace_period":15000000000,"last_status_time":"2024-02-03 00:41:43.213348054 +0000 UTC m=+8.620880127"}},"datacontentype":"application/cloudevents","time":"2024-02-03T01:38:58.879560498Z"}
{"id":"4J84sZw49E","source":"https://hashicorp.com/boundary/boundary-worker-66445d86b6-x9gd9/worker","specversion":"1.0","type":"error","data":{"error":"(nodeenrollment.protocol.Dial) errors encountered attempting to create client tls connection: 1 error occurred:\n\t* error handshaking tls connection: remote error: tls: internal error\n\n","error_fields":{},"id":"e_2trL7wknRf","version":"v0.1","op":"worker.(Worker).upstreamDialerFunc"},"datacontentype":"application/cloudevents","time":"2024-02-03T01:38:59.627931035Z"}
Is there anything I am missing in my configuration files between these versions that would cause this issue?
EDIT: It looks like I only get this issue when multiple workers are running in the Deployment; with a single worker there is no TLS error. Note that every replica shares the worker name "kubernetes-worker" from the config above. I still can't figure out why the version update causes the TLS issue with multiple worker pods.
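To isolate whether it is the shared name, rather than the worker count, that triggers the handshake failure, my next test is to run a second pod whose config differs only in the name (sketch; the second name is arbitrary):

worker {
  name = "kubernetes-worker-2"  # only difference from the first pod
  description = "Boundary kubernetes-worker"
  initial_upstreams = ["boundary.boundary:9201"]
  public_addr = "mypublicaddr.com"
}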