TLS Handshake Error and Worker Type Change After Upgrading to Version 0.14.3

I recently upgraded my self-hosted Boundary setup running on an EKS cluster from version 0.12.1 to 0.14.3.

Post-upgrade, my worker nodes have started to experience TLS handshake errors that were not present on version 0.12.1.

Additionally, when I check the workers, the UI now displays the worker type as PKI, whereas my initial setup was configured to use KMS authentication.

I destroyed the deployment and did a clean install of version 0.14.3, but I'm still having the same issue.

I am using an NLB in front of the worker nodes.

Below are my configurations for both the controller and the workers:

Controller:

disable_mlock = true
log_format    = "standard"

controller {
  name        = "kubernetes-controller"
  description = "Boundary kubernetes-controller"
  database {
    url = "postgresqlurl"
  }
  public_cluster_addr = "boundary.boundary:9201"
}

listener "tcp" {
  address     = "0.0.0.0"
  purpose     = "api"
  tls_disable = true
}
listener "tcp" {
  address     = "0.0.0.0"
  purpose     = "cluster"
}
listener "tcp" {
  address     = "0.0.0.0"
  purpose     = "ops"
  tls_disable = true
}

kms "aead" {
    purpose   = "root"
    key_id    = "global_root"
    aead_type = "aes-gcm"
    key       = "rootkey"
}
kms "aead" {
    purpose   = "worker-auth"
    key_id    = "global_worker-auth"
    aead_type = "aes-gcm"
    key       = "workerkey"
}
kms "aead" {
    purpose   = "recovery"
    key_id    = "global_recovery"
    aead_type = "aes-gcm"
    key       = "recoverykey"
}

Worker:

disable_mlock = true
log_format    = "standard"

worker {
  name        = "kubernetes-worker"
  description = "Boundary kubernetes-worker"
  initial_upstreams = ["boundary.boundary:9201"]
  public_addr = "mypublicaddr.com"
}

listener "tcp" {
  address     = "0.0.0.0"
  purpose     = "proxy"
  tls_disable = true
}

kms "aead" {
    purpose   = "worker-auth"
    key_id    = "global_worker-auth"
    aead_type = "aes-gcm"
    key       = "workerkey"
}
To summarize the symptoms:

  • The workers exhibit TLS handshake errors and fail to connect to the controller, with error logs indicating issues like “transport: Error while dialing” and “error handshaking tls connection: remote error: tls: internal error”.
  • The Boundary UI now reports the workers’ authentication type as PKI instead of KMS.

Error logs below:

{"id":"lQtOitqwAx","source":"https://hashicorp.com/boundary/boundary-worker-66445d86b6-x9gd9/worker","specversion":"1.0","type":"error","data":{"error":"worker.rotateWorkerAuth: unknown, unknown: error #0: rpc error: code = Unavailable desc = last connection error: connection error: desc = \"transport: Error while dialing: worker.(Worker).upstreamDialerFunc: unknown, unknown: error #0: (nodeenrollment.protocol.Dial) errors encountered attempting to create client tls connection: 1 error occurred:\\n\\t* error handshaking tls connection: remote error: tls: internal error\\n\\n\"","error_fields":{"Code":0,"Msg":"","Op":"worker.rotateWorkerAuth","Wrapped":{}},"id":"e_nxof15EDWA","version":"v0.1","op":"worker.rotateWorkerAuth"},"datacontentype":"application/cloudevents","time":"2024-02-03T01:38:58.625019283Z"}
{"id":"LK1uKWQTNI","source":"https://hashicorp.com/boundary/boundary-worker-66445d86b6-x9gd9/worker","specversion":"1.0","type":"error","data":{"error":"worker.rotateWorkerAuth: unknown, unknown: error #0: rpc error: code = Unavailable desc = last connection error: connection error: desc = \"transport: Error while dialing: worker.(Worker).upstreamDialerFunc: unknown, unknown: error #0: (nodeenrollment.protocol.Dial) errors encountered attempting to create client tls connection: 1 error occurred:\\n\\t* error handshaking tls connection: remote error: tls: internal error\\n\\n\"","error_fields":{"Code":0,"Msg":"","Op":"worker.rotateWorkerAuth","Wrapped":{}},"id":"e_7xsqFYF6zu","version":"v0.1","op":"worker.(Worker).startAuthRotationTicking"},"datacontentype":"application/cloudevents","time":"2024-02-03T01:38:58.625150281Z"}
{"id":"Ragdo0DSxX","source":"https://hashicorp.com/boundary/boundary-worker-66445d86b6-x9gd9/worker","specversion":"1.0","type":"error","data":{"error":"rpc error: code = Unavailable desc = last connection error: connection error: desc = \"transport: Error while dialing: worker.(Worker).upstreamDialerFunc: unknown, unknown: error #0: (nodeenrollment.protocol.Dial) errors encountered attempting to create client tls connection: 1 error occurred:\\n\\t* error handshaking tls connection: remote error: tls: internal error\\n\\n\"","error_fields":{},"id":"e_LJfvuON6tt","version":"v0.1","op":"worker.(Worker).sendWorkerStatus","info":{"msg":"error making status request to controller"}},"datacontentype":"application/cloudevents","time":"2024-02-03T01:38:58.879425205Z"}
{"id":"pozsmYnkwu","source":"https://hashicorp.com/boundary/boundary-worker-66445d86b6-x9gd9/worker","specversion":"1.0","type":"error","data":{"error":"status error grace period has expired, canceling all sessions on worker","error_fields":{},"id":"e_GNNuvOqBuI","version":"v0.1","op":"worker.(Worker).sendWorkerStatus","info":{"grace_period":15000000000,"last_status_time":"2024-02-03 00:41:43.213348054 +0000 UTC m=+8.620880127"}},"datacontentype":"application/cloudevents","time":"2024-02-03T01:38:58.879560498Z"}
{"id":"4J84sZw49E","source":"https://hashicorp.com/boundary/boundary-worker-66445d86b6-x9gd9/worker","specversion":"1.0","type":"error","data":{"error":"(nodeenrollment.protocol.Dial) errors encountered attempting to create client tls connection: 1 error occurred:\n\t* error handshaking tls connection: remote error: tls: internal error\n\n","error_fields":{},"id":"e_2trL7wknRf","version":"v0.1","op":"worker.(Worker).upstreamDialerFunc"},"datacontentype":"application/cloudevents","time":"2024-02-03T01:38:59.627931035Z"}

Is there anything I am missing in my configuration files between these versions that would cause this issue?

EDIT: It looks like I only get this issue when I have multiple workers running in the deployment. With a single worker I don't get the TLS error. I still can't figure out why, after the version update, multiple worker pods trigger the TLS issue.

@jeff @omkensey

Would you happen to know why scaling out workers won't auto-register them with the controller in a Kubernetes setting when using KMS-based authentication after the version update?

Hello,

I’m not sure why you’re seeing those errors, although the controller logs will likely have more information (because Go’s TLS library purposefully doesn’t divulge much to clients, these issues are mostly logged on the server/controller side). You can also set enable_worker_auth_debugging = true at the top level of the controller and worker configs to have them emit more information that may help.
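For reference, a minimal sketch of where that flag would sit, alongside the top-level settings already in the configs above:

# Top level of the controller or worker config (same level as disable_mlock)
disable_mlock                = true
log_format                   = "standard"
enable_worker_auth_debugging = true  # emit extra worker-auth/TLS detail to the logs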

Starting with version 0.13.0, the distinction between KMS and PKI workers is no longer meaningful, as all workers end up using the same mechanism post-registration. In 0.15.1 we'll be removing that column from the UI.

We had this issue, and I think it's because the controller and worker names need to be unique. So in Kubernetes I needed to use an env var that was unique to the pod for the names.

As far as I understand, worker names are set in the worker configuration file, which in my case comes from a ConfigMap. How did you assign a different name to each worker in your ConfigMaps?

So in my deployment I did something like this:

env:
  # Expose the pod's own name to the container
  - name: POD_NAME
    valueFrom:
      fieldRef:
        fieldPath: metadata.name
  # Build unique Boundary names from it; $(POD_NAME) expands because
  # POD_NAME is defined earlier in this env list
  - name: CONTROLLER_NAME
    value: "controller-$(POD_NAME)"
  - name: WORKER_NAME
    value: "worker-$(POD_NAME)"

Then in the ConfigMap I can just reference those env vars:

name = "env://CONTROLLER_NAME"

Thanks for sharing. Did you expose the workers using an NLB? If so, how did you manage to expose each of them?

Yeah, I only just noticed it's not recommended to load balance workers, but it seems to work just fine with them behind an NLB? We are still experimenting with the setup, and maybe there will be better support for this in the future. But it doesn't seem ideal to try to give multiple worker pods their own public addresses…

Using an NLB means Boundary cannot ensure proper routing from any specific ingress worker to the required egress worker. For instance, if you had two workers, one tagged w1 and one tagged w2, and put both behind an NLB, and you ended up needing to ensure a specific session comes in through the worker tagged w2, you would be unable to do so. But for simple deployments it should be fine, as long as the load balancing is at the TCP level.
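For illustration, a rough sketch of how the w1/w2 tags from that example could be declared on the workers (the tag key "type", the worker names, and the filter expression are assumptions for the sake of the example, not taken from the configs above):

# Worker tagged "w1"
worker {
  name = "worker-1"
  tags {
    type = ["w1"]
  }
}

# Worker tagged "w2"
worker {
  name = "worker-2"
  tags {
    type = ["w2"]
  }
}

# A target's worker filter could then pin sessions to the w2 worker with an
# expression along the lines of:  "w2" in "/tags/type"
# But once both workers sit behind one NLB address, the client still reaches
# whichever pod the NLB happens to pick.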

I should mention that NLBs are also fine if you want to, e.g., reuse a single public address and have different ports map to different workers, as long as you configure each worker's public address appropriately. That way you get the benefits of an NLB along with tight control over ingress paths.
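A rough sketch of that pattern, assuming two hypothetical NLB listeners (9202 and 9203) that each forward to a single worker pod; each worker advertises the NLB address with its own port, so this is per-pod config rather than one shared stanza:

# Worker pod A -- reached via NLB listener 9202
worker {
  name              = "env://WORKER_NAME"
  public_addr       = "mypublicaddr.com:9202"
  initial_upstreams = ["boundary.boundary:9201"]
}

# Worker pod B -- reached via NLB listener 9203
worker {
  name              = "env://WORKER_NAME"
  public_addr       = "mypublicaddr.com:9203"
  initial_upstreams = ["boundary.boundary:9201"]
}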