Boundary worker shuts down because awskms plugin fails

I have a worker that successfully authenticates to the controller via awskms. After about 20 seconds, the worker then shuts down with the following error message.

    * Error finalizing kms of type aead and purpose worker-auth: rpc error: code = Unavailable desc = connection error: desc = "transport: error while dialing: dial unix /tmp/plugin435441019: connect: connection refused"

Here are more logs:

    {
        "id": "np5TnlZQdS",
        "source": "https://hashicorp.com/boundary/deployment-boundary-577785657-ftxbh/worker",
        "specversion": "1.0",
        "type": "system",
        "data": {
            "version": "v0.1",
            "op": "worker.(Worker).upstreamDialerFunc",
            "data": {
                "msg": "worker has successfully authenticated"
            }
        },
        "datacontentype": "application/cloudevents",
        "time": "2023-02-16T20:18:11.332569253Z"
    }
    {
        "id": "LABFVH4KqP",
        "source": "https://hashicorp.com/boundary/deployment-boundary-577785657-ftxbh/worker",
        "specversion": "1.0",
        "type": "system",
        "data": {
            "version": "v0.1",
            "op": "worker.(Worker).updateAddrs",
            "data": {
                "msg": "Upstreams after first status set to: [<hostname-redacted>:9201]"
            }
        },
        "datacontentype": "application/cloudevents",
        "time": "2023-02-16T20:18:11.628239255Z"
    }
    {
        "id": "Jil2dSsgLZ",
        "source": "https://hashicorp.com/boundary/deployment-boundary-577785657-ftxbh/worker",
        "specversion": "1.0",
        "type": "system",
        "data": {
            "version": "v0.1",
            "op": "worker.(Worker).GracefulShutdown",
            "data": {
                "msg": "worker entering graceful shutdown"
            }
        },
        "datacontentype": "application/cloudevents",
        "time": "2023-02-16T20:18:33.39060176Z"
    }
    {
        "id": "eCKEco8cHI",
        "source": "https://hashicorp.com/boundary/deployment-boundary-577785657-ftxbh/worker",
        "specversion": "1.0",
        "type": "system",
        "data": {
            "version": "v0.1",
            "op": "github.com/hashicorp/boundary/internal/observability/event.(*HclogLoggerAdapter).writeEvent",
            "data": {
                "@original-log-level": "none",
                "@original-log-name": "awskms.stdio",
                "err": {},
                "msg": "received EOF, stopping recv loop",
                "purpose": "worker-auth"
            }
        },
        "datacontentype": "application/cloudevents",
        "time": "2023-02-16T20:18:33.391413818Z"
    }
    {
        "id": "nVOZKIWExw",
        "source": "https://hashicorp.com/boundary/deployment-boundary-577785657-ftxbh/worker",
        "specversion": "1.0",
        "type": "system",
        "data": {
            "version": "v0.1",
            "op": "github.com/hashicorp/boundary/internal/observability/event.(*HclogLoggerAdapter).writeEvent",
            "data": {
                "@original-log-level": "none",
                "@original-log-name": "awskms",
                "error": "signal: terminated",
                "msg": "plugin process exited",
                "path": "/tmp/3381329121/boundary-plugin-kms-awskms-kIDyD",
                "pid": 25,
                "purpose": "worker-auth"
            }
        },
        "datacontentype": "application/cloudevents",
        "time": "2023-02-16T20:18:33.391638246Z"
    }

I have not been able to find anything about this via Google.

What does your worker KMS config look like? How about the controller?

Controller config:

    disable_mlock = true

    controller {
      name = "kubernetes-controller"
      description = "A controller for a kubernetes demo!"
      database {
        url = "env://BOUNDARY_PG_URL"
      }
      public_cluster_addr = "env://BOUNDARY_CONTROLLER_URL"
    }

    listener "tcp" {
      address = "0.0.0.0"
      purpose = "api"
      tls_disable = true
    }

    listener "tcp" {
      address = "0.0.0.0"
      purpose = "cluster"
      tls_disable = false
      tls_cert_file = "/etc/pki/tls/certs/tls.crt"
      tls_key_file = "/etc/pki/tls/certs/tls.key"
    }

    kms "aead" {
      purpose = "root"
      aead_type = "aes-gcm"
      key = "<redacted>"
      key_id = "global_root"
    }

    kms "awskms" {
      purpose = "worker-auth"
      kms_key_id = "<redacted>"
      key_id = "global_worker-auth"
    }

    kms "aead" {
      purpose = "recovery"
      aead_type = "aes-gcm"
      key = "<redacted>"
      key_id = "global_recovery"
    }

Worker config:

    disable_mlock = true

    worker {
      name = "kubernetes-worker"
      description = "A worker for a kubernetes demo"
      initial_upstreams = ["<redacted>"]
      public_addr = "<redacted>"
    }

    listener "tcp" {
      address = "0.0.0.0"
      purpose = "proxy"
      tls_disable = false
      tls_cert_file = "/etc/pki/tls/certs/tls.crt"
      tls_key_file = "/etc/pki/tls/certs/tls.key"
    }

    kms "awskms" {
      purpose = "worker-auth"
      kms_key_id = "<redacted>"
      key_id = "global_worker-auth"
    }

I wonder if mixing aead keys and awskms keys is causing some sort of issue. (It shouldn’t, but it’s complaining about aead on an awskms worker-auth key, so maybe?)

What happens if you make them all aead or all awskms?
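
For example, an all-aead test would reuse the same style of block the controller already has for its root and recovery keys; note that with aead the worker-auth key material has to be identical on the controller and the worker. A sketch with a placeholder key:

    kms "aead" {
      purpose   = "worker-auth"
      aead_type = "aes-gcm"
      key       = "<same-base64-aes-gcm-key-on-controller-and-worker>"
      key_id    = "global_worker-auth"
    }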

Can you post the full output from both the controller and the worker? That error message is something that happens during shutdown, but workers don’t (currently) use plugins. That makes me think that the error is on the controller side but something is getting confused.

@jimlambrt any thoughts on why we’d be seeing that “worker” source for messages coming from the controller?

Another oddity is that neither AEAD nor AWS KMS implementations use finalizers, so that error should never be seen because we should detect that there is no finalizer and not attempt to run it.

Can you also let us know what version you’re running and where you got it from?

When I attempt to start the controller with the same awskms key for the root purpose, I get this error on controller startup.

    Error initializing controller: error reconciling kms keys: kms.ReconcileKeys: error creating audit key in scope global: unknown: error #0: kms.(Kms).ReconcileKeys: kms.(Kms).GetWrapper: error loading root key for scope "global": kms.(Kms).loadRoot: error looking up root key versions for scope "global": kms.(repository).ListRootKeyVersions: error decrypting key num 0: kms.(rootKeyVersion).Decrypt: unable to decrypt: error unwrapping value: rpc error: code = Unknown desc = error decrypting data: InvalidCiphertextException:

After modifying my controller keys, they looked like:

    kms "awskms" {
      purpose = "root"
      kms_key_id = "<redacted-key>"
      key_id = "global_root"
    }

    kms "awskms" {
      purpose = "worker-auth"
      kms_key_id = "<same-as-readacted-key-above>"
      key_id = "global_worker-auth"
    }

Attached are the controller and worker logs:
controller.txt (66.1 KB)
worker.txt (6.4 KB)

It looks like I’m running Boundary v0.12.0. I’m running in EKS with the hashicorp/boundary:latest image. Let me try a previous version.

Did you happen to initialize the database with one KMS block and then change your KMS block later? It looks like the KMS that was used for initializing Boundary is not the same as the one being used now, hence the ciphertext decryption errors.
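
If the data is disposable (as in a demo), the simplest path is usually to drop and re-create the database and then re-run init against the current KMS configuration, for example (config path is illustrative):

    boundary database init -config /etc/boundary/controller.hcl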

Ah, yes. Re-creating the database with the awskms key solved the controller issue.
Here are the logs with both the root and worker-auth keys using awskms:

    Error running shutdown tasks: 2 errors occurred:
        * Error finalizing kms of type awskms and purpose root: rpc error: code = Unavailable desc = connection error: desc = "transport: error while dialing: dial unix /tmp/plugin51419437: connect: connection refused"
        * Error finalizing kms of type awskms and purpose worker-auth: rpc error: code = Unavailable desc = connection error: desc = "transport: error while dialing: dial unix /tmp/plugin1878380687: connect: connection refused"

worker-awskms.txt (15.7 KB)

Is there a way to enable more verbose logging?

When attempting to use all aead keys, the worker still shuts down, but after a longer time and without the finalizing errors from the awskms KMS blocks.

│ Couldn't start Boundary with IPC_LOCK. Disabling IPC_LOCK, please use --privileged or --cap-add IPC_LOCK                                                                                                                                   │
│ ==> Boundary server configuration:                                                                                                                                                                                                         │
│                                                                                                                                                                                                                                            │
│           [Root] AEAD Type: aes-gcm                                                                                                                                                                                                        │
│    [Worker-Auth] AEAD Type: aes-gcm                                                                                                                                                                                                        │
│                        Cgo: disabled                                                                                                                                                                                                       │
│                 Listener 1: tcp (addr: "0.0.0.0:9202", max_request_duration: "1m30s", purpose: "proxy")                                                                                                                                    │
│                  Log Level: info                                                                                                                                                                                                           │
│                      Mlock: supported: true, enabled: false                                                                                                                                                                                │
│                    Version: Boundary v0.12.0                                                                                                                                                                                               │
│                Version Sha: 034f27ec8528251ce23b5afbf3851c48b0c54bec                                                                                                                                                                       │
│   Worker Public Proxy Addr: worker.boundary.jpdev.rescale.com:9202                                                                                                                                                                         │
│                                                                                                                                                                                                                                            │
│ ==> Boundary server started! Log data will stream in below:                                                                                                                                                                                │

│ {"id":"QWcqW6NgXd","source":"https://hashicorp.com/boundary/deployment-boundary-77884b6469-2xlkq/worker","specversion":"1.0","type":"system","data":{"version":"v0.1","op":"worker.(Worker).startAuthRotationTicking","data":{"msg":"using │
│ {"id":"BJQmDY5M96","source":"https://hashicorp.com/boundary/deployment-boundary-77884b6469-2xlkq/worker","specversion":"1.0","type":"system","data":{"version":"v0.1","op":"worker.(Worker).upstreamDialerFunc","data":{"msg":"worker has  │
│ {"id":"VDT7DrS9VC","source":"https://hashicorp.com/boundary/deployment-boundary-77884b6469-2xlkq/worker","specversion":"1.0","type":"system","data":{"version":"v0.1","op":"worker.(Worker).updateAddrs","data":{"msg":"Upstreams after fi │
│ ==> Boundary server graceful shutdown triggered, interrupt again to enter shutdown                                                                                                                                                         │

Hi there – those logs are truncated, so we can’t actually see the error message 🙂

Whoops, here you go. I enabled trace-level logging, but I don’t see any more information.
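
(For anyone else looking for it: log verbosity can be set via the top-level log_level setting in the server HCL, e.g.:)

    log_level = "trace"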

Could not chown /boundary (may not have appropriate permissions)
chown: /boundary/..2023_02_22_16_56_13.151611438/boundary.hcl: Read-only file system
chown: /boundary/..2023_02_22_16_56_13.151611438: Read-only file system
chown: /boundary/..2023_02_22_16_56_13.151611438: Read-only file system
chown: /boundary/..data: Read-only file system
chown: /boundary/boundary.hcl: Read-only file system
chown: /boundary: Read-only file system
chown: /boundary: Read-only file system
Couldn't start Boundary with IPC_LOCK. Disabling IPC_LOCK, please use --privileged or --cap-add IPC_LOCK
==> Boundary server configuration:

          [Root] AEAD Type: aes-gcm
   [Worker-Auth] AEAD Type: aes-gcm
                       Cgo: disabled
                Listener 1: tcp (addr: "0.0.0.0:9202", max_request_duration: "1m30s", purpose: "proxy")
                 Log Level: trace
                     Mlock: supported: true, enabled: false
                   Version: Boundary v0.12.0
               Version Sha: 034f27ec8528251ce23b5afbf3851c48b0c54bec
  Worker Public Proxy Addr: worker.boundary.jpdev.rescale.com:9202

==> Boundary server started! Log data will stream in below:

{"id":"MKvUgy2QWx","source":"https://hashicorp.com/boundary/deployment-boundary-cc8bb4696-v4ht9/worker","specversion":"1.0","type":"system","data":{"version":"v0.1","op":"worker.(Worker).startAuthRotationTicking","data":{"msg":"using kms worker authentication; pki auth rotation ticking not running"}},"datacontentype":"application/cloudevents","time":"2023-02-22T16:59:39.245711233Z"}
{"id":"md6cfIYOsE","source":"https://hashicorp.com/boundary/deployment-boundary-cc8bb4696-v4ht9/worker","specversion":"1.0","type":"system","data":{"version":"v0.1","op":"worker.(Worker).upstreamDialerFunc","data":{"msg":"worker has successfully authenticated"}},"datacontentype":"application/cloudevents","time":"2023-02-22T16:59:39.557823931Z"}
{"id":"e9W0UAAf1A","source":"https://hashicorp.com/boundary/deployment-boundary-cc8bb4696-v4ht9/worker","specversion":"1.0","type":"system","data":{"version":"v0.1","op":"worker.(Worker).updateAddrs","data":{"msg":"Upstreams after first status set to: [controller.boundary.jpdev.rescale.com:9201]"}},"datacontentype":"application/cloudevents","time":"2023-02-22T16:59:39.882427022Z"}
{"id":"K3GqEWi117","source":"https://hashicorp.com/boundary/deployment-boundary-cc8bb4696-v4ht9/worker","specversion":"1.0","type":"system","data":{"version":"v0.1","op":"worker.(Worker).GracefulShutdown","data":{"msg":"worker entering graceful shutdown"}},"datacontentype":"application/cloudevents","time":"2023-02-22T17:00:03.47580557Z"}
==> Boundary server graceful shutdown triggered, interrupt again to enter shutdown
{"id":"1elZ0BULSt","source":"https://hashicorp.com/boundary/deployment-boundary-cc8bb4696-v4ht9/worker","specversion":"1.0","type":"system","data":{"version":"v0.1","op":"worker.(Worker).GracefulShutdown","data":{"msg":"worker connections have drained"}},"datacontentype":"application/cloudevents","time":"2023-02-22T17:00:06.734552831Z"}
{"id":"XkRHe2cxzO","source":"https://hashicorp.com/boundary/deployment-boundary-cc8bb4696-v4ht9/worker","specversion":"1.0","type":"system","data":{"version":"v0.1","op":"worker.(Worker).Shutdown","data":{"msg":"worker shutting down"}},"datacontentype":"application/cloudevents","time":"2023-02-22T17:00:06.734709158Z"}
{"id":"UliTi1hb9y","source":"https://hashicorp.com/boundary/deployment-boundary-cc8bb4696-v4ht9/worker","specversion":"1.0","type":"system","data":{"version":"v0.1","op":"worker.(Worker).startStatusTicking","data":{"msg":"status ticking shutting down"}},"datacontentype":"application/cloudevents","time":"2023-02-22T17:00:08.736724873Z"}
{"id":"4JAAmLa6Ya","source":"https://hashicorp.com/boundary/deployment-boundary-cc8bb4696-v4ht9/worker","specversion":"1.0","type":"system","data":{"version":"v0.1","op":"worker.(Worker).Shutdown","data":{"msg":"worker finished shutting down"}},"datacontentype":"application/cloudevents","time":"2023-02-22T17:00:08.73687471Z"}

Ok, I figured out the issue. This was a configuration issue with my Kubernetes container specification, not an issue with the Boundary configuration. Since I am using an AWS NLB, I had to change the livenessProbe from httpGet to tcpSocket. The worker’s proxy listener isn’t a plain HTTP endpoint, so the httpGet probe kept failing and Kubernetes kept terminating the pod, which is what was triggering the graceful shutdowns.
Original:

          livenessProbe:
            httpGet:
              path: "/"
              port: 9202
          readinessProbe:
            httpGet:
              path: "/"
              port: 9202

New:

          livenessProbe:
            tcpSocket:
              port: 9202
            initialDelaySeconds: 10
            periodSeconds: 30
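
(If a readinessProbe is still wanted, the same tcpSocket form should work, e.g.:)

          readinessProbe:
            tcpSocket:
              port: 9202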