Boundary worker shuts down because awskms plugin fails

I have a worker that successfully authenticates to the controller via awskms. After about 20 seconds, the worker then shuts down with the following error message.

    * Error finalizing kms of type aead and purpose worker-auth: rpc error: code = Unavailable desc = connection error: desc = "transport: error while dialing: dial unix /tmp/plugin435441019: connect: connection refused"

Here are more logs:

    {
        "id": "np5TnlZQdS",
        "source": "https://hashicorp.com/boundary/deployment-boundary-577785657-ftxbh/worker",
        "specversion": "1.0",
        "type": "system",
        "data": {
            "version": "v0.1",
            "op": "worker.(Worker).upstreamDialerFunc",
            "data": {
                "msg": "worker has successfully authenticated"
            }
        },
        "datacontentype": "application/cloudevents",
        "time": "2023-02-16T20:18:11.332569253Z"
    }
    {
        "id": "LABFVH4KqP",
        "source": "https://hashicorp.com/boundary/deployment-boundary-577785657-ftxbh/worker",
        "specversion": "1.0",
        "type": "system",
        "data": {
            "version": "v0.1",
            "op": "worker.(Worker).updateAddrs",
            "data": {
                "msg": "Upstreams after first status set to: [<hostname-redacted>:9201]"
            }
        },
        "datacontentype": "application/cloudevents",
        "time": "2023-02-16T20:18:11.628239255Z"
    }
    {
        "id": "Jil2dSsgLZ",
        "source": "https://hashicorp.com/boundary/deployment-boundary-577785657-ftxbh/worker",
        "specversion": "1.0",
        "type": "system",
        "data": {
            "version": "v0.1",
            "op": "worker.(Worker).GracefulShutdown",
            "data": {
                "msg": "worker entering graceful shutdown"
            }
        },
        "datacontentype": "application/cloudevents",
        "time": "2023-02-16T20:18:33.39060176Z"
    }
    {
        "id": "eCKEco8cHI",
        "source": "https://hashicorp.com/boundary/deployment-boundary-577785657-ftxbh/worker",
        "specversion": "1.0",
        "type": "system",
        "data": {
            "version": "v0.1",
            "op": "github.com/hashicorp/boundary/internal/observability/event.(*HclogLoggerAdapter).writeEvent",
            "data": {
                "@original-log-level": "none",
                "@original-log-name": "awskms.stdio",
                "err": {},
                "msg": "received EOF, stopping recv loop",
                "purpose": "worker-auth"
            }
        },
        "datacontentype": "application/cloudevents",
        "time": "2023-02-16T20:18:33.391413818Z"
    }
    {
        "id": "nVOZKIWExw",
        "source": "https://hashicorp.com/boundary/deployment-boundary-577785657-ftxbh/worker",
        "specversion": "1.0",
        "type": "system",
        "data": {
            "version": "v0.1",
            "op": "github.com/hashicorp/boundary/internal/observability/event.(*HclogLoggerAdapter).writeEvent",
            "data": {
                "@original-log-level": "none",
                "@original-log-name": "awskms",
                "error": "signal: terminated",
                "msg": "plugin process exited",
                "path": "/tmp/3381329121/boundary-plugin-kms-awskms-kIDyD",
                "pid": 25,
                "purpose": "worker-auth"
            }
        },
        "datacontentype": "application/cloudevents",
        "time": "2023-02-16T20:18:33.391638246Z"
    }

I have not been able to find anything about this via Google.

What does your worker KMS config look like? How about the controller?

Controller config:

    disable_mlock = true

    controller {
      name = "kubernetes-controller"
      description = "A controller for a kubernetes demo!"
      database {
        url = "env://BOUNDARY_PG_URL"
      }
      public_cluster_addr = "env://BOUNDARY_CONTROLLER_URL"
    }

    listener "tcp" {
      address = "0.0.0.0"
      purpose = "api"
      tls_disable = true
    }

    listener "tcp" {
      address = "0.0.0.0"
      purpose = "cluster"
      tls_disable = false
      tls_cert_file = "/etc/pki/tls/certs/tls.crt"
      tls_key_file = "/etc/pki/tls/certs/tls.key"
    }

    kms "aead" {
      purpose = "root"
      aead_type = "aes-gcm"
      key = "<redacted>"
      key_id = "global_root"
    }

    kms "awskms" {
      purpose = "worker-auth"
      kms_key_id = "<redacted>"
      key_id = "global_worker-auth"
    }

    kms "aead" {
      purpose = "recovery"
      aead_type = "aes-gcm"
      key = "<redacted>"
      key_id = "global_recovery"
    }

Worker config:

    disable_mlock = true

    worker {
      name = "kubernetes-worker"
      description = "A worker for a kubernetes demo"
      initial_upstreams = ["<redacted>"]
      public_addr = "<redacted>"
    }

    listener "tcp" {
      address = "0.0.0.0"
      purpose = "proxy"
      tls_disable = false
      tls_cert_file = "/etc/pki/tls/certs/tls.crt"
      tls_key_file = "/etc/pki/tls/certs/tls.key"
    }

    kms "awskms" {
      purpose = "worker-auth"
      kms_key_id = "<redacted>"
      key_id = "global_worker-auth"
    }

I wonder if mixing aead keys and awskms keys is causing some sort of issue. (It shouldn’t, but it’s complaining about aead on an awskms worker-auth key, so maybe?)

What happens if you make them all aead or all awskms?
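
For example, an all-aead test would reuse the same style of block the controller already has for its root and recovery keys; note that with aead the worker-auth key material has to be identical on the controller and the worker. A sketch with a placeholder key:

    kms "aead" {
      purpose   = "worker-auth"
      aead_type = "aes-gcm"
      key       = "<same-base64-aes-gcm-key-on-controller-and-worker>"
      key_id    = "global_worker-auth"
    }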

Can you post the full output from both the controller and the worker? That error message is something that happens during shutdown, but workers don’t (currently) use plugins. That makes me think that the error is on the controller side but something is getting confused.

@jimlambrt any thoughts on why we’d be seeing that “worker” source for messages coming from the controller?

Another oddity is that neither AEAD nor AWS KMS implementations use finalizers, so that error should never be seen because we should detect that there is no finalizer and not attempt to run it.

Can you also let us know what version you’re running and where you got it from?

When I attempt to start the controller with the same awskms key for the root purpose, I get this error on controller startup.

    Error initializing controller: error reconciling kms keys: kms.ReconcileKeys: error creating audit key in scope global: unknown: error #0: kms.(Kms).ReconcileKeys: kms.(Kms).GetWrapper: error loading root key for scope "global": kms.(Kms).loadRoot: error looking up root key versions for scope "global": kms.(repository).ListRootKeyVersions: error decrypting key num 0: kms.(rootKeyVersion).Decrypt: unable to decrypt: error unwrapping value: rpc error: code = Unknown desc = error decrypting data: InvalidCiphertextException:

After modifying my controller keys, they looked like:

    kms "awskms" {
      purpose = "root"
      kms_key_id = "<redacted-key>"
      key_id = "global_root"
    }

    kms "awskms" {
      purpose = "worker-auth"
      kms_key_id = "<same-as-readacted-key-above>"
      key_id = "global_worker-auth"
    }

Attached are the controller and worker logs:
controller.txt (66.1 KB)
worker.txt (6.4 KB)

It looks like I’m running Boundary v0.12.0. I’m running in EKS with the hashicorp/boundary:latest image. Let me try a previous version.

Did you happen to initialize the database with one KMS block and then change your KMS block later? It looks like the KMS that was used for initializing Boundary is not the same as the one being used now, hence the ciphertext decryption errors.
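
If the data is disposable (as in a demo), the simplest path is usually to drop and re-create the database and then re-run init against the current KMS configuration, for example (config path is illustrative):

    boundary database init -config /etc/boundary/controller.hcl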

Ah, yes. Re-creating the database with the awskms key solved the controller issue.
Here are the logs with both the root and worker-auth keys using awskms:

    Error running shutdown tasks: 2 errors occurred:
        * Error finalizing kms of type awskms and purpose root: rpc error: code = Unavailable desc = connection error: desc = "transport: error while dialing: dial unix /tmp/plugin51419437: connect: connection refused"
        * Error finalizing kms of type awskms and purpose worker-auth: rpc error: code = Unavailable desc = connection error: desc = "transport: error while dialing: dial unix /tmp/plugin1878380687: connect: connection refused"

worker-awskms.txt (15.7 KB)

Is there a way to enable more verbose logging?

When attempting to use all aead keys, the worker still shuts down, but after a longer time and without the finalizing errors from the awskms KMS blocks.

│ Couldn't start Boundary with IPC_LOCK. Disabling IPC_LOCK, please use --privileged or --cap-add IPC_LOCK                                                                                                                                   │
│ ==> Boundary server configuration:                                                                                                                                                                                                         │
│                                                                                                                                                                                                                                            │
│           [Root] AEAD Type: aes-gcm                                                                                                                                                                                                        │
│    [Worker-Auth] AEAD Type: aes-gcm                                                                                                                                                                                                        │
│                        Cgo: disabled                                                                                                                                                                                                       │
│                 Listener 1: tcp (addr: "0.0.0.0:9202", max_request_duration: "1m30s", purpose: "proxy")                                                                                                                                    │
│                  Log Level: info                                                                                                                                                                                                           │
│                      Mlock: supported: true, enabled: false                                                                                                                                                                                │
│                    Version: Boundary v0.12.0                                                                                                                                                                                               │
│                Version Sha: 034f27ec8528251ce23b5afbf3851c48b0c54bec                                                                                                                                                                       │
│   Worker Public Proxy Addr: worker.boundary.jpdev.rescale.com:9202                                                                                                                                                                         │
│                                                                                                                                                                                                                                            │
│ ==> Boundary server started! Log data will stream in below:                                                                                                                                                                                │

│ {"id":"QWcqW6NgXd","source":"https://hashicorp.com/boundary/deployment-boundary-77884b6469-2xlkq/worker","specversion":"1.0","type":"system","data":{"version":"v0.1","op":"worker.(Worker).startAuthRotationTicking","data":{"msg":"using │
│ {"id":"BJQmDY5M96","source":"https://hashicorp.com/boundary/deployment-boundary-77884b6469-2xlkq/worker","specversion":"1.0","type":"system","data":{"version":"v0.1","op":"worker.(Worker).upstreamDialerFunc","data":{"msg":"worker has  │
│ {"id":"VDT7DrS9VC","source":"https://hashicorp.com/boundary/deployment-boundary-77884b6469-2xlkq/worker","specversion":"1.0","type":"system","data":{"version":"v0.1","op":"worker.(Worker).updateAddrs","data":{"msg":"Upstreams after fi │
│ ==> Boundary server graceful shutdown triggered, interrupt again to enter shutdown                                                                                                                                                         │

Hi there – those logs are truncated, so we can’t actually see the error message 🙂

Whoops, here you go. I enabled trace-level logging, but I don’t see any more information.
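
(For anyone else looking for it: log verbosity can be set via the top-level log_level setting in the server HCL, e.g.:)

    log_level = "trace"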

Could not chown /boundary (may not have appropriate permissions)
chown: /boundary/..2023_02_22_16_56_13.151611438/boundary.hcl: Read-only file system
chown: /boundary/..2023_02_22_16_56_13.151611438: Read-only file system
chown: /boundary/..2023_02_22_16_56_13.151611438: Read-only file system
chown: /boundary/..data: Read-only file system
chown: /boundary/boundary.hcl: Read-only file system
chown: /boundary: Read-only file system
chown: /boundary: Read-only file system
Couldn't start Boundary with IPC_LOCK. Disabling IPC_LOCK, please use --privileged or --cap-add IPC_LOCK
==> Boundary server configuration:

          [Root] AEAD Type: aes-gcm
   [Worker-Auth] AEAD Type: aes-gcm
                       Cgo: disabled
                Listener 1: tcp (addr: "0.0.0.0:9202", max_request_duration: "1m30s", purpose: "proxy")
                 Log Level: trace
                     Mlock: supported: true, enabled: false
                   Version: Boundary v0.12.0
               Version Sha: 034f27ec8528251ce23b5afbf3851c48b0c54bec
  Worker Public Proxy Addr: worker.boundary.jpdev.rescale.com:9202

==> Boundary server started! Log data will stream in below:

{"id":"MKvUgy2QWx","source":"https://hashicorp.com/boundary/deployment-boundary-cc8bb4696-v4ht9/worker","specversion":"1.0","type":"system","data":{"version":"v0.1","op":"worker.(Worker).startAuthRotationTicking","data":{"msg":"using kms worker authentication; pki auth rotation ticking not running"}},"datacontentype":"application/cloudevents","time":"2023-02-22T16:59:39.245711233Z"}
{"id":"md6cfIYOsE","source":"https://hashicorp.com/boundary/deployment-boundary-cc8bb4696-v4ht9/worker","specversion":"1.0","type":"system","data":{"version":"v0.1","op":"worker.(Worker).upstreamDialerFunc","data":{"msg":"worker has successfully authenticated"}},"datacontentype":"application/cloudevents","time":"2023-02-22T16:59:39.557823931Z"}
{"id":"e9W0UAAf1A","source":"https://hashicorp.com/boundary/deployment-boundary-cc8bb4696-v4ht9/worker","specversion":"1.0","type":"system","data":{"version":"v0.1","op":"worker.(Worker).updateAddrs","data":{"msg":"Upstreams after first status set to: [controller.boundary.jpdev.rescale.com:9201]"}},"datacontentype":"application/cloudevents","time":"2023-02-22T16:59:39.882427022Z"}
{"id":"K3GqEWi117","source":"https://hashicorp.com/boundary/deployment-boundary-cc8bb4696-v4ht9/worker","specversion":"1.0","type":"system","data":{"version":"v0.1","op":"worker.(Worker).GracefulShutdown","data":{"msg":"worker entering graceful shutdown"}},"datacontentype":"application/cloudevents","time":"2023-02-22T17:00:03.47580557Z"}
==> Boundary server graceful shutdown triggered, interrupt again to enter shutdown
{"id":"1elZ0BULSt","source":"https://hashicorp.com/boundary/deployment-boundary-cc8bb4696-v4ht9/worker","specversion":"1.0","type":"system","data":{"version":"v0.1","op":"worker.(Worker).GracefulShutdown","data":{"msg":"worker connections have drained"}},"datacontentype":"application/cloudevents","time":"2023-02-22T17:00:06.734552831Z"}
{"id":"XkRHe2cxzO","source":"https://hashicorp.com/boundary/deployment-boundary-cc8bb4696-v4ht9/worker","specversion":"1.0","type":"system","data":{"version":"v0.1","op":"worker.(Worker).Shutdown","data":{"msg":"worker shutting down"}},"datacontentype":"application/cloudevents","time":"2023-02-22T17:00:06.734709158Z"}
{"id":"UliTi1hb9y","source":"https://hashicorp.com/boundary/deployment-boundary-cc8bb4696-v4ht9/worker","specversion":"1.0","type":"system","data":{"version":"v0.1","op":"worker.(Worker).startStatusTicking","data":{"msg":"status ticking shutting down"}},"datacontentype":"application/cloudevents","time":"2023-02-22T17:00:08.736724873Z"}
{"id":"4JAAmLa6Ya","source":"https://hashicorp.com/boundary/deployment-boundary-cc8bb4696-v4ht9/worker","specversion":"1.0","type":"system","data":{"version":"v0.1","op":"worker.(Worker).Shutdown","data":{"msg":"worker finished shutting down"}},"datacontentype":"application/cloudevents","time":"2023-02-22T17:00:08.73687471Z"}

Ok, I figured out the issue. This was a configuration issue with my Kubernetes container specification, not an issue with the Boundary configuration. Since I am using an AWS NLB, I had to change the livenessProbe from httpGet to tcpSocket. The worker’s proxy listener isn’t a plain HTTP endpoint, so the httpGet probe kept failing and Kubernetes kept terminating the pod, which is what was triggering the graceful shutdowns.
Original:

          livenessProbe:
            httpGet:
              path: "/"
              port: 9202
          readinessProbe:
            httpGet:
              path: "/"
              port: 9202

New:

          livenessProbe:
            tcpSocket:
              port: 9202
            initialDelaySeconds: 10
            periodSeconds: 30
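
(If a readinessProbe is still wanted, the same tcpSocket form should work, e.g.:)

          readinessProbe:
            tcpSocket:
              port: 9202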