Boundary database migrate OOMKilled

I am trying to run Boundary inside my K8s cluster. My init container runs boundary database migrate -config /boundary/config.hcl || boundary database init -skip-auth-method-creation -skip-host-resources-creation -skip-scopes-creation -skip-target-creation -config /boundary/config.hcl; the init succeeds, but the migrate gets OOMKilled. I have dedicated 8G of memory and 4 CPUs to the container, but it still gets killed.
Any help/advice appreciated.

What does your manifest for the Deployment/Job look like?

apiVersion: apps/v1
kind: Deployment
metadata:
  name: boundary
spec:
  revisionHistoryLimit: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: boundary
  template:
    metadata:
      labels:
        app.kubernetes.io/name: boundary
      annotations:
        vault.security.banzaicloud.io/vault-env-daemon: "true"
    spec:
      serviceAccount: boundary
      securityContext:
        fsGroup: 1000
      initContainers:
        - name: boundary-init
          image: hashicorp/boundary:latest
          command:
            - /bin/sh
            - "-c"
          args:
            - boundary database migrate -config /boundary/config.hcl || boundary database init -skip-auth-method-creation -skip-host-resources-creation -skip-scopes-creation -skip-target-creation -config /boundary/config.hcl || sleep 10000
          env:
            - name: HOSTNAME
              value: boundary
            - name: BOUNDARY_POSTGRES_URL
              value: postgresql://${vault:database/creds/boundary-db#username}:${vault:database/creds/boundary-db#password}@boundary-db.infra.local:5432/boundary
            - name: VAULT_TOKEN
              value: vault:login
          volumeMounts:
            - name: boundary-config
              mountPath: /boundary
              readOnly: true
            - name: vault-tls
              mountPath: /etc/ssl/certs/ca.crt
              subPath: ca.crt
              readOnly: false
      containers:
        - name: boundary-server
          image: hashicorp/boundary:latest
          command:
            - /bin/sh
            - "-c"
          args:
            - boundary server -config /boundary/boundary-config.hcl
          env:
            - name: HOSTNAME
              value: boundary
            - name: BOUNDARY_POSTGRES_URL
              value: postgresql://${vault:database/creds/boundary-db#username}:${vault:database/creds/boundary-db#password}@boundary-db.infra.local:5432/boundary
            - name: VAULT_TOKEN
              value: vault:login
          resources:
            limits:
              memory: "128Mi"
              cpu: "500m"
          ports:
          - containerPort: 9200
            name: api
          - containerPort: 9201
            name: cluster
          - containerPort: 9202
            name: data
          livenessProbe:
            httpGet:
              path: /
              port: api
          readinessProbe:
            httpGet:
              path: /
              port: api
          volumeMounts:
            - name: boundary-config
              mountPath: /boundary
              readOnly: true
            - name: vault-tls
              mountPath: /etc/ssl/certs/ca.crt
              subPath: ca.crt
              readOnly: false
          securityContext:
            capabilities:
              drop:
                - ALL
            readOnlyRootFilesystem: true
            runAsNonRoot: true
            runAsUser: 100
            runAsGroup: 1000
      volumes:
        - name: boundary-config
          configMap:
            name: boundary-config
        - name: vault-tls
          secret:
            secretName: vault-tls
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchExpressions:
                  - key: app.kubernetes.io/name
                    operator: In
                    values:
                      - boundary
              topologyKey: kubernetes.io/hostname

The only thing I can think of is that something either inside Kubernetes (like a namespace ResourceQuota) or outside it (like a really restrictive cgroup on the node) is putting a memory limit on containers running there. What does memory usage on the node itself look like? What kind of k8s cluster is it? kubectl describe on the OOMKilled pod, or kubectl get events, might give useful info.
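For example, something along these lines would surface a quota or the kill reason (namespace and pod names are placeholders, substitute your own):

# check for namespace-level limits and quotas
kubectl -n <namespace> get resourcequota,limitrange
# inspect the terminated init container's last state and reason
kubectl -n <namespace> describe pod <boundary-pod-name>
# recent events, oldest first
kubectl -n <namespace> get events --sort-by=.lastTimestamp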

I use a kOps-created cluster (v1.22.6) with containerd, have no ResourceQuotas set, and we have no issues with other apps using more memory. The logs indicate that it crashes during the migrate, not during the init. Describing the pod shows me:

  boundary-init:
    Container ID:  containerd://d1d0455d29153512f384646456a1457225d9fe46934382de519a885fca8a0d25
    Image:         hashicorp/boundary:0.7.6
    Image ID:      docker.io/hashicorp/boundary@sha256:edde120eb2db1873fe1bce587c20fe6a6e896ce3127a5abb7b1c36b49f76f68a
    Port:          <none>
    Host Port:     <none>
    Command:
      /vault/vault-env
    Args:
      /bin/sh
      -c
      boundary database migrate -config /boundary/config.hcl || boundary database init -skip-auth-method-creation -skip-host-resources-creation -skip-scopes-creation -skip-target-creation -config /boundary/config.hcl
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       OOMKilled
      Exit Code:    2
      Started:      Fri, 01 Apr 2022 08:19:23 -0500
      Finished:     Fri, 01 Apr 2022 08:19:25 -0500
    Ready:          False
    Restart Count:  764
    Requests:
      cpu:     1
      memory:  4G

And kubectl get events doesn’t give me anything more useful. The node averages under 20% resource utilization.
Capturing the memory usage has been difficult because the container crashes too quickly for node-exporter to capture the metrics.

The one thing that jumps out at me there is the use of vault-env. That’s the BanzaiCloud pipeline tool for Vault, right?

What does kubectl logs on the containers in the pod tell you? Do you get any output from the boundary database migrate command (i.e. is it getting as far as starting that up)? From vault-env before or after that?

I think the next thing to try is running the migrate command directly – hopefully either you won’t have this issue at all, or you might at least get some log or event output that will tell you more about what’s going on.

(If you’re using vault-env because of concerns around sensitive values in Boundary config data, you can create a config KMS key in a supported KMS like Vault Transit and use it with boundary config encrypt to encrypt those values – then give the controllers and workers access to that same key, and they will decrypt those values on startup.)
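As a rough sketch (assuming Vault Transit; the address, token handling, and key name below are placeholders), the config KMS stanza in config.hcl would look something like this, after which boundary config encrypt / boundary config decrypt operate on that file (check -h for the exact flags in your version):

kms "transit" {
  purpose    = "config"
  address    = "https://vault.infra.local:8200"   # placeholder Vault address
  token      = "s.EXAMPLE"                        # placeholder; normally injected via env var or file
  key_name   = "boundary-config"                  # hypothetical Transit key name
  mount_path = "transit/"
}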

Yes, vault-env is from BanzaiCloud, and yes, it is working successfully. I am able to see the logs (provided below). The vault-env process is successful because when I exec into the pod I can see the secrets in the running process’s environment. It fails almost immediately on the database migrate, but because I have it in a shell script it moves on to the database init, which then fails because the database has already been initialized.

The first 12 lines are vault-env specific:

➜  ~ (⎈ k8s.infra.local:default) kubectl logs boundary-98d5df9c7-6ncmj -f -c boundary-init
time="2022-04-04T13:13:04Z" level=info msg="received new Vault token" addr= app=vault-env path=kubernetes role=boundary
time="2022-04-04T13:13:04Z" level=info msg="initial Vault token arrived" app=vault-env
time="2022-04-04T13:13:04Z" level=info msg="renewed Vault token" app=vault-env ttl=1h0m0s
time="2022-04-04T13:13:04Z" level=info msg="secret database/creds/boundary-db has a lease duration of 3600s, starting renewal" app=vault-env
time="2022-04-04T13:13:04Z" level=info msg="spawning process: [/bin/sh -c boundary database migrate -config /boundary/config.hcl || boundary database init -skip-auth-method-creation -skip-host-resources-creation -skip-scopes-creation -skip-target-creation -config /boundary/config.hcl]" app=vault-env
time="2022-04-04T13:13:04Z" level=info msg="in daemon mode..." app=vault-env
time="2022-04-04T13:13:04Z" level=info msg="received signal: urgent I/O condition" app=vault-env
time="2022-04-04T13:13:04Z" level=info msg="secret database/creds/boundary-db renewed for 3600s" app=vault-env
time="2022-04-04T13:13:05Z" level=info msg="received signal: urgent I/O condition" app=vault-env
time="2022-04-04T13:13:05Z" level=error msg="watcher error" app=vault-env err="fsnotify queue overflow"
time="2022-04-04T13:13:05Z" level=error msg="watcher error" app=vault-env err="fsnotify queue overflow"
time="2022-04-04T13:13:05Z" level=info msg="received signal: urgent I/O condition" app=vault-env
Killed
{"id":"sjAj5IzOr1","source":"https://hashicorp.com/boundary/kubernetes-controller/boundary-database-init","specversion":"1.0","type":"error","data":{"error":"postgres.(Postgres).EnsureVersionTable: unknown, unknown: error #0: ERROR: relation \"boundary_schema_version\" already exists (SQLSTATE 42P07)","error_fields":{"Code":0,"Msg":"","Op":"postgres.(Postgres).EnsureVersionTable","Wrapped":{"Severity":"ERROR","Code":"42P07","Message":"relation \"boundary_schema_version\" already exists","Detail":"","Hint":"","Position":0,"InternalPosition":0,"InternalQuery":"","Where":"","SchemaName":"","TableName":"","ColumnName":"","DataTypeName":"","ConstraintName":"","File":"heap.c","Line":1182,"Routine":"heap_create_with_catalog"}},"id":"e_vqSya0w0k6","version":"v0.1","op":"postgres.(Postgres).EnsureVersionTable"},"datacontentype":"application/cloudevents","time":"2022-04-04T13:13:06.177090241Z"}
{"id":"rmYllucWBD","source":"https://hashicorp.com/boundary/kubernetes-controller/boundary-database-init","specversion":"1.0","type":"error","data":{"error":"schema.(Manager).runMigrations: postgres.(Postgres).EnsureVersionTable: unknown, unknown: error #0: ERROR: relation \"boundary_schema_version\" already exists (SQLSTATE 42P07)","error_fields":{"Code":0,"Msg":"","Op":"schema.(Manager).runMigrations","Wrapped":{"Code":0,"Msg":"","Op":"postgres.(Postgres).EnsureVersionTable","Wrapped":{"Severity":"ERROR","Code":"42P07","Message":"relation \"boundary_schema_version\" already exists","Detail":"","Hint":"","Position":0,"InternalPosition":0,"InternalQuery":"","Where":"","SchemaName":"","TableName":"","ColumnName":"","DataTypeName":"","ConstraintName":"","File":"heap.c","Line":1182,"Routine":"heap_create_with_catalog"}}},"id":"e_ajbz8qvxiX","version":"v0.1","op":"schema.(Manager).runMigrations"},"datacontentype":"application/cloudevents","time":"2022-04-04T13:13:06.177686091Z"}
{"id":"gB44Rl4KfF","source":"https://hashicorp.com/boundary/kubernetes-controller/boundary-database-init","specversion":"1.0","type":"error","data":{"error":"postgres.(Postgres).CommitRun: unknown, unknown: error #0: commit unexpectedly resulted in rollback","error_fields":{"Code":0,"Msg":"","Op":"postgres.(Postgres).CommitRun","Wrapped":{}},"id":"e_WuP8RGqnZH","version":"v0.1","op":"postgres.(Postgres).CommitRun"},"datacontentype":"application/cloudevents","time":"2022-04-04T13:13:06.178094475Z"}
{"id":"W3oDIbGgyE","source":"https://hashicorp.com/boundary/kubernetes-controller/boundary-database-init","specversion":"1.0","type":"error","data":{"error":"schema.(Manager).runMigrations: postgres.(Postgres).CommitRun: unknown, unknown: error #0: commit unexpectedly resulted in rollback","error_fields":{"Code":0,"Msg":"","Op":"schema.(Manager).runMigrations","Wrapped":{"Code":0,"Msg":"","Op":"postgres.(Postgres).CommitRun","Wrapped":{}}},"id":"e_ka9L9lUS7X","version":"v0.1","op":"schema.(Manager).runMigrations"},"datacontentype":"application/cloudevents","time":"2022-04-04T13:13:06.178306422Z"}
{"id":"UHUAKbnG7i","source":"https://hashicorp.com/boundary/kubernetes-controller/boundary-database-init","specversion":"1.0","type":"error","data":{"error":"schema.(Manager).ApplyMigrations: schema.(Manager).runMigrations: postgres.(Postgres).CommitRun: unknown, unknown: error #0: commit unexpectedly resulted in rollback","error_fields":{"Code":0,"Msg":"","Op":"schema.(Manager).ApplyMigrations","Wrapped":{"Code":0,"Msg":"","Op":"schema.(Manager).runMigrations","Wrapped":{"Code":0,"Msg":"","Op":"postgres.(Postgres).CommitRun","Wrapped":{}}}},"id":"e_mkpSkO4P5W","version":"v0.1","op":"schema.(Manager).ApplyMigrations"},"datacontentype":"application/cloudevents","time":"2022-04-04T13:13:06.178480227Z"}
Error running database migrations: schema.(Manager).ApplyMigrations: schema.(Manager).runMigrations: postgres.(Postgres).CommitRun: unknown, unknown: error #0: commit unexpectedly resulted in rollback

It looks like you’re running migrate before init, counting on migrate to fail without doing anything to an empty database so that init will then pick up after the || operator and successfully init the still-empty DB – but I’m not sure that’s guaranteed or even intended to work. (I do know that init will exit with a “did nothing” success code on an already-inited database.)

Let’s simplify the scenario here. What happens if you run this without trying to do the migrate || init control structure? Does init by itself succeed on your empty database? Does migrate by itself succeed on a previously-inited database?
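In other words, run each step on its own with the same config – init first against the empty database, then migrate against the inited one:

# against the empty database
boundary database init -skip-auth-method-creation -skip-host-resources-creation -skip-scopes-creation -skip-target-creation -config /boundary/config.hcl
# then, against the now-inited database
boundary database migrate -config /boundary/config.hcl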

init does work the first time and exits with a success message. Any subsequent runs of init fail with the error from my previous comment.
The migrate command always gets OOMKilled with no logs.

Ah, I just noticed this:

          resources:
            limits:
              memory: "128Mi"

in your manifest. Take that out and you’ll likely get better behavior.

Those resource limits are applied to the main container in the pod, not the boundary-init container, so I am surprised that removing them stops the init container from getting killed. It now logs this:

Database has not been initialized. Please use 'boundary database init' to
initialize the boundary database.
time="2022-04-06T12:50:36Z" level=info msg="received signal: urgent I/O condition" app=vault-env
{"id":"fy7H6WGjLu","source":"https://hashicorp.com/boundary/kubernetes-controller/boundary-database-init","specversion":"1.0","type":"error","data":{"error":"postgres.(Postgres).EnsureVersionTable: unknown, unknown: error #0: ERROR: relation \"boundary_schema_version\" already exists (SQLSTATE 42P07)","error_fields":{"Code":0,"Msg":"","Op":"postgres.(Postgres).EnsureVersionTable","Wrapped":{"Severity":"ERROR","Code":"42P07","Message":"relation \"boundary_schema_version\" already exists","Detail":"","Hint":"","Position":0,"InternalPosition":0,"InternalQuery":"","Where":"","SchemaName":"","TableName":"","ColumnName":"","DataTypeName":"","ConstraintName":"","File":"heap.c","Line":1182,"Routine":"heap_create_with_catalog"}},"id":"e_mGfgjWKuWP","version":"v0.1","op":"postgres.(Postgres).EnsureVersionTable"},"datacontentype":"application/cloudevents","time":"2022-04-06T12:50:36.988993925Z"}
{"id":"exDeVuhXEd","source":"https://hashicorp.com/boundary/kubernetes-controller/boundary-database-init","specversion":"1.0","type":"error","data":{"error":"schema.(Manager).runMigrations: postgres.(Postgres).EnsureVersionTable: unknown, unknown: error #0: ERROR: relation \"boundary_schema_version\" already exists (SQLSTATE 42P07)","error_fields":{"Code":0,"Msg":"","Op":"schema.(Manager).runMigrations","Wrapped":{"Code":0,"Msg":"","Op":"postgres.(Postgres).EnsureVersionTable","Wrapped":{"Severity":"ERROR","Code":"42P07","Message":"relation \"boundary_schema_version\" already exists","Detail":"","Hint":"","Position":0,"InternalPosition":0,"InternalQuery":"","Where":"","SchemaName":"","TableName":"","ColumnName":"","DataTypeName":"","ConstraintName":"","File":"heap.c","Line":1182,"Routine":"heap_create_with_catalog"}}},"id":"e_D66ae2Io92","version":"v0.1","op":"schema.(Manager).runMigrations"},"datacontentype":"application/cloudevents","time":"2022-04-06T12:50:36.98943749Z"}
{"id":"LuFdRKh3tE","source":"https://hashicorp.com/boundary/kubernetes-controller/boundary-database-init","specversion":"1.0","type":"error","data":{"error":"postgres.(Postgres).CommitRun: unknown, unknown: error #0: commit unexpectedly resulted in rollback","error_fields":{"Code":0,"Msg":"","Op":"postgres.(Postgres).CommitRun","Wrapped":{}},"id":"e_OaR1VBlkho","version":"v0.1","op":"postgres.(Postgres).CommitRun"},"datacontentype":"application/cloudevents","time":"2022-04-06T12:50:36.989947246Z"}
{"id":"m6FgmjjgrZ","source":"https://hashicorp.com/boundary/kubernetes-controller/boundary-database-init","specversion":"1.0","type":"error","data":{"error":"schema.(Manager).runMigrations: postgres.(Postgres).CommitRun: unknown, unknown: error #0: commit unexpectedly resulted in rollback","error_fields":{"Code":0,"Msg":"","Op":"schema.(Manager).runMigrations","Wrapped":{"Code":0,"Msg":"","Op":"postgres.(Postgres).CommitRun","Wrapped":{}}},"id":"e_YgLhGw7q63","version":"v0.1","op":"schema.(Manager).runMigrations"},"datacontentype":"application/cloudevents","time":"2022-04-06T12:50:36.990116792Z"}
{"id":"pjwvuhZbP8","source":"https://hashicorp.com/boundary/kubernetes-controller/boundary-database-init","specversion":"1.0","type":"error","data":{"error":"schema.(Manager).ApplyMigrations: schema.(Manager).runMigrations: postgres.(Postgres).CommitRun: unknown, unknown: error #0: commit unexpectedly resulted in rollback","error_fields":{"Code":0,"Msg":"","Op":"schema.(Manager).ApplyMigrations","Wrapped":{"Code":0,"Msg":"","Op":"schema.(Manager).runMigrations","Wrapped":{"Code":0,"Msg":"","Op":"postgres.(Postgres).CommitRun","Wrapped":{}}}},"id":"e_sFvYZOZuqX","version":"v0.1","op":"schema.(Manager).ApplyMigrations"},"datacontentype":"application/cloudevents","time":"2022-04-06T12:50:36.990275117Z"}
Error running database migrations: schema.(Manager).ApplyMigrations: schema.(Manager).runMigrations: postgres.(Postgres).CommitRun: unknown, unknown: error #0: commit unexpectedly resulted in rollback

I even drop all objects in the DB before letting it try to re-init.
In case it matters: we use dynamic DB creds for Boundary, which means that every time we restart the app it gets a new username/password.

Kubernetes init container limits take into account the limits placed on app containers in the same pod, per the rules on resource limits for init containers – although the way I read that section, it says the init container’s memory limit should be “no limit”, since none of your init containers has a memory limit specified. You might have hit a bug there, or the docs might not be accurate and the behavior you’re seeing is the intended one. If you set a higher limit on the init container, you might be able to put the limit back on the app container (or just put the app container’s limit back at a higher value).
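For example, with explicit resources on the init container (the values here are illustrative, not tuned):

      initContainers:
        - name: boundary-init
          # image, command, env, and volumeMounts as in your manifest
          resources:
            requests:
              memory: "512Mi"
              cpu: "250m"
            limits:
              memory: "1Gi"
              cpu: "1"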

As for the errors you’re seeing now: it looks like your database has some (residual?) Boundary content in it but isn’t a valid Boundary database – migrate says the database is not inited, yet init complains that the schema version relation already exists. I don’t get this behavior when I run migrate and then init against a blank database. Maybe try dropping the DB entirely and deleting the pod for this Deployment so it retries.
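Roughly like this – the host and database name are from your connection string, but the postgres admin user here is just a placeholder for whatever role administers that instance:

# drop and recreate the database with the boundary role as owner (placeholder admin user)
psql "postgresql://postgres@boundary-db.infra.local:5432/postgres" -c 'DROP DATABASE IF EXISTS boundary;'
psql "postgresql://postgres@boundary-db.infra.local:5432/postgres" -c 'CREATE DATABASE boundary OWNER boundary;'
# then delete the pod so the Deployment recreates it and the init container retries
kubectl delete pod -l app.kubernetes.io/name=boundary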

Oddly, if I run the same commands manually they work as intended, but they fail otherwise. If I remove the init container after manually initializing the database, the server crashes with the error: The database has not been initialized. Please run 'boundary database init'.

Hmmmm. This sounds like you’re getting some kind of unexpected behavior. What are your database logs showing you? Can you see in the Boundary server pod logs how it’s trying to access the database you inited?

It appears to be an issue with the dynamic database credentials being used. The database is created with the Postgres role boundary, but the tables are created by a dynamic user, and after initialization the database user is different again. It appears I can’t use dynamic creds, or I must first initialize the database with the parent role. Would boundary database init ever get a flag to tell it which user to set as owner for the objects it creates? This seems to be an issue only with PostgreSQL, as we use dynamic creds for MySQL for all our in-house apps.
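One possible workaround I’m looking at (just a sketch, not something I’ve verified against Boundary’s migrations): have Vault create each dynamic user as a member of the static boundary role that owns the schema, and default its sessions to that role, so objects created by init/migrate end up owned by boundary no matter which dynamic user ran them. The role name below matches the database/creds/boundary-db path from the manifest; db_name and the TTLs are placeholders:

# Vault database role whose dynamic users inherit and default to the "boundary" owner role
vault write database/roles/boundary-db \
    db_name=boundary-db \
    default_ttl=1h max_ttl=24h \
    creation_statements="CREATE ROLE \"{{name}}\" WITH LOGIN PASSWORD '{{password}}' VALID UNTIL '{{expiration}}' IN ROLE boundary; ALTER ROLE \"{{name}}\" SET ROLE boundary;"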