Nomad not renewing Vault database token lease

I’m having a strange issue where Nomad doesn’t seem to renew a token lease when using the Vault database secrets engine for PostgreSQL.
I can see in my task environment that the token is still present (screenshot omitted), but when I look in Postgres the relation owner has reverted to miniflux, which is my placeholder NOLOGIN user for when the service isn't running:

miniflux=# \d
                List of relations
 Schema |       Name        |   Type   |  Owner
--------+-------------------+----------+----------
 public | acme_cache        | table    | miniflux
 public | api_keys          | table    | miniflux
 public | api_keys_id_seq   | sequence | miniflux
 public | categories        | table    | miniflux
 public | categories_id_seq | sequence | miniflux
 public | enclosures        | table    | miniflux
 public | enclosures_id_seq | sequence | miniflux
 public | entries           | table    | miniflux
 public | entries_id_seq    | sequence | miniflux
 public | feed_icons        | table    | miniflux
 public | feeds             | table    | miniflux
 public | feeds_id_seq      | sequence | miniflux
 public | icons             | table    | miniflux
 public | icons_id_seq      | sequence | miniflux
 public | integrations      | table    | miniflux
 public | schema_version    | table    | miniflux
 public | sessions          | table    | miniflux
 public | sessions_id_seq   | sequence | miniflux
 public | user_sessions     | table    | miniflux
 public | users             | table    | miniflux
 public | users_id_seq      | sequence | miniflux

If I restart the miniflux service, Vault sets the relation owner back to the dynamic user tied to the token, but within a few hours it reverts to miniflux again.

I have this vault policy set up for the service:

path "postgres/creds/miniflux" {
  capabilities = ["read"]
}

and in the job I’m specifying this vault stanza:

  vault {
    policies    = ["miniflux"]
    change_mode = "restart"
  }

and of course templating the credentials in environment variables:

      template {
        destination = "secrets/env"
        change_mode = "restart"
        env         = true
        data        = <<-EOH
          {{ with secret "postgres/creds/miniflux" -}}
          DATABASE_URL=postgres://{{ .Data.username }}:{{ .Data.password }}@postgres.service.consul/miniflux?sslmode=disable
          {{- end }}
        EOH
      }

My Nomad servers are configured with these Vault parameters according to the Vault integration instructions here:

vault {
  enabled          = true
  address          = "http://active.vault.service.consul:8200"
  token            = "<token>"
  create_from_role = "nomad-cluster"
}

I can only see this error in the Nomad logs:

Jan 04 12:03:44 nas nomad[975273]:     2023-01-04T12:03:44.856+0100 [INFO]  agent: (runner) rendered "(dynamic)" => "/opt/nomad/data/alloc/24a10ded-cb1a-d600-6638-99c6a867a6b8/miniflux/secrets/env"
Jan 04 12:03:44 nas nomad[975273]:     2023-01-04T12:03:44.857+0100 [INFO]  agent: (runner) stopping
Jan 04 12:03:44 nas nomad[975273]:     2023-01-04T12:03:44.857+0100 [INFO]  agent: (runner) creating new runner (dry: false, once: false)
Jan 04 12:03:44 nas nomad[975273]:     2023-01-04T12:03:44.857+0100 [INFO]  agent: (runner) received finish
Jan 04 12:03:44 nas nomad[975273]:     2023-01-04T12:03:44.857+0100 [INFO]  agent: (runner) creating watcher
Jan 04 12:03:44 nas nomad[975273]:     2023-01-04T12:03:44.858+0100 [INFO]  agent: (runner) starting
Jan 04 12:03:44 nas nomad[975273]:     2023-01-04T12:03:44.961+0100 [INFO]  client.driver_mgr.docker: created container: driver=docker container_id=e7446c8177098b00d96e4366b69be7dcaee2991fe8637dbcb0c2b48d0867ac2c
Jan 04 12:03:45 nas nomad[975273]:     2023-01-04T12:03:45.140+0100 [INFO]  agent: (runner) rendered "(dynamic)" => "/opt/nomad/data/alloc/24a10ded-cb1a-d600-6638-99c6a867a6b8/miniflux/secrets/env"
Jan 04 12:03:45 nas nomad[975273]:     2023-01-04T12:03:45.276+0100 [INFO]  client.driver_mgr.docker: started container: driver=docker container_id=e7446c8177098b00d96e4366b69be7dcaee2991fe8637dbcb0c2b48d0867ac2c
Jan 04 12:03:52 nas nomad[975273]:     2023-01-04T12:03:52.771+0100 [ERROR] nomad.event_broker: failed resolving ACL for secretID, closing subscriptions: error="ACL token not found"

The above is quite ambiguous to me, though; I'm not sure which ACL token the error refers to.
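One way to get more signal is to inspect the lease from the Vault side. Assuming the postgres mount from the post and an operator token with access to sys/leases (the token is an assumption about your environment), something like:

```shell
# List outstanding lease IDs for the miniflux role
# (requires a token allowed to list under sys/leases/lookup).
vault list sys/leases/lookup/postgres/creds/miniflux

# Inspect one lease's remaining TTL and expiry time;
# <lease_suffix> is a placeholder for an ID from the list above.
vault lease lookup postgres/creds/miniflux/<lease_suffix>
```

If the lease has disappeared rather than merely being close to expiry, the role's revocation statements running would explain the owner reverting.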

The lease renewal failed again overnight, for 3 out of 4 services, all of which run with the same configuration (the same Vault policy rules, the same DB creation/renewal/revocation statements, and the same vault and template stanzas in the job, apart from the service names). I manage all of the Vault resources with Terraform, so I know they are fully identical.

What's even more curious is that my vaultwarden service didn't use to have this issue; it ran fine for more than a year. After redeploying it with the same config, it has now also started behaving like this, with the lease not being renewed. So something appears to have changed at the Nomad/Vault level that causes renewals to fail after a redeploy, even with a previously working config?

The last service that is still successfully getting its tokens renewed is my instance of huginn. I can do a test later and redeploy it, since I haven't changed anything in that service's config in quite some time either, and see if it also starts to fail.

Is this a known issue or have I missed something?


I'm a bit frustrated by this at the moment as well. There doesn't seem to be any information in the documentation about how Nomad handles Vault leases, so I just assumed it would "do the right thing", but it doesn't.

I had to write a shell script that renews/refreshes the creds, runs the main process in the background, and kills and restarts it every time a new token comes down. There has got to be a nicer way to do this.
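For what it's worth, a minimal sketch of that kind of wrapper, under the assumption that Nomad re-renders secrets/env whenever new credentials arrive (the file path, poll interval, and the miniflux binary are all placeholders for your own task):

```shell
#!/bin/sh
# Sketch: restart the app whenever the rendered credentials file changes.
# CREDS and APP are assumptions; adjust them for your own setup.
CREDS="secrets/env"
APP="/usr/local/bin/miniflux"

last=""
while true; do
  current=$(cksum "$CREDS" 2>/dev/null)
  if [ "$current" != "$last" ]; then
    last="$current"
    # Kill the previous instance (if any) and start a fresh one
    # with the newly rendered environment.
    [ -n "$pid" ] && kill "$pid" 2>/dev/null
    set -a; . "$CREDS"; set +a   # export DATABASE_URL etc. from the rendered file
    "$APP" & pid=$!
  fi
  sleep 10
done
```

This polls rather than watching the file, which keeps it POSIX-portable at the cost of up to one poll interval of delay.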

Interesting. Does this mean we cannot rely on Nomad to renew database credential leases? I too could not find any information about the lease renewal process in Nomad.

The only mitigation, it seems, is to use really long leases then?
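If so, the TTLs are set on the database role itself. Since the earlier poster manages the Vault resources with Terraform, a hedged sketch using the vault_database_secret_backend_role resource (the names mirror the thread, the TTL values are examples, and rewriting the role means re-supplying your existing statements, which are placeholders below):

```hcl
resource "vault_database_secret_backend_role" "miniflux" {
  backend = "postgres"
  name    = "miniflux"
  db_name = "miniflux" # assumption: the connection name matches the role

  # Keep your existing creation/renewal/revocation statements here.
  creation_statements = ["-- your existing CREATE ROLE statements --"]

  # TTLs are in seconds: 24h default lease, 72h maximum (example values).
  default_ttl = 86400
  max_ttl     = 259200
}
```

A longer max_ttl only delays the problem, though; once max_ttl is reached the lease cannot be renewed further and new credentials must be issued.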