Nomad job doesn't renew consul token generated using nomad job templates and consul secrets engine

We have a Nomad fabio job that communicates with Consul using a token generated by the Consul secrets engine in Vault. We are following the exact steps in https://learn.hashicorp.com/tutorials/nomad/vault-postgres, adapted to the fabio job.

The job template looks like this:

template {
  data = <<EOH
{{ with secret "consul/creds/service-test" }}
registry.consul.token = {{ .Data.token }}
{{ end }}
registry.consul.kvpath = /path-to-config
registry.consul.noroutehtmlpath = /path-to-noroute
proxy.addr =
ui.addr =
registry.consul.register.addr =
metrics.target =
metrics.statsd.addr =
metrics.prefix =
metrics.names =
EOH
  destination = "secrets/fabio.properties"
}
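
For context, a template that reads from Vault typically only renders if the task also carries a vault stanza granting a policy that can read consul/creds/service-test. The fragment below is a minimal sketch of how the two stanzas might sit together. The task name, the policy name consul-service-test, and the change_mode values are assumptions for illustration and are not taken from the original job; driver and config are omitted.

```hcl
task "fabio" {
  # driver and config omitted; this is a fragment, not a complete task

  vault {
    # Hypothetical policy name; it would need read access to
    # consul/creds/service-test
    policies = ["consul-service-test"]

    # What Nomad does to the task if the task's Vault token is replaced
    change_mode = "restart"
  }

  template {
    data = <<EOH
{{ with secret "consul/creds/service-test" }}
registry.consul.token = {{ .Data.token }}
{{ end }}
EOH

    destination = "secrets/fabio.properties"

    # What Nomad does to the task when the rendered file changes,
    # for example when a new Consul token is issued
    change_mode = "restart"
  }
}
```

With change_mode set to "restart", a re-rendered token should reach the process only by restarting the task, which matters for programs that read their properties file once at startup.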

We would expect Nomad to take care of renewing the Consul token when the lease associated with the token expires, but we don’t see that happening in our environment. Is this the expected behavior? Or does token renewal have to be done explicitly by running the command below?

vault read consul/creds/service-test

@tgross have you seen this one before? It’s following the same format as the PG example, as the author stated.

In this case it seems the token isn’t getting renewed and since there’s no way to put a watch on that specific token it seems even the PG example above would start to fail after 1 hour

Is Nomad supposed to handle renewal of secret engine tokens?

No, it should happen automatically once a fraction of your lease time/TTL is left. What is the value of task_token_ttl in your vault stanza? Is it less than the TTL of the token you’ve generated?
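
For anyone looking for where that setting lives: task_token_ttl belongs to the vault stanza of the Nomad agent configuration, not to the job file. The fragment below is a sketch only. The address, the role name, and the TTL value are placeholders chosen for illustration and do not come from this thread.

```hcl
# Nomad agent configuration (server side), not the jobspec
vault {
  enabled = true

  # Placeholder address; substitute your own Vault endpoint
  address = "https://vault.example.com:8200"

  # Placeholder role name used when deriving task tokens
  create_from_role = "nomad-cluster"

  # TTL of the Vault token handed to each task (illustrative value)
  task_token_ttl = "1h"
}
```

Note that this governs the Vault token given to the task. The lease on the Consul token itself comes from the Consul secrets engine, so the two TTLs are set in different places.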

This is curious:

The job uses the template stanza's vault integration to populate the JSON configuration file that the application needs. The underlying tool being used is Consul Template. You can use Consul Template's documentation to learn more about the syntax needed to interact with Vault.

When I recently used Vault’s integration with Consul Template, I had to explicitly include renew = true in the associated vault stanza. But that seems to be missing from the example. I wonder what would happen if you included it anyway.

That’s interesting. Yes, consul-template needs that added explicitly, so maybe that’s something to try here. However, I see that this setting is part of the vault stanza in consul-template’s own configuration, while the Nomad documentation only talks about the template stanza using the same syntax as consul-template.

https://www.nomadproject.io/docs/job-specification/vault - This doesn’t show that option.
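
To make the distinction concrete, this is roughly what the setting looks like in a standalone consul-template configuration file. It is a sketch under assumptions: the address is a placeholder, and recent consul-template releases appear to name the option renew_token rather than renew, so check the README for the version you run.

```hcl
# Standalone consul-template configuration, not a Nomad jobspec
vault {
  # Placeholder address; substitute your own Vault endpoint
  address = "https://vault.example.com:8200"

  # Ask consul-template to renew the Vault token it was started with
  renew_token = true
}
```

This option concerns the Vault token consul-template authenticates with. Nothing equivalent is listed in the Nomad vault stanza, which is consistent with Nomad managing the task's Vault token itself.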

Ah, but note this from the page on the template stanza:

For a full list of the API template functions, please refer to the Consul Template README.

What confuses me is this line from the vault stanza page:

If Nomad is unable to renew the Vault token (perhaps due to a Vault outage or network error), the client will attempt to retrieve a new Vault token.

Details on this behaviour are missing, including whether it is configurable.

Thanks @jlj7 - it does seem like there’s some vagueness around this area in terms of the documentation. Hopefully someone from the Nomad and/or Consul-Template team can weigh in too. Right now, it seems like this may be a bug though.

@jlj7 - got it to work and posted details here: https://github.com/hashicorp/nomad/issues/9491

Basically it does seem like Nomad and Consul-Template are renewing (and restarting a container when the token completely expires).

I can only assume the issue was related to one of the following:

  • The default value of 768h was so long that a watcher never fired, or just flat out died and never recovered (given this was seen on multiple servers across multiple clusters, it seems unlikely).
  • Not having an explicit max-lease-ttl and default-lease-ttl doesn’t play nicely.

It’s a bit of a guess but the fact that this was seen across a number of clusters (all completely isolated) around about the same time makes me think there’s something amiss with the blank defaults.
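
For reference, 768h is Vault's documented default for both the default and maximum lease TTL. One way to make them explicit is in the Vault server configuration file, sketched below. The values are illustrative only and are not the ones used in this thread.

```hcl
# Vault server configuration: system-wide lease defaults
# (illustrative values; both default to "768h" when unset)
default_lease_ttl = "1h"
max_lease_ttl     = "24h"
```

These are system-wide settings. A single mount such as consul/ can be tuned separately with the -default-lease-ttl and -max-lease-ttl flags of vault secrets tune, which is the form the option names above refer to.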

@idrennanvmware I was able to reproduce the issue again with explicit lower default ttl and max-lease-ttl.

The watcher is doing its job just fine. The problem is that Nomad does not persist state (at least template-related state; I haven’t looked anywhere else) across client restarts.

So when the Nomad client restarts, the lease issued by the previous client session leaks (no one revokes it; it simply expires when it reaches its TTL).
The new client session then follows the normal procedure for starting a NEW job, mainly:

  1. run prehooks
  2. run templates and initialize watcher
  3. start containers (already running)

So it will template a new VALID token into the container, but since the container was already running, the process inside the container still holds the old token. The Nomad template watcher then keeps the new lease up to date, but when the old lease expires, the process running inside the container starts to fail.

I don’t think this is intended behavior… But solving this issue correctly seems to require adding state persistence to Nomad.

I’d like to hear what the Nomad team has to say, @tgross.