We would expect Nomad to take care of renewing the Consul token when the lease associated with the token expires, but we don’t see that happening in our environment. Is this the expected behavior, or does token renewal have to be done explicitly by calling the command below?
@tgross have you seen this one before? It’s following the same format as the PG example, as the author stated.
In this case it seems the token isn’t getting renewed, and since there’s no way to put a watch on that specific token, it seems even the PG example above would start to fail after 1 hour.
Is Nomad supposed to handle renewal of secret engine tokens?
No, it should happen automatically once a fraction of your lease time/TTL is left. What is the value of task_token_ttl in your vault stanza? Is it less than the TTL of the token you’ve generated?
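For reference, here’s a minimal sketch of the agent-side vault stanza with task_token_ttl set explicitly; the address, role name, and values are placeholders, not taken from your setup:

```hcl
# Nomad agent configuration (server/client) -- sketch only, values are placeholders.
vault {
  enabled          = true
  address          = "https://vault.example.com:8200"
  create_from_role = "nomad-cluster"   # assumed role name
  task_token_ttl   = "72h"             # TTL of the tokens Nomad derives for tasks;
                                       # compare this against the TTL of the token you generated
}
```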
This is curious:
The job uses the template stanza's vault integration to populate the JSON configuration file that the application needs. The underlying tool being used is Consul Template. You can use Consul Template's documentation to learn more about the syntax needed to interact with Vault.
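For anyone following along, a minimal sketch of what that template-stanza integration typically looks like in a job file; the secrets-engine path, role, policy, and JSON field below are placeholders rather than the author’s actual job spec:

```hcl
task "app" {
  driver = "docker"

  config {
    image = "example/app:latest"   # placeholder image
  }

  vault {
    policies = ["consul-creds"]    # placeholder Vault policy
  }

  template {
    # Renders the JSON config the application reads from a Vault-backed secret;
    # the secrets path and field names are illustrative only.
    data = <<EOH
{{ with secret "consul/creds/my-role" }}
{ "consul_token": "{{ .Data.token }}" }
{{ end }}
EOH
    destination = "secrets/config.json"
    change_mode = "restart"
  }
}
```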
When I recently used Vault’s integration with Consul Template, I had to explicitly include renew = true in the associated vault stanza. But that seems to be missing from the example. I wonder what would happen if you included it anyway.
That’s interesting. Yes, consul-template needs that added explicitly, so maybe that’s something to try here. Although I see that this setting is part of the vault stanza inside consul-template’s own configuration, while the documentation talks about using the same syntax as the template stanza inside consul-template.
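For comparison, a rough sketch of the vault block in a standalone consul-template configuration; the renewal key has appeared as renew in older docs and renew_token in newer ones, so treat the exact name (and the paths below) as something to verify against your version:

```hcl
# consul-template.hcl -- standalone consul-template, not a Nomad job; sketch only.
vault {
  address     = "https://vault.example.com:8200"
  renew_token = true   # the renewal setting discussed above (older docs call it `renew`)
}

template {
  # Placeholder secrets-engine path and destination.
  contents    = "{{ with secret \"consul/creds/my-role\" }}{{ .Data.token }}{{ end }}"
  destination = "/etc/app/consul.token"
}
```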
Thanks @jlj7 - it does seem like there’s some vagueness in the documentation around this area. Hopefully someone from the Nomad and/or Consul-Template team can weigh in too. Right now it seems like this may be a bug, though.
Basically it does seem like Nomad and Consul-Template are renewing the token (and restarting the container when the token completely expires).
I can only assume the issue was related to one of the following:
- The default value of 768h was so long that a watcher never fired, or it just flat-out died and never recovered (given this was seen on multiple servers across multiple clusters, that seems unlikely).
- Not having an explicit max-lease-ttl and default-lease-ttl doesn’t play nicely.
It’s a bit of a guess, but the fact that this was seen across a number of clusters (all completely isolated) at around the same time makes me think there’s something amiss with the blank defaults.
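If anyone wants to rule that out, the mount TTLs can be set and inspected explicitly; the mount path and values here are examples, not from the thread:

```shell
# Set explicit lease TTLs on the Consul secrets engine mount (example values).
vault secrets tune -default-lease-ttl=1h -max-lease-ttl=24h consul/

# Check what the mount is actually using.
vault read sys/mounts/consul/tune
```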
@idrennanvmware I was able to reproduce the issue again with an explicit, lower default-lease-ttl and max-lease-ttl.
The watcher is doing its job just fine. The problem is that Nomad does not persist state (at least template-related state; I haven’t looked at other places) across a client restart.
So when the Nomad client restarts, the lease issued by the previous client session leaks (nothing revokes it; it will only expire when it reaches its TTL).
And the new client session will follow the normal NEW job start procedure, mainly:
- run prehooks
- run templates and initialize the watcher
- start containers (already running)
So it will template a new, valid token into the container, but since the container was already running, the process inside it still holds the old token. What happens then is that the Nomad template watcher keeps the new lease up to date, but when the old lease expires, the process running inside the container starts to fail.
I don’t think this should be the intended behavior… but solving this issue correctly seems to require adding state persistence to Nomad.
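In the meantime, a lease leaked by the old client session can at least be found and revoked by hand; the path and role below are placeholders, and listing under sys/leases/lookup needs the appropriate (sudo) capability:

```shell
# List outstanding leases for the role (placeholder path), then revoke the leaked one.
vault list sys/leases/lookup/consul/creds/my-role
vault lease revoke consul/creds/my-role/<lease_id>
```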