Vault LifetimeWatcher - Token not renewed

Hi everyone,

I am currently trying to set up a service that uses NewLifetimeWatcher to renew its token.

Environment

  • Vault API: github.com/hashicorp/vault/api v1.9.2
  • Golang: 1.20

Observations

  • When doing some local testing with short TTLs (e.g. 1m, 5m, 30m) the renewer behaves fine:
     vault token renewed successfully at: 2023-07-18 08:16:56.659702289 +0000 UTC
     vault token renewed successfully at: 2023-07-18 08:17:39.307710154 +0000 UTC
     ...
    
  • If the TTL is longer (e.g. 768h), the token is not always renewed. Subsequent requests then get a 403 and my service fails, which makes creating and injecting a new token necessary.

Issues

Sometimes the token just expires, without anything useful showing up in my logs.

  • Is there any way to debug this properly? Unfortunately I was not able to find any logging hook or similar in lifetime_watcher.go (or anywhere else for that matter); the only caller-side workaround I can think of is polling the token's TTL - see the sketch after this list.
  • Could there be anything else wrong in my implementation (see code snippet below)?
    I at least suspect that something in lifetime_watcher.go, perhaps the calculateSleepDuration() logic, plays a role here.
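
For reference, the caller-side sketch mentioned above: poll the token periodically and log its remaining TTL. This is purely my own debugging aid (the interval and log messages are arbitrary, and it assumes the same vault api / logrus / time imports as the snippet below), not something the vault/api package provides for the watcher - it just makes a silently stopped renewal visible in the logs:

func monitorTokenTTL(client *api.Client, interval time.Duration) {
    ticker := time.NewTicker(interval)
    defer ticker.Stop()

    for range ticker.C {
        secret, err := client.Auth().Token().LookupSelf()
        if err != nil {
            // A 403 here is the first hint that the token has already expired.
            logrus.Errorf("vault token lookup failed: %v", err)
            continue
        }
        ttl, err := secret.TokenTTL()
        if err != nil {
            logrus.Errorf("could not read vault token ttl: %v", err)
            continue
        }
        logrus.Infof("vault token ttl remaining: %s", ttl)
    }
}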

Code Snippet


func main() {
    watcher, err := lifetimeWatcher()
    if err != nil {
        logrus.Errorf("error setting up vault token lifetime watcher: %v", err)
        return
    }
    go watcher.Start()
    defer watcher.Stop()

    go func() {
        for {
            select {
            case err := <-watcher.DoneCh():
                // DoneCh fires once the watcher has stopped renewing for good;
                // nothing in this loop restarts it.
                if err != nil {
                    logrus.Errorf("error when renewing vault token: %v", err)
                }
            case renewal := <-watcher.RenewCh():
                logrus.Infof("vault token renewed successfully at: %v", renewal.RenewedAt)
            }
        }
    }()

    scheduler.block()
}

func lifetimeWatcher() (*api.LifetimeWatcher, error) {
    // client is the shared *api.Client (its setup is omitted from this snippet).
    secret, err := client.Auth().Token().LookupSelf()
    if err != nil {
        return nil, err
    }

    if ok, _ := secret.TokenIsRenewable(); !ok {
        return nil, fmt.Errorf("secret is not renewable")
    }

    // Use the token's creation_ttl as the renewal increment.
    var increment int64
    if v, ok := secret.Data["creation_ttl"]; ok {
        if n, ok := v.(json.Number); ok {
            increment, err = n.Int64()
            if err != nil {
                return nil, err
            }
        }
    }

    // Renew once on start, as the LookupSelf secret does not carry auth info.
    s, err := client.Auth().Token().RenewSelf(int(increment))
    if err != nil {
        return nil, err
    }

    secret.Auth = s.Auth

    return client.NewLifetimeWatcher(&api.LifetimeWatcherInput{
        Secret:    secret,
        Increment: int(increment),
    })
}

Thanks a lot.

Since you can only reproduce the issue with longer TTLs, it sounds like it would be sensible to fork lifetime_watcher.go and add logging to it, to have more visibility as to what is going on.

As you’ve observed, it has no built-in instrumentation.

alright - will do.
thanks for the suggestion.

I now did as suggested (and since the issue only happens with longer token lifetimes this took a while).

I have some observations that confuse me.

In api/lifetime_watcher.go I added log statements of the following form:

[api] <some debug message>

Now, regarding the output over the last few weeks:

  1. the service was started at ~2023-08-22 12:51:08 and begins by renewing its token right after start:
vault token renewed successfully at: 2023-08-22 12:51:08.988585023 +0000 UTC
  2. a little later the API logging kicked in:
2023-08-22 @ 13:04:04.000 [api] doRenewWithOptions: initialTime=2023-08-22 13:04:04.031317424 +0200 CEST m=+0.160561419, priorDuration=668h16m35s

2023-08-22 @ 13:04:04.000 [api] renewal channel: RenewOutput time=2023-08-22 11:04:04.043371326 +0000 UTC secret=[leaseID= leaseDuration=0 data=map[] authLeaseDuration=2405795]

2023-08-22 @ 13:04:04.000 [api] renewal loop: initLeaseDuration=2405795 remainingLeaseDuration=668h16m35s

2023-08-22 @ 13:04:04.000 [api] errorBackoff nil, priorDuration=668h16m35s sleepDuration=468h0m56.675463134s

so with the sleep duration of 468h0m56.675463134s (roughly 19 days and 12 hours), the next renewal should happen around 2023-09-11 01:05 CEST
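
To double-check that number: my reading of calculateSleepDuration() is that the sleep is two thirds of the remaining lease plus one third of the grace. Treat that formula as an assumption rather than a quote of the library code, but it reproduces the logged values:

package main

import (
    "fmt"
    "time"
)

func main() {
    // Values taken from the log lines above.
    remaining, _ := time.ParseDuration("668h16m35s") // remainingLeaseDuration (= 2405795s)
    grace, _ := time.ParseDuration("67h29m40s")      // grace, as logged on 2023-09-11
    start := time.Date(2023, 8, 22, 13, 4, 4, 0, time.FixedZone("CEST", 2*60*60))

    // Assumed formula: 2/3 of the remaining lease plus 1/3 of the grace.
    sleep := time.Duration(float64(remaining.Nanoseconds())*2/3 + float64(grace.Nanoseconds())/3)

    fmt.Println(sleep)            // ~468h0m56s, matching the logged sleepDuration
    fmt.Println(start.Add(sleep)) // ~2023-09-11 01:05 CEST, matching the next log entry
}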

  3. the next logs indeed show up right around that time:
2023-09-11 @ 01:05:00.000 [api] renewal channel: RenewOutput time=2023-09-10 23:05:00.835052223 +0000 UTC secret=[leaseID= leaseDuration=0 data=map[] authLeaseDuration=720939]

2023-09-11 @ 01:05:00.000 [api] renewal loop: initLeaseDuration=720939 remainingLeaseDuration=200h15m39s

2023-09-11 @ 01:05:00.000 [api] within grace remainingLeaseDuration=200h15m39s sleepDuration=156h0m19.3421298s grace=67h29m40.026389402s

for the last log line I seem to land in the latter condition of the doRenewWithOptions function:

if remainingLeaseDuration <= r.grace || remainingLeaseDuration-sleepDuration <= r.grace {
  return nil
}

The second clause is true here (200h15m39s - 156h0m19s ≈ 44h15m, which is below the 67h29m40s grace), so return nil kicks in.
If I understand the watcher correctly, that nil is what eventually arrives on DoneCh(), i.e. the watcher is done and there will be no more renewals.
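
Side note, assuming my reading above is correct: once that nil arrives on DoneCh(), nothing will ever renew the token again, so the consuming loop from my first post would have to rebuild the watcher itself. A rough sketch (reusing the lifetimeWatcher() helper from the first post and assuming the same vault api, logrus and time imports):

func keepRenewing() {
    for {
        watcher, err := lifetimeWatcher()
        if err != nil {
            logrus.Errorf("error setting up vault token lifetime watcher: %v", err)
            time.Sleep(10 * time.Second) // arbitrary retry delay
            continue
        }
        go watcher.Start()

    watch:
        for {
            select {
            case err := <-watcher.DoneCh():
                // The watcher has stopped for good (renewal error, or the
                // remaining lease fell within the grace period as seen above).
                if err != nil {
                    logrus.Errorf("vault token renewal stopped: %v", err)
                }
                watcher.Stop()
                break watch // build a fresh watcher
            case renewal := <-watcher.RenewCh():
                logrus.Infof("vault token renewed successfully at: %v", renewal.RenewedAt)
            }
        }
    }
}

That would not explain the stop itself, it would just make it recoverable.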

  4. Renewal should have happened by ~09:20 this morning at the latest (2023-09-11 01:05 + 200h15m39s = ~2023-09-19 09:20) - but it didn't; I confirmed by checking the token against Vault, which now gives me a 403.

With this I am wondering: why is this condition there, ending renewal for good? Am I missing something here?

Thanks a lot!

(Please note that I am in UTC+2, which explains the two-hour difference visible in some places.)