When watching a lease, it's not getting revoked

I apologize for the length of this; I’m not entirely sure what’s relevant and what’s not, so I wanted to provide as much context as possible.

I’m working on a Go module that wraps a database connection with ephemeral database roles generated by Vault. It presents as a database/sql/driver.Connector, acquiring new credentials from Vault on each Connect() call when they’re needed (on first connect, or after lease expiration) and reusing the existing ones otherwise. It uses a LifetimeWatcher to keep track of when the credentials are expiring. I had to add another channel to trigger fetching new credentials when the token the app is using expires and is replaced, which surprised me a bit, but it seems to work.
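For concreteness, the Connector has roughly this shape (a simplified sketch with illustrative names, not the actual module code):

package vaultdb

import (
    "context"
    "database/sql/driver"
    "sync"

    vault "github.com/hashicorp/vault/api"
    "github.com/lib/pq"
)

type Connector struct {
    mu       sync.Mutex
    client   *vault.Client
    credPath string // e.g. "database/creds/my-role" (illustrative)
    baseDSN  string // DSN without user/password
    dsn      string // DSN including the current Vault-issued credentials
    invalid  bool   // set when the lease expires or the app's token changes
}

func (c *Connector) Connect(ctx context.Context) (driver.Conn, error) {
    c.mu.Lock()
    if c.dsn == "" || c.invalid {
        // Read fresh credentials from the database secrets engine.
        secret, err := c.client.Logical().Read(c.credPath)
        if err != nil {
            c.mu.Unlock()
            return nil, err
        }
        user, _ := secret.Data["username"].(string)
        pass, _ := secret.Data["password"].(string)
        c.dsn = c.baseDSN + " user=" + user + " password=" + pass
        c.invalid = false
        // A LifetimeWatcher is started on `secret` here to track the lease.
    }
    dsn := c.dsn
    c.mu.Unlock()
    return pq.Open(dsn)
}

func (c *Connector) Driver() driver.Driver { return &pq.Driver{} }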

I’m now adding support for Postgres’ LISTEN command, via the Listener type in github.com/lib/pq. In the course of this, I discovered that existing database connections were not severed when their credentials were revoked, so I set the revocation_statements to the following:

SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE usename = '{{name}}';
DROP ROLE IF EXISTS "{{name}}";

so that the connections would be forcibly severed. For “normal” (non-LISTEN) connections, this doesn’t have much effect, since the connections don’t typically stay open for long anyway (pretty low volume of calls).
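For reference, the role is configured roughly like this in my test setup (a sketch: the mount path, role name, db_name, and creation statement are placeholders, and the 1-second TTLs are just what I use to keep test leases short):

package vaultdb

import vault "github.com/hashicorp/vault/api"

// configureRole writes the database role with the revocation statements shown
// above; all names and paths here are illustrative.
func configureRole(client *vault.Client) error {
    _, err := client.Logical().Write("database/roles/my-role", map[string]interface{}{
        "db_name": "my-postgres",
        "creation_statements": `CREATE ROLE "{{name}}" WITH LOGIN PASSWORD '{{password}}' VALID UNTIL '{{expiration}}';`,
        "revocation_statements": `SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE usename = '{{name}}';
DROP ROLE IF EXISTS "{{name}}";`,
        "default_ttl": "1s",
        "max_ttl":     "1s",
    })
    return err
}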

For Listener connections, though, I’m getting some odd behavior. If I just grab credentials from Vault on the first connection and then again every time the Listener tries to reconnect and fails (the pq code sends an event when that happens, giving me the opportunity to get new creds before reconnecting), then that’s exactly what happens when the credentials’ TTL is reached: the DB backend is killed, the listener is disconnected, it attempts to reconnect and fails due to bad credentials, I fetch new credentials, it reconnects, and everything is good. The downside is that this takes a little time, so there’s a window where I can miss a notification.
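For context, the listener wiring has roughly this shape (a simplified sketch, not my actual code; refreshCredentials stands in for the Vault credential fetch, and how the new DSN gets back into the listener is elided):

package main

import (
    "log"
    "time"

    "github.com/lib/pq"
)

// refreshCredentials fetches fresh credentials from Vault and returns a DSN
// built from them (placeholder implementation).
func refreshCredentials() string {
    return "host=localhost dbname=mydb user=... password=..."
}

func main() {
    dsn := refreshCredentials()

    callback := func(ev pq.ListenerEvent, err error) {
        switch ev {
        case pq.ListenerEventConnectionAttemptFailed:
            // The old role was revoked and dropped, so the reconnect failed;
            // this is the point where new credentials are fetched. (Getting
            // the new DSN back into the listener means rebuilding it or
            // using a custom dialer; that part is elided here.)
            log.Printf("reconnect failed: %v", err)
            refreshCredentials()
        case pq.ListenerEventDisconnected:
            log.Printf("listener disconnected: %v", err)
        }
    }

    l := pq.NewListener(dsn, 100*time.Millisecond, 10*time.Second, callback)
    defer l.Close()

    if err := l.Listen("my_channel"); err != nil {
        log.Fatal(err)
    }

    for n := range l.Notify {
        if n == nil {
            continue // pq sends nil after re-establishing the connection
        }
        log.Printf("notification on %s: %s", n.Channel, n.Extra)
    }
}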

Instead, I’d like to set up a new listener prior to expiration, so that I can cover that window. So I changed my listener code to use the same LifetimeWatcher code that I’m using for the Connector to tell me when that’s going to happen. Everything just worked, until I looked under the covers. I get the first renewal notification (on watcher.RenewCh()) immediately, as always (odd behavior, but it hasn’t been a problem), and then again three-quarters of the way through each lease, until the max TTL is reached, all as expected.
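For concreteness, the watcher loop looks roughly like this (a sketch with illustrative names rather than my actual code):

package vaultdb

import (
    "log"

    vault "github.com/hashicorp/vault/api"
)

// watchLease renews the lease on the given secret and calls rotate once the
// watcher is done (stopped, renewal error, or max TTL reached), so a
// replacement listener can be set up; rotate is an illustrative hook.
func watchLease(client *vault.Client, secret *vault.Secret, rotate func()) error {
    watcher, err := client.NewLifetimeWatcher(&vault.LifetimeWatcherInput{
        Secret: secret,
    })
    if err != nil {
        return err
    }
    go watcher.Start()
    defer watcher.Stop()

    for {
        select {
        case renewal := <-watcher.RenewCh():
            // Fires once shortly after Start() and then after each
            // successful renewal, partway through each lease.
            log.Printf("lease renewed at %v", renewal.RenewedAt)
        case err := <-watcher.DoneCh():
            // The watcher is finished: it was stopped, a renewal failed, or
            // the lease hit its max TTL and can't be renewed any further.
            if err != nil {
                log.Printf("lease renewal ended with error: %v", err)
            }
            rotate() // e.g. set up the replacement listener / mark creds invalid
            return err
        }
    }
}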

But when I get a notification on watcher.DoneCh(), and I set my Connector up to fetch new credentials the next time through, my listener just keeps on listening, and its connection is never terminated. The Vault trace logs don’t show the revocation, and the Postgres logs don’t show the revocation SQL statements coming through, either. I put some debugging statements into handleRevokeRenew() in sdk/framework/backend.go to see what’s going on, and I never see the revocation operation; it’s only ever renew. (And even then the renews only happen up to the point where the revocation should happen.)

I don’t really understand why having the watcher running would make a difference here. I expect there’s a bug in my code somewhere, or at least a misunderstanding of how to use the Vault Go API properly, but I’m not sure what to look for. I also haven’t yet confirmed that the revocations were happening in the non-listener context, but the testing I did suggested that they were.

The code is not mine to share, but I will ask, if that’ll be useful. I’d like to open-source this, anyway. But I suspect that this problem will either sound familiar to someone and they can explain what I must be doing wrong, or I’ll just have to find it myself.

This is all with Vault 1.3.2, running in a test suite that uses a setup similar to how Vault’s own tests work.

Thanks!
Danek

Hi there!

Thanks for providing such a detailed description. Since I don’t have access to the source code showing how you’re calling and referencing the API methods, my response might be incomplete, but I’ll try to address your points inline as best I can.

But when I get a notification on watcher.DoneCh(), and I set my Connector up to fetch new credentials the next time through, my listener just keeps on listening, and its connection is never terminated.

You can only receive from the LifetimeWatcher.DoneCh() channel when the watcher either 1) stops, i.e. the renewal loop breaks because the stopCh was closed, which sends back nil, or 2) hits an error renewing the token, which sends back the actual error. In either case, having your Connector fetch new credentials based on what you get from this channel would not work, as the renewer would never return a renewed token. I don’t have your side of the source code to make a solid recommendation, but one possible action here, whenever you receive from DoneCh, would be to log the error (if one is returned) and possibly call watcher.Start() again if you want to retry the renewal (though that may simply fail again for the original reason).
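In code, that handling might look roughly like this (purely illustrative; it assumes the watcher has already been started):

package main

import (
    "log"

    vault "github.com/hashicorp/vault/api"
)

// handleDone blocks until the already-started watcher finishes, then logs
// whatever DoneCh delivered.
func handleDone(watcher *vault.LifetimeWatcher) {
    if err := <-watcher.DoneCh(); err != nil {
        log.Printf("lease renewal stopped with error: %v", err)
        // A retry could go here, e.g. by creating and starting a fresh
        // watcher, though it may just fail again for the original reason.
        return
    }
    log.Printf("lease renewal stopped cleanly (watcher was stopped)")
}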

Postgres logs don’t show the revocation SQL statements coming through, either. I put some debugging statements into handleRevokeRenew() in sdk/framework/backend.go to see what’s going on, and I never see the revocation operation; it’s only ever renew. (And even then the renews only happen up to the point where the revocation should happen.)

The LifetimeWatcher does not perform any revocation, which is why you don’t see any revocation-related calls being triggered when logging in handleRevokeRenew. Revocation of the secret is handled independently, inside Vault’s expiration manager, whenever the token or the secret it generated expires, i.e. when its TTL reaches 0. If you’re testing, you could set the TTL of the created secrets, via the role’s max_ttl, to a short value to observe this more quickly.
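If it helps while debugging, you could also ask Vault directly what the expiration manager currently thinks of the lease via sys/leases/lookup. A rough sketch, where the lease ID comes from the secret you read when the credentials were issued:

package main

import (
    "log"

    vault "github.com/hashicorp/vault/api"
)

// lookupLease asks Vault's expiration manager about a lease and logs when it
// is due to expire.
func lookupLease(client *vault.Client, leaseID string) error {
    resp, err := client.Logical().Write("sys/leases/lookup", map[string]interface{}{
        "lease_id": leaseID,
    })
    if err != nil {
        return err
    }
    log.Printf("lease %v expires at %v (ttl %v, renewable %v)",
        resp.Data["id"], resp.Data["expire_time"], resp.Data["ttl"], resp.Data["renewable"])
    return nil
}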

Thanks for your reply; I apologize for not responding sooner, but I got caught up in another project last week and didn’t have the cycles.

In either case, having your Connector fetch new credentials based on what you get from this channel would not work, as the renewer would never return a renewed token. I don’t have your side of the source code to make a solid recommendation, but one possible action here, whenever you receive from DoneCh, would be to log the error (if one is returned) and possibly call watcher.Start() again if you want to retry the renewal (though that may simply fail again for the original reason).

Right; when I get something back on DoneCh(), I log it (whether it’s an error or nil), stop the watcher, and set a flag indicating that the credentials are invalid, so the next time something tries to connect, new credentials are acquired, and a new LifetimeWatcher is fired up. So far, I don’t think I’ve seen any errors, so I’m always going through this process when we’re simply not allowed to renew the lease any longer.

The LifetimeWatcher does not perform any revocation, which is why you don’t see any revocation-related calls being triggered when logging in handleRevokeRenew.

That seems obvious in retrospect; I was tired and flailing around a bit, searching for where the revocation statements were being executed.

Revocation of the secret is handled independently, inside Vault’s expiration manager, whenever the token or the secret it generated expires, i.e. when its TTL reaches 0. If you’re testing, you could set the TTL of the created secrets, via the role’s max_ttl, to a short value to observe this more quickly.

That’s exactly what I have. I’m running this in a test, so I have the ttl for the role set to 1, and max_ttl to 1 as well, since a beat of once a second is fine for the test (actually, it’d be awesome to set shorter timeouts, since this makes the Vault-related tests the longest in my suite, but it looks like Vault doesn’t allow for that).

I’ll ping the folks in charge of confirming that I can post the code publicly, and try to make that available soon.

I’ve posted the code at https://github.com/dhduvall/vaultdb. I haven’t yet extracted its dependencies on code that isn’t yet open, but the relevant code is there. The test that demonstrates the issue is testCredentialRevocation. I’ll try to get it into a runnable condition soon.