I’m trying to figure out how to restart or upgrade the Consul agent on a node without the services on that node going critical. Right now it seems impossible, but this is such a common use case that I feel I must be missing something.
To keep things simple: I have an application on multiple nodes, and some database servers. The database servers register a Consul service with a check. The applications are configured with consul-template, and reload their config when the template updates (or, if you prefer, they poll DNS for changes; either works equally poorly here!). The application’s database connections are persistent.
The problem is this: I want to restart (reconfigure, or upgrade) the Consul agent on the writer database server.
Stopping is easy: I kill -9 the agent and it does not leave the cluster. The service remains passing in the catalog and the application servers are healthy. But then I start Consul, and all the checks immediately go critical. A minute or two (!) later, the checks are passing again. But in the meantime, all of my application servers had their database connections interrupted simultaneously even though the database was healthy!
I know Consul caches service check status across non-leave restarts for TTL checks, but not for any other checks.
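To make the TTL case concrete, here is a minimal, untested sketch of registering a service with a TTL check and heartbeating it through the local agent's HTTP API. It assumes the agent listens on the default 127.0.0.1:8500 with no ACL token; the service name, port, check ID and helper function names are made up for illustration and are not our real setup.

```python
import json
import urllib.request

# Assumption: default local agent HTTP address, no ACL token required.
AGENT = "http://127.0.0.1:8500"


def ttl_service_definition(name, port, ttl="30s"):
    """Build a service registration whose only check is a TTL check."""
    return {
        "Name": name,
        "Port": port,
        "Check": {
            "CheckID": f"{name}-ttl",          # hypothetical check ID
            "Name": f"{name} TTL heartbeat",
            "TTL": ttl,
        },
    }


def agent_put(path, body=None):
    """Send a PUT to the local agent and return the HTTP status code."""
    data = json.dumps(body).encode() if body is not None else None
    req = urllib.request.Request(
        AGENT + path,
        data=data,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=5) as resp:
        return resp.status


def register(definition):
    """Register the service (and its TTL check) with the local agent."""
    return agent_put("/v1/agent/service/register", definition)


def heartbeat(check_id):
    """Mark the TTL check as passing; must be called more often than the TTL."""
    return agent_put(f"/v1/agent/check/pass/{check_id}")
```

Something outside the agent (a cron job or sidecar, say) would call register(ttl_service_definition("db-writer", 5432)) once and then heartbeat("db-writer-ttl") on an interval shorter than the TTL. That moves the actual health probing out of Consul, which is the part I would rather not have to do.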
Is there any way to accomplish this? I find my team is postponing Consul upgrades and restarts because of the complexity of having to restart every client agent. Our database servers are stateful and do not lend themselves to immutable infrastructure.
(Our infrastructure is more complicated than this example, too, but the persistent DB connections are the best example of how this bites us.)