I’m trying to figure out how to restart or upgrade the Consul agent on a node without the services on that node going critical. Right now it seems impossible, but this seems like such a common use case that I feel I must be missing something.
To keep things simple: I have an application on multiple nodes, and some database servers. The database servers register a Consul service with a check. The applications are configured with consul-template and reload their config when the template updates (or, if you prefer, they poll DNS for changes; either works equally poorly here!). The application’s database connections are persistent.
The problem is this: I want to restart (reconfigure, or upgrade) the Consul agent on the writer database server.
Stopping is easy: I kill -9 the agent and it does not leave the cluster. The service remains passing in the catalog and the application servers stay healthy. But then I start Consul again, and all the checks immediately go critical. A minute or two (!) later the checks are passing again, but in the meantime every one of my application servers had its database connections interrupted simultaneously, even though the database was healthy the whole time!
I know Consul caches service check status across non-leave restarts for TTL checks, but not for any other check type.
Is there any way to accomplish this? I find my team postponing Consul upgrades and restarts because of the complexity of having to restart every client agent. Our database servers are stateful and do not lend themselves to immutable infrastructure.
(Our infrastructure is more complicated than this example, too, but the persistent DB connections are the best example of how this bites us.)
We do regular Consul upgrades and haven’t seen anything quite like what you’re describing (we have a mix of Windows and Linux agents). What we have seen is a few DNS cache misses when a restart coincides with a cache expiry, but even then our systems recover almost instantly.
Are you stopping the service for any significant period of time, or are you just restarting it after a config change?
Also, I was thinking about your scenario and trying to understand why your systems lose their DB connections on a Consul restart. Could you elaborate a little more on this use of consul-template? It may help shed some light on the underlying issue.
The problem is that when Consul starts, any service checks it knows about start off as critical. So we’re losing DB connections because anything that relies on that service sees it go critical.
In DNS, yes, you’d only see the response change, but templates are watch-based, so the application is told, in effect, “this database server is unhealthy”.
Consul-template here runs via Nomad, but the template contains the database host, and on template update we reload the service.
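For concreteness, the template side looks roughly like this Nomad template stanza (the service name, destination, and change mode are illustrative, not our exact config):

```hcl
template {
  data = <<EOT
{{ range service "db" }}
DB_HOST={{ .Address }}
DB_PORT={{ .Port }}
{{ end }}
EOT

  destination = "local/db.env"
  env         = true
  change_mode = "restart"  # this restart is what drops the persistent connections
}
```

Because the service function only returns passing instances by default, a momentary critical status renders an empty template and triggers the restart.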
You can configure the initial status of a service-level or node-level health check using the status field on the check definition. See https://www.consul.io/docs/discovery/checks#initial-health-check-status for examples.
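For reference, that field sits directly on the check definition; a minimal (illustrative) on-disk service definition would be:

```json
{
  "service": {
    "name": "db",
    "port": 5432,
    "check": {
      "tcp": "localhost:5432",
      "interval": "10s",
      "status": "passing"
    }
  }
}
```

Without the status field, the check starts in critical until the first interval passes.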
Right, but I don’t want to set the initial status; I want the current status carried across a restart. Maybe that particular instance really is critical, I don’t know; and the folks doing the Consul upgrade don’t know the status of every service registered on every node.
The best I can do right now leverages that initial status, but it means I’d have to write a restart wrapper that:
- collects all the services on the node from the catalog
- records their initial status setting and current status
- re-registers the service with the initial status = current status
- updates any on-disk configuration so that initial status = current status
- stops consul without leave
- starts consul
- re-registers all the API-registered services with their old initial status
- updates any on-disk configuration to their original initial status
And that’s a lot of work and a lot of edge cases for what should be a transparent restart, when Consul could do this for service checks the way it already does for TTL checks.
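To make the wrapper idea concrete, here’s a rough sketch covering steps 1–5 for a single service. The /v1/agent/checks endpoint is the standard agent HTTP API; the service ID "db", the config path, and the systemctl call are assumptions, and a real version would loop over every service and restore the original initial statuses afterwards:

```python
#!/usr/bin/env python3
"""Sketch of a Consul restart wrapper that carries check status across a restart."""
import json
import subprocess
import urllib.request

CONSUL = "http://127.0.0.1:8500"


def get_json(path):
    """Fetch and decode a JSON response from the local agent's HTTP API."""
    with urllib.request.urlopen(CONSUL + path) as resp:
        return json.load(resp)


def patch_initial_status(service_conf, status):
    """Return a copy of an on-disk service definition with the check's
    initial status set to the given current status."""
    conf = json.loads(json.dumps(service_conf))  # cheap deep copy
    conf["service"]["check"]["status"] = status
    return conf


def main():
    # 1-2. Collect every check on this agent and record its current status.
    checks = get_json("/v1/agent/checks")
    statuses = {c["ServiceID"]: c["Status"]
                for c in checks.values() if c["ServiceID"]}

    # 3. Rewrite the on-disk definition so the check starts in its current
    #    state. (Hypothetical path; repeat per service config file.)
    path = "/etc/consul.d/db.json"
    with open(path) as f:
        conf = json.load(f)
    conf = patch_initial_status(conf, statuses.get("db", "critical"))
    with open(path, "w") as f:
        json.dump(conf, f, indent=2)

    # 4-5. Restart the agent without a leave (assumes leave_on_terminate=false).
    subprocess.run(["systemctl", "restart", "consul"], check=True)

    # 6-7. Re-registering API-registered services and restoring the original
    #      initial statuses are left out of this sketch.


if __name__ == "__main__":
    main()
```

Even this simplified version glosses over API-registered services and failure handling mid-restart, which is exactly the edge-case surface I’d rather not own.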
Did you ever find a good solution to this? I have the same problem: when I upgrade Consul, all the templates get re-evaluated and a wormhole opens on the internet where my sites used to be for about 20 seconds…
Ideally I’d like to suspend template re-evaluation while the local agent is temporarily unreachable (or cache the KV result) for a few seconds.
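For the unreachable-agent part, consul-template’s own retry and wait settings can paper over brief outages and damp rapid flapping, though they don’t fix the checks-start-critical problem itself. A sketch (values are illustrative):

```hcl
consul {
  retry {
    enabled  = true
    attempts = 12
    backoff  = "250ms"
  }
}

# Debounce rendering: wait for the watched data to be stable for at least
# 5s (but no longer than 30s) before re-rendering templates.
wait {
  min = "5s"
  max = "30s"
}
```

The wait block at least keeps a single momentary flap from immediately re-rendering every template.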