We are using an older version of Consul (v1.3.1) in a multi-node configuration (no docker, no k8s) and are seeing some behavior that may be correct, but still seems counter-intuitive.
Each node runs a number of largely independent services, and every node has a similar set of services. Those services have a fairly robust set of checks associated with them (script, TTL, HTTP). The problem is that when any single service check reports critical, DNS resolution fails not just for that service instance but for every service running on that node.
If we change the status of the failing check to warning, we get DNS records for the node again. We could certainly configure the checks to deregister critical services, but the minimum timeout for that appears to be 1 minute, and a full minute with no DNS at all seems painful.
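For reference, this is the deregistration knob we looked at, set per check in the service definition. A minimal sketch (service name, port, and endpoint are made up for illustration):

```json
{
  "service": {
    "name": "my-service",
    "port": 8080,
    "check": {
      "id": "my-service-http",
      "http": "http://localhost:8080/health",
      "interval": "10s",
      "deregister_critical_service_after": "1m"
    }
  }
}
```

Even with `"1m"` configured, the agent reaps critical services on its own cycle, so the actual DNS outage can be longer than the configured value.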
Again, our configuration of a set of independent services may be an atypical deployment, but it seems strange to me that Consul would infer that a single service being bad means that everything else running on that node is also bad.
Is this something that has been changed in subsequent versions of Consul (and perhaps not documented)? Can this behavior be altered through the configuration somehow?
Unless I am missing something my options seem to be:
- Have something hit the Consul API every few seconds and flip any critical checks to warning so DNS comes back.
- Run script checks for everything and have them return warning instead of critical.
- Have something periodically hit the Consul API and deregister any critical checks/services more rapidly than the built-in deregistration minimum allows.
All those options seem somewhat terrible, so if there is a better choice I am all ears.
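For what it's worth, the first option could be sketched roughly as below. This assumes a local agent on the default port, and note the caveat that `/v1/agent/check/warn` only really sticks for TTL checks; script and HTTP checks will overwrite the status on their next run, so this is a stopgap at best:

```python
import json
import urllib.request

# Assumption: local Consul agent on the default HTTP port.
CONSUL_ADDR = "http://127.0.0.1:8500"


def critical_check_ids(agent_checks):
    """Given the dict returned by /v1/agent/checks, return the IDs of
    checks currently reporting critical."""
    return [cid for cid, c in agent_checks.items()
            if c.get("Status") == "critical"]


def fetch_agent_checks():
    """Fetch all checks registered with the local agent."""
    with urllib.request.urlopen(f"{CONSUL_ADDR}/v1/agent/checks") as resp:
        return json.load(resp)


def warn_check(check_id):
    """Force a check into warning state. Only TTL checks keep this
    status; script/HTTP checks reset it on their next execution."""
    req = urllib.request.Request(
        f"{CONSUL_ADDR}/v1/agent/check/warn/{check_id}",
        data=b"", method="PUT")
    urllib.request.urlopen(req)


if __name__ == "__main__":
    # Run this every few seconds (cron, systemd timer, loop) to keep
    # node DNS alive while a service is unhealthy.
    for cid in critical_check_ids(fetch_agent_checks()):
        warn_check(cid)
```

Running this on a timer papers over the problem rather than fixing it, which is why I'd rather find a configuration-level answer.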