Consul as DNS server in production - deployment architecture

We have a production environment with around 15 servers that host about 30 IIS-based REST services. We are looking at using Consul as a DNS server to look up those services. It is not clear from the deployment reference (https://learn.hashicorp.com/consul/datacenter-deploy/day1-deploy-intro) or the other documentation what the best approach is for consuming the DNS servers.

It is clear we must host 3 - 5 Consul server instances for resiliency, and that there can be as many clients as we need, up into the thousands. What is not clear is how the applications should interact with the DNS server(s).

Would the recommended deployment be

  1. Run a Consul client on each of the 15 application servers, with each client binding to the default loopback address "127.0.0.1", and add 127.0.0.1 as the DNS server on each application server so it talks to its local Consul client.

or

  2. Run a smaller set of Consul client agents, say 5, on some of the 15 servers, and have these clients bind to their machine IP address using client_addr. The applications would then use any of the 3 Consul servers and 5 clients as DNS servers to look up the name mappings. (Rough config sketches for both options are shown after this list.)
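
To make the two options concrete, here is roughly what the agent configuration might look like in each case. This is only a sketch; the datacenter name, data_dir, and all IP addresses are placeholders for our environment.

```hcl
# Option 1: a client agent on every application server (placeholder values).
datacenter = "dc1"
data_dir   = "C:/consul/data"
retry_join = ["10.0.0.11", "10.0.0.12", "10.0.0.13"]  # the 3-5 Consul servers
# client_addr defaults to 127.0.0.1, so the agent serves DNS on 127.0.0.1:8600;
# the OS resolver (or a local forwarder) on this machine points at that address.
```

```hcl
# Option 2: one of ~5 shared client agents, reachable over the network (placeholder IP).
datacenter  = "dc1"
data_dir    = "C:/consul/data"
retry_join  = ["10.0.0.11", "10.0.0.12", "10.0.0.13"]
client_addr = "10.0.0.21"  # this machine's IP, so other servers can query DNS at 10.0.0.21:8600
```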

I suppose both get the job done. The first will probably be faster but means more instances to maintain; the second means fewer nodes to keep available, but is potentially slower since lookups go over the network.

There are some more complexities around where to run the agents than just where to point each server's DNS resolver config.

Both scenarios you mention would certainly provide DNS services, but neither is inherently more performant than the other. Consul's DNS handler internally makes RPC requests to the servers to do the actual service discovery, so in the scenario where you have a mix of clients and servers providing DNS, it may actually be quicker to send the DNS request directly to the servers. If you send the request to a client that is not on the same machine there will be a small amount of extra latency, but in most cases it won't be noticeable.
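
If you do point resolvers straight at the servers, the server agents need their DNS interface listening on a non-loopback address. A minimal sketch, assuming placeholder addresses and Consul's default DNS port:

```hcl
# Server agent sketch (addresses and bootstrap_expect are placeholders).
server           = true
bootstrap_expect = 3
client_addr      = "10.0.0.11"  # expose DNS/HTTP on the machine IP instead of loopback
ports {
  dns = 8600  # Consul's default DNS port; a forwarder can relay port 53 queries here
}
```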

The real reason to run one client per physical server is service registration and health checking. All the services running on that node can be registered with the local agent instead of having to be registered with a non-local agent. Consul will automatically detect if the node ever goes offline and mark its health checks as failing. Additionally, because services are registered with their local agent rather than a non-local one, the node a non-local agent runs on can be restarted without impacting the services on other nodes.
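
For example, registering one of the IIS services with its local agent might look something like the service definition below. The service name, port, and health endpoint are made up for illustration.

```hcl
# Dropped into the local agent's config directory (hypothetical service).
service {
  name = "orders-api"
  port = 8080

  check {
    http     = "http://localhost:8080/health"  # the local agent polls this endpoint
    interval = "10s"
    timeout  = "2s"
  }
}
```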

In conclusion, going with one client agent per node is the more typical scenario, and the agents are lightweight enough that it shouldn't be a problem. There are scenarios and users that cannot do this for various reasons; they would usually use consul-esm to monitor the services running on nodes where client agents cannot be run.

Hopefully that helps.

Thanks for the info, Matt. We have a similar need for health checks and service registration, so a client per server should work for us too.