Consul DNS not working after cluster reboot

We have a Consul cluster that went down when the hardware it runs on lost power. After it came back up, DNS queries stopped working. The services we’re trying to reach are up, but this cluster isn’t answering queries for .service.consul names. This is the primary datacenter of a federated pair of clusters. The secondary DC still responds to requests, and I can confirm that it answers queries for the primary DC accurately.

Although we normally use the iptables method to forward port 53 to Consul, I’ve been testing by querying Consul directly on port 8600. The port is up, and Consul logs that it received the DNS request, but it doesn’t return any services.

dig @127.0.0.1 -p 8600 consul.service.dc.consul

; <<>> DiG 9.11.4-P2-RedHat-9.11.4-26.P2.el7_9.5 <<>> @127.0.0.1 -p 8600 consul.service.consul
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 36388
;; flags: qr aa rd; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 1
;; WARNING: recursion requested but not available

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;consul.service.dc.consul.		IN	A

;; AUTHORITY SECTION:
consul.			0	IN	SOA	ns.consul. hostmaster.consul. 1647278528 3600 600 86400 0

;; Query time: 0 msec
;; SERVER: 127.0.0.1#8600(127.0.0.1)
;; WHEN: Mon Mar 14 13:22:08 EDT 2022
;; MSG SIZE  rcvd: 100

Consul logs this:

Mar 14 13:24:09 hostname consul[14097]: 2022-03-14T13:24:09.158-0400 [DEBUG] agent.dns: request served from client: name=consul.service.dc.consul. type=A class=IN latency=837.458µs client=127.0.0.1:60215 client_network=udp

The secondary DC logs a practically identical message, but answers the dig query correctly (even when querying for the primary DC):

dig @127.0.0.1 -p 8600 consul.service.dc.consul

; <<>> DiG 9.11.4-P2-RedHat-9.11.4-16.P2.el7_8.6 <<>> @127.0.0.1 -p 8600 consul.service.mtl2.consul
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 19490
;; flags: qr aa rd; QUERY: 1, ANSWER: 3, AUTHORITY: 0, ADDITIONAL: 1
;; WARNING: recursion requested but not available

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;consul.service.dc.consul.	IN	A

;; ANSWER SECTION:
consul.service.dc.consul. 0	IN	A	10.208.0.102
consul.service.dc.consul. 0	IN	A	10.208.0.206
consul.service.dc.consul. 0	IN	A	10.208.0.97

;; Query time: 2 msec
;; SERVER: 127.0.0.1#8600(127.0.0.1)
;; WHEN: Mon Mar 14 13:29:24 EDT 2022
;; MSG SIZE  rcvd: 103

Since it was working before the reboot, I figured the cause might be some ephemeral change we had made to the iptables rules, but the problem also occurs when querying Consul directly. Unfortunately, the debug output doesn’t tell me whether it’s having trouble returning a list of servers or why it returned what it did, so I’m stuck on why DNS is failing to return anything.

For anyone who comes across this kind of issue in the future, the problem for me turned out to be ACLs. TL;DR: the agent token does not govern DNS; only the default token does.
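As a rough sketch of what the default token needs in order to answer DNS lookups, a read-only policy along these lines should be enough. The empty prefixes (match everything) are an assumption for illustration, not what we actually run; narrow them to your own nodes and services.

```hcl
# Hypothetical read-only policy for the token set as acl.tokens.default.
# An empty prefix matches every node or service; tighten as needed.
node_prefix "" {
  policy = "read"
}

service_prefix "" {
  policy = "read"
}

# Only needed if you resolve prepared queries over DNS.
query_prefix "" {
  policy = "read"
}
```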

It also turns out that if you dig against a cluster, you don’t see in the logs what gets dropped unless you’re looking at the server that actually answers the DNS lookup. With dns_config.allow_stale set to true (the default), any server in the cluster can answer. If you set it to false, only the leader will answer, and you can see in the leader’s logs whether or not something gets dropped. It looks something like this:
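For reference, a minimal sketch of the agent setting for this kind of troubleshooting, assuming an HCL agent config file. Per the Consul documentation, allow_stale = false sends DNS lookups through the leader, so the drop messages land in one predictable place. It is probably worth reverting afterwards, since stale reads spread DNS load across the servers.

```hcl
# Temporary troubleshooting setting in the agent configuration.
dns_config {
  # false: DNS lookups are served by the leader only.
  # true (the default): any server may answer from its own state.
  allow_stale = false
}
```

The "dropping node ... due to ACLs" lines are logged at DEBUG level, so the log level has to be DEBUG on the server you are watching.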

[DEBUG] consul: dropping node "your-server" from result due to ACLs

We had moved off the deprecated acl_agent_token field, and it appears that during the refactor someone cleaned up the fields by removing the current acl.tokens.default and replacing it with acl.tokens.agent, which governs only agent actions, not DNS. The documentation for DNS ACLs states that the permissions need to be on the default token, but I mistakenly assumed that the agent token would be enough.
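A minimal sketch of what the fixed agent config looks like, assuming HCL and that ACLs are enabled in the same block. The token values are placeholders, not real secrets, and your file will likely contain other ACL settings as well.

```hcl
acl {
  enabled = true

  tokens {
    # Applied to requests that carry no token of their own,
    # which includes DNS lookups. This is the one DNS depends on.
    default = "<default-token-secret>" # placeholder

    # Used by the agent for its own internal operations; does not cover DNS.
    agent = "<agent-token-secret>" # placeholder
  }
}
```

Both tokens can be set side by side, so adding acl.tokens.agent is not a reason to drop acl.tokens.default.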