We have a Consul cluster that went down when the hardware it runs on lost power. After it came back up, DNS queries stopped working. The services that we’re trying to reach are up, but this cluster isn’t answering .service.consul queries. This is the primary datacenter of a federated pair of clusters. The secondary DC still responds to requests, and I can confirm that it answers queries for the primary DC accurately.
Although we normally use the iptables method to forward port 53 to Consul, I’ve been testing by querying Consul directly on port 8600. The port is up, and Consul logs that it received the DNS request, but the response doesn’t contain any services.
dig @127.0.0.1 -p 8600 consul.service.dc.consul
; <<>> DiG 9.11.4-P2-RedHat-9.11.4-26.P2.el7_9.5 <<>> @127.0.0.1 -p 8600 consul.service.consul
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 36388
;; flags: qr aa rd; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 1
;; WARNING: recursion requested but not available
;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;consul.service.dc.consul. IN A
;; AUTHORITY SECTION:
consul. 0 IN SOA ns.consul. hostmaster.consul. 1647278528 3600 600 86400 0
;; Query time: 0 msec
;; SERVER: 127.0.0.1#8600(127.0.0.1)
;; WHEN: Mon Mar 14 13:22:08 EDT 2022
;; MSG SIZE rcvd: 100
Consul logs this:
Mar 14 13:24:09 hostname consul[14097]: 2022-03-14T13:24:09.158-0400 [DEBUG] agent.dns: request served from client: name=consul.service.dc.consul. type=A class=IN latency=837.458µs client=127.0.0.1:60215 client_network=udp
The second DC logs a practically identical message, but its dig query resolves correctly (even when querying for the primary DC):
dig @127.0.0.1 -p 8600 consul.service.dc.consul
; <<>> DiG 9.11.4-P2-RedHat-9.11.4-16.P2.el7_8.6 <<>> @127.0.0.1 -p 8600 consul.service.mtl2.consul
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 19490
;; flags: qr aa rd; QUERY: 1, ANSWER: 3, AUTHORITY: 0, ADDITIONAL: 1
;; WARNING: recursion requested but not available
;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;consul.service.dc.consul. IN A
;; ANSWER SECTION:
consul.service.dc.consul. 0 IN A 10.208.0.102
consul.service.dc.consul. 0 IN A 10.208.0.206
consul.service.dc.consul. 0 IN A 10.208.0.97
;; Query time: 2 msec
;; SERVER: 127.0.0.1#8600(127.0.0.1)
;; WHEN: Mon Mar 14 13:29:24 EDT 2022
;; MSG SIZE rcvd: 103
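To compare the two responses side by side, the status and section counts can be pulled out of the dig header lines. The following is only a minimal Python sketch: `parse_dig_header` is a hypothetical helper name, and it assumes the header layout shown in the two outputs above.

```python
import re

# Hypothetical helper, not part of dig or Consul. It assumes the
# ";; ->>HEADER<<-" and ";; flags:" lines look like the ones pasted above.
STATUS_RE = re.compile(r"status: (\w+)")
COUNTS_RE = re.compile(
    r"QUERY: (\d+), ANSWER: (\d+), AUTHORITY: (\d+), ADDITIONAL: (\d+)"
)


def parse_dig_header(output: str) -> dict:
    """Return the response status and section counts from dig's text output."""
    status = STATUS_RE.search(output)
    counts = COUNTS_RE.search(output)
    if status is None or counts is None:
        raise ValueError("no dig header found in output")
    query, answer, authority, additional = (int(n) for n in counts.groups())
    return {
        "status": status.group(1),
        "query": query,
        "answer": answer,
        "authority": authority,
        "additional": additional,
    }
```

Fed the primary DC's output, this reports NXDOMAIN with zero answers and one authority record (the SOA), while the secondary DC's output reports NOERROR with three answers.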
Since it was working before the reboot, I figured the cause might be some ephemeral change that we had made to the iptables rules, but we’re having this issue even when querying Consul directly. Unfortunately, the debug log doesn’t tell me whether Consul is having trouble returning a list of servers, or why it returned what it did, so I’m kind of stuck troubleshooting why DNS is failing to return anything.
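One way to narrow this down might be to compare what the DNS interface returns with what the agent's HTTP API reports for the same service, since the DNS interface only returns healthy instances. The sketch below is in Python and rests on assumptions that are not confirmed anywhere in this post: that the HTTP API listens on the default 127.0.0.1:8500, and that no ACL token is required. The function names are hypothetical. The `/v1/health/service/<name>` endpoint with the `passing` and `dc` parameters is part of Consul's HTTP API.

```python
import json
import urllib.parse
import urllib.request

# Assumption: the local agent's HTTP API is on the default address and port.
DEFAULT_BASE = "http://127.0.0.1:8500"


def service_dns_name(service, datacenter=None, domain="consul"):
    """Build the DNS name Consul serves for a service lookup."""
    parts = [service, "service"]
    if datacenter:
        parts.append(datacenter)
    parts.append(domain)
    return ".".join(parts) + "."


def health_url(service, datacenter=None, base=DEFAULT_BASE):
    """Build the health endpoint URL, filtered to passing instances."""
    params = {"passing": "true"}
    if datacenter:
        params["dc"] = datacenter
    path = "/v1/health/service/" + urllib.parse.quote(service)
    return base + path + "?" + urllib.parse.urlencode(params)


def passing_addresses(service, datacenter=None, base=DEFAULT_BASE):
    """Ask the HTTP API which instances are passing their health checks."""
    with urllib.request.urlopen(health_url(service, datacenter, base), timeout=5) as resp:
        entries = json.load(resp)
    # Fall back to the node address when the service has no address of its own.
    return [e["Service"]["Address"] or e["Node"]["Address"] for e in entries]
```

Calling `passing_addresses("consul", "dc")` on a primary DC server and comparing the result with the dig answer for `service_dns_name("consul", "dc")` would show whether the catalog itself has no passing instances or whether only the DNS path is affected.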