After updating to latest Consul I'm getting "error serializing DNS results" errors in my logs

After updating my Consul agents this morning I’m seeing a flood of errors in my logs of two types:

[ERROR] agent.dns: error serializing DNS results: error="no data"

and

[ERROR] agent.dns: error processing discovery query: error="not found"

These errors are coming from the agents on the compute nodes rather than the server quorum nodes. I assume they have something to do with DNS requests, but I have no clue what might be causing them.

Anyone have any ideas?

Hello,

I have exactly the same problem.
I was on version 1.18.2, and even after an upgrade to version 1.19, this message persists.

For example, if I try to resolve this:
..service.consul.

the quorum log reports this message.

I welcome your feedback.

I also see this error on Consul 1.19, and I’m not sure what is causing it either.

We had a similar problem after upgrading to v1.19, and service lookups didn’t work correctly. v1.19 has some known issues with DNS: 1.19.x | Consul | HashiCorp Developer.

Upgrading to v1.19.1 resolved the problems for us. See: Release v1.19.1 · hashicorp/consul · GitHub

So I did some digging today. This is all on version 1.19.1.

If you add --log-level=debug to your startup command, you’ll get to see where this error comes from.
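For anyone unsure where that flag goes, it’s just appended to the agent invocation (the config path here is only an example; adjust to your setup):

$ consul agent --config-dir=/etc/consul.d --log-level=debug

With debug logging on, I can now see: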

2024-07-17T18:04:29.437-0700 [ERROR] agent.dns: error serializing DNS results: error="no data"
2024-07-17T18:04:29.437-0700 [DEBUG] agent.dns: no data available: name=myservice.service.consul.

So for some reason, this node is answering that it doesn’t have any DNS data for myservice. What’s weird is that myservice is definitely a legitimate service, and the consul node knows about it:

$ dig @127.0.0.1 -p 8600 +short myservice.service.consul
100.104.105.106

It’s also not specific to one service; different services are named (seemingly at random), though the affected services appear to be a subset of all the services we have.
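In case anyone wants a quick cross-check, the stock CLI can list everything the catalog knows about:

$ consul catalog services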

I was wondering if there was some race condition, so I ran this in a loop:

$ while true; do  dig @127.0.0.1 -p 8600 +short myservice.service.consul; sleep 1; done

and I always seem to get results, even as I watch the Consul node output this message.

Here’s my config, if that’s at all helpful to anyone who stumbles upon this:

advertise_addr = "x.x.x.x"
advertise_addr_ipv4 = "x.x.x.x"

auto_reload_config = true
bind_addr = "0.0.0.0"
bootstrap_expect = 6
check_update_interval = "60s"
client_addr = "0.0.0.0"
data_dir = "/consul"
datacenter = "dc1"

dns_config = {
  allow_stale = true
  max_stale = "45s"

  service_ttl {
    "*" = "60s"
  }
  node_ttl = "300s"

  only_passing = true
}

autopilot {
  min_quorum = 4.0
}

retry_join = ["node1.internal", "node2.internal", "node3.internal", "node4.internal", "node5.internal", "node6.internal"]
server = true

node_name = "node1"

ui_config = {
  enabled = true
}

I’m at a loss as to what the root cause is here, but I’m at least becoming convinced this error is mostly a red herring and doesn’t actually affect anything (as far as I can tell). I’d still love to know what’s causing it, though.

I figured it out!

I had a service that was asking Consul for an IPv6 address (an AAAA DNS record). Consul doesn’t know an IPv6 address for the node because advertise_addr_ipv6 isn’t set in the config, and that is the source of this error. Very specifically, this line asserts that Consul can answer an AAAA request with an IPv6 address; when it can’t, it returns an empty answer section, which is what produces this error text.
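If you want to reproduce it, an explicit AAAA query (reusing the myservice name from my earlier posts) should trigger the log line even while the plain A lookup keeps working:

$ dig @127.0.0.1 -p 8600 +short AAAA myservice.service.consul

The answer section comes back empty, and the agent logs the error at the same moment.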

I don’t think this error message actually affected anything since we just use IPv4 everywhere, which explains why I couldn’t reproduce any failures in my earlier posts.

To fix it, either remove IPv6 from the querying service (which in our case was an external Docker container that had IPv6 networking configured) or set advertise_addr_ipv6 in the config to an address.
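For the second option, the config change is a one-liner; the address below is from the IPv6 documentation range (2001:db8::/32), so substitute your node’s real address:

advertise_addr_ipv6 = "2001:db8::10"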

It would be nice if Consul were a little more helpful in its log output. I’ve opened a PR for that here: Add a debug level logging message for mismatched DNS Records and IPv4/v6 Addresses by keefertaylor · Pull Request #21552 · hashicorp/consul · GitHub.


Ah, this is really helpful, thanks! Now I just need to track down which of my many containers are trying to resolve AAAA records 🙂 I guess this also means it’s not actually an error worth worrying about.
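If anyone else needs to do the same hunt, one low-tech way (assuming the agent serves DNS on the default port 8600) is to watch which clients are sending queries; with -A the queried names are usually readable in the payload:

$ sudo tcpdump -i any -n -A udp port 8600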

Looks like this is going to be fixed in 1.19.2.