So I did some digging today. This is all on version 19.1.
If you add --log-level=debug to your startup command, you’ll see where this error comes from. I can now see:
2024-07-17T18:04:29.437-0700 [ERROR] agent.dns: error serializing DNS results: error="no data"
2024-07-17T18:04:29.437-0700 [DEBUG] agent.dns: no data available: name=myservice.service.consul.
So for some reason, this node is answering that it doesn’t have any DNS data for myservice. What’s weird is that myservice is definitely a legitimate service, and the consul node knows about it:
$ dig @127.0.0.1 -p 8600 +short myservice.service.consul
100.104.105.106
It’s also not specific to one service: different services are named (seemingly at random), though they appear to be a subset of all the services we have.
I was wondering if there was some race condition, so I ran this in a loop:
$ while true; do dig @127.0.0.1 -p 8600 +short myservice.service.consul; sleep 1; done
and I seem to always get results, even as I watch the consul node output this message.
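For anyone who wants to repeat this check, here is a rough sketch of the same loop that timestamps each result, so an empty answer can be lined up against the timestamps in the agent log. The function names (classify_answer, watch_dns) and the optional record-type argument are my own additions for illustration; only the dig invocation comes from what I actually ran.

```shell
#!/usr/bin/env bash
# Hypothetical helper names; only the dig invocation is taken from the post.

# Print EMPTY if dig returned no answer text, OK otherwise.
classify_answer() {
  if [ -z "$1" ]; then
    echo "EMPTY"
  else
    echo "OK"
  fi
}

# Query once per second and print a timestamped status line.
# $1 = name to resolve, $2 = record type (defaults to A; an assumption).
watch_dns() {
  local name="$1" qtype="${2:-A}" answer
  while true; do
    answer="$(dig @127.0.0.1 -p 8600 +short "$name" "$qtype")"
    printf '%s %s %s %s\n' \
      "$(date '+%Y-%m-%dT%H:%M:%S')" "$qtype" "$name" "$(classify_answer "$answer")"
    sleep 1
  done
}

# Example (runs until interrupted):
#   watch_dns myservice.service.consul
```

Passing a second argument repeats the check for another record type, in case the log line is triggered by a query type other than the one I was testing by hand. That is only a guess on my part.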
Here’s my config, if that’s at all helpful to anyone who stumbles upon this:
advertise_addr = "x.x.x.x"
advertise_addr_ipv4 = "x.x.x.x"
auto_reload_config = true
bind_addr = "0.0.0.0"
bootstrap_expect = 6
check_update_interval = "60s"
client_addr = "0.0.0.0"
data_dir = "/consul"
datacenter = "dc1"
dns_config = {
  allow_stale = true
  max_stale = "45s"
  service_ttl {
    "*" = "60s"
  }
  node_ttl = "300s"
  only_passing = true
}
autopilot {
  min_quorum = 4.0
}
retry_join = ["node1.internal", "node2.internal", "node3.internal", "node4.internal", "node5.internal", "node6.internal"]
server = true
node_name = "node1"
ui_config = {
  enabled = true
}
I’m at a loss as to what the root cause here is, but I’m at least becoming convinced this error is mostly a red herring and doesn’t actually affect anything (at least AFAICT). I’d still love to know what’s causing it though.