Consul DNS forwarding

Hello,

We’re using Consul DNS forwarding within Nomad but running into an issue.
We initially tried the forwarding with dnsmasq using the config below:

server=/consul/127.0.0.1#8600
cache-size=4000
dns-forward-max=5000
local-ttl=30
log-queries

and hit an issue when running a lot of queries at once: DNS intermittently fails to resolve and returns “bad address” from inside containers running in Nomad.
Initially thinking it was a dnsmasq issue, we also tried Unbound and got the same result.
Unbound configuration:

#Allow insecure queries to local resolvers
server:
  do-not-query-localhost: no
  domain-insecure: "consul"

#Add consul as a stub-zone
stub-zone:
  name: "consul"
  stub-addr: 127.0.0.1@8600

which now makes me believe it could potentially be an issue with Consul itself.
Consul configuration below:

datacenter                 = "aws"
data_dir                   = "/opt/consul"
log_level                  = "DEBUG"
node_name                  = "nomad-1"
advertise_addr             = "192.168.1.1"
encrypt                    = ""

tls {
  defaults {
    ca_file                = "/etc/consul.d/certs/consul-agent-ca.pem"
    ca_path                = "/etc/consul.d/certs"
    cert_file              = "/etc/consul.d/certs/server-consul-0.pem"
    key_file               = "/etc/consul.d/certs/server-consul-0-key.pem"
    verify_incoming        = true
    verify_outgoing        = true
  }
  internal_rpc {
    verify_server_hostname = true
  }
}

auto_encrypt {
  allow_tls                = true
}

retry_join                 = ["192.168.1.1","192.168.1.2","192.168.1.3","192.168.1.4","192.168.1.5"]

acl {
  enabled                  = true
  default_policy           = "allow"
  enable_token_persistence = true
}

performance {
  raft_multiplier          = 1
}

server                     = true
bootstrap_expect           = 5
bind_addr                  = "192.168.1.1"
client_addr                = "0.0.0.0"

# Enable service mesh
connect {
  enabled                  = true
}

# Addresses and ports
addresses {
  grpc                     = "127.0.0.1"
  https                    = "0.0.0.0"
  dns                      = "127.0.0.1"
}

ports {
  grpc                     = 8502
  grpc_tls                 = 8503
  http                     = 8500
  https                    = 8443
  dns                      = 8600
}

# DNS Recursion
recursors = ["1.1.1.1"]
dns_config {
  node_ttl = "10s"
  service_ttl {
    "*" = "15s"
  }
}
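
For anyone debugging something similar, querying Consul directly on port 8600 (bypassing dnsmasq/Unbound) helps show which layer is failing; the service name below is just the one from our setup:

# ask Consul directly on 8600
dig @127.0.0.1 -p 8600 grafana-postgres.service.consul +short
# ask through the local forwarder on 53
dig @127.0.0.1 grafana-postgres.service.consul +short

If the first query always succeeds while the second intermittently fails, the forwarder (or the path to it) is at fault.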

Curious if anyone has seen anything like this before?
Here’s an example of the bad address error occurring:

seq 1 50 | xargs -P50 -I{} nc -zv grafana-postgres.service.consul 5443
grafana-postgres.service.consul (10.10.10.202:5443) open
grafana-postgres.service.consul (10.10.10.202:5443) open
grafana-postgres.service.consul (10.10.10.202:5443) open
(... same line repeated; 44 of the 50 connections succeeded ...)
nc: bad address 'grafana-postgres.service.consul'
nc: bad address 'grafana-postgres.service.consul'
nc: bad address 'grafana-postgres.service.consul'
nc: bad address 'grafana-postgres.service.consul'
nc: bad address 'grafana-postgres.service.consul'
nc: bad address 'grafana-postgres.service.consul'

If anyone comes across this: the issue wasn’t related to Consul or Nomad at all.
It was actually an issue with Alpine and how it handles multiple DNS servers: musl libc sends each query to all configured nameservers in parallel and takes the first answer, so 1.1.1.1 can reply NXDOMAIN for .consul names before the local resolver gets a chance to.
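
For context, the container’s /etc/resolv.conf looked roughly like this (the 172.17.x address is illustrative, standing in for wherever dnsmasq/Unbound listens):

nameserver 1.1.1.1
nameserver 172.17.0.1

With musl, both nameservers receive every query, so the fix is to make sure only the local forwarder is listed.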

Fixed it by removing 1.1.1.1 from the container’s DNS configuration and instead adding it as a recursor in Consul:

recursors = ["1.1.1.1"]
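
In Nomad, the container DNS can be set per task group with the network dns block; a minimal sketch, assuming the local forwarder listens on the Docker bridge address 172.17.0.1 (adjust to wherever dnsmasq/Unbound actually binds):

group "db" {
  network {
    dns {
      # only the local forwarder; *.consul goes to Consul on 8600,
      # everything else recurses to 1.1.1.1 via Consul's recursors
      servers = ["172.17.0.1"]
    }
  }
}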