Old Nomad on Debian 12 cannot resolve vault.service.consul

Hello,

For legacy software reasons, we are stuck on Nomad 0.12. Recently, we deployed a new node that, unlike all the others, runs Debian 12 (the rest run CentOS).

While it seemed to work fine at first, it has recently started failing and it's not able to run jobs that use Vault. The client configuration contains:

vault {
  enabled = true
  address = "http://vault.service.consul:8200"
}

In the nomad logs the following can be seen:

dial tcp: lookup vault.service.consul: no such host

The node runs Nomad, Consul, Vault and dnsmasq as systemd services, and they are all running. Even stranger, running either of the following on the same node:

dig vault.service.consul

or

host vault.service.consul

works fine (the DNS entry resolves to the IPs of the cluster).

Restarting Nomad shows in the logs that the fingerprinter does not detect a Vault installation (probably because it fails to resolve vault.service.consul). This theory is supported by the fact that if we change the Nomad client configuration to reach Vault via the node IP instead of vault.service.consul, the Vault installation is detected and jobs that use Vault can be scheduled.

I am looking for help figuring out how to fix this so that we can keep using vault.service.consul in the configuration. For context, this problem does not exist on the nodes running CentOS.

Thanks in advance :sweat_smile:

Out of interest, are you using one of the options for Consul DNS forwarding mentioned here:

I know, for instance, that there are some issues with older versions of systemd-resolved that don’t allow setting the port, so seeing what you’ve got set up for Consul DNS would be very useful.

Hey, thanks for the reply.

To be consistent with the CentOS nodes, I set it up with dnsmasq. To get it working on Debian 12, I did the following:

First, I installed dnsmasq.

Then I created the following file at /etc/systemd/resolved.conf.d/dnsmasq.conf:

[Resolve]
DNSStubListener=no
DNS=127.0.0.1 168.63.129.16

And the following at /etc/dnsmasq.d/10-consul.conf:

server=168.63.129.16
server=/service.consul/<SERVER_PRIVATE_IP>#8600
max-cache-ttl=5
log-queries
cache-size=999

Then I restarted systemd-resolved and dnsmasq. 168.63.129.16 is Azure’s magic IP for DNS resolution (just consider it a custom public DNS server for non-Consul requests).

I thought the setup was working, because this works:
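
To double-check the result after those restarts, something like the following could be used. It is only a sketch: the helper names `nameservers` and `inspect_dns` are made up for illustration, and it assumes `ss` (from iproute2) and `resolvectl` are installed on the node.

```shell
#!/bin/sh
# List the nameserver addresses from a resolv.conf-style file, in order.
# Defaults to /etc/resolv.conf when no path is given.
nameservers() {
    awk '$1 == "nameserver" { print $2 }' "${1:-/etc/resolv.conf}"
}

# Hypothetical helper: print the pieces of the setup described above.
inspect_dns() {
    # /etc/resolv.conf is often a symlink managed by systemd-resolved
    echo "resolv.conf -> $(readlink -f /etc/resolv.conf)"
    nameservers /etc/resolv.conf

    # Sockets listening on port 53 (run as root to see process names);
    # with DNSStubListener=no, dnsmasq should be the one holding it
    ss -lntup 'sport = :53'

    # What systemd-resolved itself is configured with, and what it answers
    resolvectl status
    resolvectl query vault.service.consul
}
```

Running `inspect_dns` as root prints everything in one go. The `resolvectl query` line asks systemd-resolved rather than going through /etc/resolv.conf the way `dig` does, so a failure there while `dig` succeeds would suggest the two are not taking the same path.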

root@nomadw01-deb:~# dig +short vault.service.consul
10.x.x.x
10.x.x.x
10.x.x.x
10.x.x.x
10.x.x.x
10.x.x.x
root@nomadw01-deb:~# dig +short google.com
172.217.23.110

with the following in the dnsmasq logs:

Jan 09 16:10:05 nomadw01-deb dnsmasq[722]: forwarded vault.service.consul to <SERVER_PRIVATE_IP>#8600
Jan 09 16:10:05 nomadw01-deb dnsmasq[722]: reply vault.service.consul is 10.X.X.X
Jan 09 16:14:50 rg-nomadw01-deb dnsmasq[722]: query[A] google.com from 127.0.0.1
Jan 09 16:14:50 rg-nomadw01-deb dnsmasq[722]: forwarded google.com to 168.63.129.16
Jan 09 16:14:50 rg-nomadw01-deb dnsmasq[722]: reply google.com is 172.217.23.110

The confusing part is that it works with the dig and host commands, but Nomad does not seem to resolve it.

One thing that stands out to me is what grep hosts /etc/nsswitch.conf returns.

On Ubuntu 22.04 I get:
hosts: files mdns4_minimal [NOTFOUND=return] dns

This can cause differences between tools that query the DNS server directly (such as dig and host) and services that resolve names through the sources listed on this line.
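
One way to see that difference on a given node is to run the same name through both paths. `dig` and `host` send the query straight to the servers in /etc/resolv.conf, while `getent hosts` goes through the `hosts:` line of /etc/nsswitch.conf. Whether Nomad 0.12 follows the nsswitch path is an assumption here, not something I’ve confirmed, and the function names below are just for illustration.

```shell
#!/bin/sh
# Print the NSS sources configured for host lookups in the given file
# (defaults to /etc/nsswitch.conf).
hosts_sources() {
    sed -n 's/^hosts:[[:space:]]*//p' "${1:-/etc/nsswitch.conf}"
}

# Resolve one name via both paths and print the addresses side by side.
compare_lookup() {
    name="$1"
    # Direct DNS query to the servers listed in /etc/resolv.conf
    printf 'dig:    %s\n' "$(dig +short "$name" 2>/dev/null | tr '\n' ' ')"
    # Lookup through the NSS sources (files, resolve, dns, ...)
    printf 'getent: %s\n' "$(getent hosts "$name" 2>/dev/null | awk '{ print $1 }' | tr '\n' ' ')"
}
```

For example, `hosts_sources` followed by `compare_lookup vault.service.consul`. If the `dig:` line shows addresses and the `getent:` line is empty, the problem would sit in one of the sources before `dns` rather than in dnsmasq.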

In the dnsmasq config, I recall the order mattering. I don’t believe it should, but putting the specific filter before the catch-all usually makes sense, so I would be inclined to put server=168.63.129.16 below the Consul line.

Also, is Consul listening on all interfaces or just the external interface? If all, the line could be changed to server=/service.consul/127.0.0.1#8600, which again shouldn’t make a difference but simplifies the setup a little.

Hi!

The command you provided returns the following:

hosts:          files resolve [!UNAVAIL=return] dns

I will try changing the order just for peace of mind, but I was also under the impression that it does not matter.

Consul is listening only on the external interface.

Thanks for your help :slight_smile:

From what I understand, which isn’t a great deal, systemd-resolved by default listens on localhost only, so I’m wondering if something about the external interface is stopping that from working. If you set dnsmasq to run on port 53 and manually set your resolv.conf to nameserver 127.0.0.1, does that work?
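
A rough sketch of that test, assuming dnsmasq is already listening on 127.0.0.1:53. `write_local_resolv` is a hypothetical helper; it takes the path as an argument so it can be tried on a scratch file before touching /etc/resolv.conf, and it keeps a `.bak` copy of the previous contents.

```shell
#!/bin/sh
# Replace a resolv.conf (possibly a symlink) with a single local nameserver.
write_local_resolv() {
    target="${1:-/etc/resolv.conf}"
    # Save the current contents; -L copies the file a symlink points to
    if [ -e "$target" ]; then
        cp -L "$target" "$target.bak"
    fi
    # Remove the old file or symlink so the write below does not follow it
    rm -f "$target"
    printf 'nameserver 127.0.0.1\n' > "$target"
}
```

As root, that would be `write_local_resolv /etc/resolv.conf`, then `getent hosts vault.service.consul`, then a Nomad restart to see whether the fingerprinter picks Vault up. Other tools (a DHCP client, for example) may rewrite the file later, so treat this as a test rather than a permanent fix.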