Confused about Consul DNS / Can't get it to work with KrakenD

(Note: All of this is running on Ubuntu Server 18.04 LTS - virgin VMs)

Long story short, on the KrakenD server, if I do a ‘dig SRV’ for a registered service the way KrakenD would, I get this:
$ dig identity-server.service.consul SRV +short
1 1 80 0a1e3730.addr.stage-vm.consul.

Can anyone explain the results to me? I don’t care about the 1s, and 80 is the port, but what’s “0a1e3730.addr”? (I know “stage-vm” is the data center and “consul” is obvious).

As I’ve been looking at other people’s examples online, I would’ve expected the result to look something like “identity-server.node.stage-vm.consul.” Or maybe an actual IP address?

Here’s the trick: If I use “0a1e3730.addr.stage-vm.consul.” in curl, it works fine…

KrakenD is configured to use “identity-server.service.consul”. Looking at tcpdump, it does appear to query Consul, which in turn returns “0a1e3730.addr.stage-vm.consul.”, but that’s the end - from what I can tell, KrakenD asks Consul, gets an answer, tries to use it, and returns “no hosts available” (so KrakenD thinks that thing is a hostname?)

I know there’s a lot of information here, but to summarize: I’m worried that Consul is returning something “weird” (for lack of a better word), and it’s breaking KrakenD. Since I don’t understand what it’s returning (or why), I don’t know what to look at to fix this (and for all I know this is just a red herring).

One more thing related to all this which I find interesting: If I run the dig command above without the +short option, it still just returns that one record I showed above.

But if I run it by specifying Consul as the NS, I get more information:
$ dig @10.30.54.161 identity-server.service.consul. SRV

;; ADDITIONAL SECTION:
0a1e3730.addr.stage-vm.consul. 0 IN     A       10.30.55.48
consul1-vm.node.stage-vm.consul. 0 IN   TXT     "consul-network-segment="

This whole Additional Section is missing from the first dig query… and to my eyes (as a Linux noob), it looks important. But I have no idea why the queries are different - in theory both queries are going to the same Consul instance.

For background, DNS resolution is being done via systemd. The resolved.conf file points to Consul, and I have some iptables rules to redirect port 53 to 8600. I know this all works (at least on some level) because curl works fine with “identity-server.service.consul”.

I’ve been digging away at this for days. I appreciate any help, even if it’s just fragments of an answer, it may be enough to at least get me pointed in the right direction.

First, it's important to note that the SRV record is going to contain a hostname and port. From RFC 2782:

Target

The domain name of the target host. There MUST be one or more address records for this name, the name MUST NOT be an alias (in the sense of RFC 1034 or RFC 2181). Implementors are urged, but not required, to return the address record(s) in the Additional Data section. Unless and until permitted by future standards action, name compression is not to be used for this field. A Target of “.” means that the service is decidedly not available at this domain.

Consul’s DNS does not have the ability to reference an individual service instance such as service identity-service running on node foo (I might write up a feature request for how I think we could accomplish this though as it seems like we could make it work). So the problem then is that Consul must return some name in the Target of the SRV record that a resolver can turn around and make successful A/AAAA queries for.

When the service address matches the node's address it is registered with, Consul will return a target of <node name>.node.<datacenter>.consul.. This is because Consul’s DNS already has the capability to uniquely reference an individual node. However, when the service address is different, Consul must provide some name that a resolver can then query for A/AAAA records. The approach Consul takes at the moment is to return something like: <hex encoded ip>.addr.<datacenter>.consul.

While the name there might look a little odd, this is in fact how SRV records are supposed to work. The specifics of the hex encoding are Consul-specific, but it definitely falls within the DNS spec.
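The hex label is just the four IPv4 octets in hex: 0a.1e.37.30 is 10.30.55.48, which matches the A record in your ADDITIONAL section. A minimal sketch of the decoding (the function name `decodeAddrLabel` is mine, not anything from Consul's code):

```go
package main

import (
	"encoding/hex"
	"fmt"
	"net"
)

// decodeAddrLabel turns the hex label from a <hex>.addr.<dc>.consul.
// SRV target back into a dotted-quad IPv4 address.
func decodeAddrLabel(label string) (string, error) {
	b, err := hex.DecodeString(label)
	if err != nil {
		return "", err
	}
	if len(b) != net.IPv4len {
		return "", fmt.Errorf("expected 4 octets, got %d", len(b))
	}
	return net.IP(b).String(), nil
}

func main() {
	ip, err := decodeAddrLabel("0a1e3730")
	if err != nil {
		panic(err)
	}
	fmt.Println(ip) // prints "10.30.55.48"
}
```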

As for the actual issue you are experiencing, my guess is that the systemd resolver is not forwarding the ADDITIONAL section on to you. This is not great, as that section is meant to prevent additional round trips back to the DNS server to resolve names the server already provided because it knew you would need them. However, even when the ADDITIONAL section is not used, a DNS resolver should make an A or AAAA record query after receiving the SRV RR. If you are not seeing that happen in the packet capture, then something is wrong with the resolver.
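For illustration, the client-side behavior described here can be sketched as follows. This is not a real resolver API - `srvRR`, `resolveTarget`, and the injected `lookupA` function are all hypothetical stand-ins - it just shows the decision: use the A record the server already supplied in the ADDITIONAL section if it was forwarded, otherwise issue a separate A query for the target.

```go
package main

import "fmt"

// srvRR is a minimal stand-in for a parsed SRV answer (hypothetical type).
type srvRR struct {
	Target string
	Port   uint16
}

// resolveTarget picks the address for an SRV target: prefer the A record
// from the ADDITIONAL section, otherwise fall back to a separate A lookup
// (injected as a function so this sketch stays offline).
func resolveTarget(rr srvRR, additional map[string]string, lookupA func(string) string) string {
	if ip, ok := additional[rr.Target]; ok {
		return ip // no extra round trip needed
	}
	return lookupA(rr.Target) // the follow-up query a resolver should make
}

func main() {
	rr := srvRR{Target: "0a1e3730.addr.stage-vm.consul.", Port: 80}
	additional := map[string]string{"0a1e3730.addr.stage-vm.consul.": "10.30.55.48"}
	fmt.Println(resolveTarget(rr, additional, func(string) string { return "" }))
}
```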

All very interesting stuff! I will go back to KrakenD and see what they have to say about all this!

Just to clarify, when you say “When the service address matches the node's address it is registered with”, do you mean “When the service is running on the same machine as Consul”? Because that’s the opposite of our architecture - we plan on having all the services run on different machines, and Consul run on other machines still.

When you register a service with a Consul agent, you specify it like so:

{
  "service": {
    "id": "redis",
    "name": "redis",
    "tags": ["primary"],
    "address": "",
    "meta": {
      "meta": "for my service"
    }
  }
}

When the address field is empty or unspecified then the agent/node address of the agent you are registering the service with will be used for DNS replies.
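Putting the two rules together (empty address inherits the agent/node address; a matching address yields a node-name target, a differing one yields a hex addr label), the SRV target selection can be sketched like this. This is my reconstruction of the behavior described above, not Consul's actual code, and `srvTarget` is a hypothetical name:

```go
package main

import (
	"encoding/hex"
	"fmt"
	"net"
)

// srvTarget sketches the described rule for choosing an SRV target name.
func srvTarget(serviceAddr, nodeAddr, nodeName, dc string) string {
	if serviceAddr == "" {
		serviceAddr = nodeAddr // empty address inherits the agent/node address
	}
	if serviceAddr == nodeAddr {
		// addresses match: the node can be referenced by name
		return fmt.Sprintf("%s.node.%s.consul.", nodeName, dc)
	}
	// addresses differ: fall back to a hex-encoded addr label
	return fmt.Sprintf("%s.addr.%s.consul.", hex.EncodeToString(net.ParseIP(serviceAddr).To4()), dc)
}

func main() {
	fmt.Println(srvTarget("", "10.30.54.161", "consul1-vm", "stage-vm"))
	fmt.Println(srvTarget("10.30.55.48", "10.30.54.161", "consul1-vm", "stage-vm"))
}
```

The first call prints a node-style target; the second, where the service address differs from the node address, prints the hex-label form you were seeing.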

Hmm, I’m not sure I follow: I changed the address to blank, and had it register itself with a Consul CLIENT agent (NOT server) and now I do in fact get the “normal” host names, like this:

$ dig +short identity-server.service.consul. SRV

1 1 80 consul1-vm.node.stage-vm.consul.

But if I then dig that:
$ dig +short consul1-vm.node.stage-vm.consul.
The IP address it returns is that of the Consul SERVER (consul1-vm). We’re going to have a ton of services, but we don’t want to run a Consul server for each one - we only want to run a handful of servers (if we have to run a bunch of Consul clients then that’s ok).

I appreciate your help!

Running a few servers is the recommended way. Typically either 3 or 5.

As for it returning the IP of the server node and not the client: that is odd. How did you register them: config file or API? If through the API, did you use the Agent API or the Catalog API?

If you curl <consul addr>/v1/catalog/service/identity-server it should show you the full service definition which will look something like the following:

[
  {
    "ID": "40e4a748-2192-161a-0510-9bf59fe950b5",
    "Node": "foobar",
    "Address": "192.168.10.10",
    "Datacenter": "dc1",
    "TaggedAddresses": {
      "lan": "192.168.10.10",
      "wan": "10.0.10.10"
    },
    "NodeMeta": {
      "somekey": "somevalue"
    },
    "CreateIndex": 51,
    "ModifyIndex": 51,
    "ServiceAddress": "172.17.0.3",
    "ServiceEnableTagOverride": false,
    "ServiceID": "32a2a47f7992:nodea:5000",
    "ServiceName": "foobar",
    "ServicePort": 5000,
    "ServiceMeta": {
        "foobar_meta_value": "baz"
    },
    "ServiceTaggedAddresses": {
      "lan": {
        "address": "172.17.0.3",
        "port": 5000
      },
      "wan": {
        "address": "198.18.0.1",
        "port": 512
      }
    },
    "ServiceTags": [
      "tacos"
    ]
  }
]

The Node name at the top level of each object in the list is the name of the Consul node that the service was registered with. If you used the Agent API of the client to register, then that Consul client's node name should be there. Otherwise, if you used the Catalog API then you could have provided any node name.
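To check that programmatically, the catalog response can be decoded and the Node field inspected. A minimal sketch that keeps only the fields relevant here (`nodesFor` is my name for the helper, not a Consul API):

```go
package main

import (
	"encoding/json"
	"fmt"
)

// catalogService holds just the fields we care about from
// /v1/catalog/service/<name>; the full response has many more.
type catalogService struct {
	Node           string
	Address        string
	ServiceAddress string
	ServicePort    int
}

// nodesFor extracts which Consul node each service instance
// was registered with.
func nodesFor(body []byte) ([]string, error) {
	var svcs []catalogService
	if err := json.Unmarshal(body, &svcs); err != nil {
		return nil, err
	}
	names := make([]string, 0, len(svcs))
	for _, s := range svcs {
		names = append(names, s.Node)
	}
	return names, nil
}

func main() {
	body := []byte(`[{"Node":"foobar","Address":"192.168.10.10","ServiceAddress":"172.17.0.3","ServicePort":5000}]`)
	nodes, err := nodesFor(body)
	if err != nil {
		panic(err)
	}
	fmt.Println(nodes) // prints "[foobar]"
}
```

If the names printed are your Consul servers rather than the client you registered with, that narrows down where the registration went wrong.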

Thanks again for the info. I’m using the Agent API. I’m going to pore over your notes and double-check everything on my end. Cheers!

We can close this. I got Consul to return “normal” records (without the hex values) but it didn’t fix the underlying issue: Go made a change where it no longer accepts compressed DNS records. Thanks again for your help!