Facing DNS Issues [Consul + Kubernetes]

Dear All,

Good day to you!

The purpose of this question is to understand the end-to-end request/network flow when we run curl nginx-service.virtual.consul, and to resolve the DNS-related issues I am facing.

I have been working with K8s, Nomad, and Consul, and I was able to connect both clusters together through the Consul server. I am using transparent proxy on both ends, and it works for the most part: I can curl the service name nginx-service.virtual.consul from both the k8s and Nomad sides and get results from the workloads running on either cluster. However, I have some issues with the DNS integration. I am also struggling to understand the end-to-end flow that happens from the moment we run curl nginx-service.virtual.consul until we get the result. I kindly seek your expertise to understand and rectify this.

Scenario:
I have connected K8s to an external Consul server using a custom values.yaml file with Helm.

** These Helm values were updated during the process; the changes are listed under the "What I did" section later in this post.

global:
  enabled: false
  logLevel: "debug"
  tls:
    enabled: false
externalServers:
  enabled: true
  hosts: ["192.168.60.10"]
  httpsPort: 8500
server:
  enabled: false
syncCatalog:
  enabled: true
  default: false

I am able to get responses from k8s and Nomad intermittently, but with frequent failures.

K8s pods and Services: Default Namespace

K8s Pods and Services: Consul Namespace

I get the following results from the k8s pod when I run nslookup kubernetes.default and cat /etc/resolv.conf:

I see the following logs in k logs -f k8s-test-pod -c consul-dataplane

[debug] envoy.main(14) flushing stats
[debug] envoy.conn_handler(23) [Tags: "ConnectionId":"463"] new connection from 30.0.1.82:46850
[debug] envoy.connection(23) [Tags: "ConnectionId":"463"] closing socket: 0
[debug] envoy.conn_handler(23) [Tags: "ConnectionId":"463"] adding to cleanup list
[debug] envoy.main(14) flushing stats
[DEBUG] consul-dataplane.dns-proxy.udp: timeout waiting for read: error="read udp 127.0.0.1:8600: i/o timeout"
[debug] envoy.main(14) flushing stats

I do not see any pod running on IP 30.0.1.82, although the log says "new connection from" that address. I also keep seeing this error: [DEBUG] consul-dataplane.dns-proxy.udp: timeout waiting for read: error="read udp 127.0.0.1:8600: i/o timeout"

Also, I see that consul-dataplane is started with -consul-dns-bind-port=8600.

I have CoreDNS pods running in K8s. I can see the following logs there when I run:

k exec -it pod/k8s-test-pod -c k8s-test-pod-container -- curl nginx-service.virtual.consul
Hello, I am running on Nomad!

[INFO] 30.0.1.70:57079 - 54434 "AAAA IN nginx-service.virtual.consul.cluster.local. udp 60 false 512" NXDOMAIN qr,aa,rd 153 0.000420553s
[INFO] 30.0.1.70:57079 - 54174 "A IN nginx-service.virtual.consul.cluster.local. udp 60 false 512" NXDOMAIN qr,aa,rd 153 0.000265927s
[INFO] 30.0.1.70:39432 - 35033 "A IN nginx-service.virtual.consul.default.svc.cluster.local. udp 72 false 512" NXDOMAIN qr,aa,rd 165 0.000224508s
[INFO] 30.0.1.70:39432 - 35303 "AAAA IN nginx-service.virtual.consul.default.svc.cluster.local. udp 72 false 512" NXDOMAIN qr,aa,rd 165 0.000090913s
[INFO] 30.0.1.70:53440 - 20961 "AAAA IN nginx-service.virtual.consul.svc.cluster.local. udp 64 false 512" NXDOMAIN qr,aa,rd 157 0.000257561s
[INFO] 30.0.1.70:53440 - 20712 "A IN nginx-service.virtual.consul.svc.cluster.local. udp 64 false 512" NXDOMAIN qr,aa,rd 157 0.000184247s
[INFO] 30.0.1.70:32838 - 11880 "A IN nginx-service.virtual.consul. udp 46 false 512" NXDOMAIN qr,rd,ra 121 0.006471083s
[INFO] 30.0.1.70:32838 - 12132 "AAAA IN nginx-service.virtual.consul. udp 46 false 512" NXDOMAIN qr,rd,ra 121 0.00661917s

k exec -it pod/k8s-test-pod -c k8s-test-pod-container -- curl nginx-service.virtual.consul
Hello, I am running on Kubernetes!

[INFO] 30.0.1.70:47717 - 3245 "A IN nginx-service.virtual.consul.default.svc.cluster.local. udp 72 false 512" NXDOMAIN qr,aa,rd 165 0.000243007s
[INFO] 30.0.1.70:47717 - 3553 "AAAA IN nginx-service.virtual.consul.default.svc.cluster.local. udp 72 false 512" NXDOMAIN qr,aa,rd 165 0.000565974s
[INFO] 30.0.1.70:60301 - 49101 "AAAA IN nginx-service.virtual.consul. udp 46 false 512" NXDOMAIN qr,rd,ra 121 0.006873433s
[INFO] 30.0.1.70:60301 - 48863 "A IN nginx-service.virtual.consul. udp 46 false 512" NXDOMAIN qr,rd,ra 121 0.057510109s
[INFO] 30.0.1.70:41297 - 15343 "AAAA IN nginx-service.virtual.consul.svc.cluster.local. udp 64 false 512" NXDOMAIN qr,aa,rd 157 0.000129459s
[INFO] 30.0.1.70:41297 - 15080 "A IN nginx-service.virtual.consul.svc.cluster.local. udp 64 false 512" NXDOMAIN qr,aa,rd 157 0.000063974s
[INFO] 30.0.1.70:42347 - 47519 "AAAA IN nginx-service.virtual.consul.cluster.local. udp 60 false 512" NXDOMAIN qr,aa,rd 153 0.000094362s
[INFO] 30.0.1.70:42347 - 47275 "A IN nginx-service.virtual.consul.cluster.local. udp 60 false 512" NXDOMAIN qr,aa,rd 153 0.000055806s

What I did

Added the DNS block to the custom values.yaml file and re-applied it with Helm.

dns:
  enabled: true
  enableRedirection: true

Updated the CoreDNS ConfigMap with the following values:

consul {
    errors
    cache 30
    forward . 10.97.111.170
}

10.97.111.170 is the ClusterIP of service/consul-consul-dns.
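
To double-check the forwarding target itself (just a sketch; dig needs to be available in the test container), the consul-dns Service can also be queried directly using the ClusterIP above:

# Query the consul-dns Service directly (bypassing CoreDNS) to confirm that it
# answers for .consul names; 10.97.111.170 is the ClusterIP mentioned above.
k exec -it pod/k8s-test-pod -c k8s-test-pod-container -- dig @10.97.111.170 nginx-service.virtual.consul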

After that, I could curl continuously without the intermittent errors I saw before.

I then also observed the following errors in the CoreDNS pod logs:

30.0.1.118 is the IP of the CoreDNS pod.

However, I still continuously get the error below when I check the logs with k logs -f pod/k8s-test-pod -c consul-dataplane. I do not see any IP 30.0.1.82 anywhere in k8s; I checked all namespaces.

I still get the following error as well

But I get the result below when running dig nginx-service.virtual.consul:

I do not understand why this still happens even though the connection itself works reasonably well.

My understanding was this: when we curl nginx-service.virtual.consul from a k8s pod, the query first goes to CoreDNS, and because it is a .consul domain, CoreDNS forwards the request to the consul-dns service. From there it gets the IP and port of the sidecar proxy container running alongside the pod. The request is then forwarded to that sidecar, which forwards it to the other (Nomad cluster's) sidecar. Please correct me if I am wrong.

I am a bit stuck on understanding how this flow works and why DNS gives these errors even though I can successfully get results from either cluster.

I am sincerely looking for any assistance.

Thank you!

I tried to come up with a draft diagram to understand the flow. However, I am unable to complete it since I am stuck on how the DNS integration and the rest of the flow take place.

I am calling nginx-service.virtual.consul from inside k8s-test-pod. Likewise, from the Nomad side, I am calling nginx-service.virtual.consul from the Nomad test pod. Either way, I get a response from either K8s or Nomad. I have used the same service name for both the k8s and Nomad workloads.

This is the diagram I have for now.

Any advice on this, along with the DNS issue, would be much appreciated.

Thank you!

There are a few things to unpack here. Before I go into the details of the network flow, could you swap your test pod's container image for a non-Alpine image and see if it works consistently?

Also, you can ignore the timeout waiting for read error from the dataplane container (I will not explain it here to keep things simple).


Now, let us see how the request flow works in the transparent proxy scenario.

This example will use two services, static-client (downstream) and static-server (upstream).

Virtual Addresses

First, we should understand service tagged addresses. When you run a pod with Connect injection, the service is registered in the Consul catalog with two tagged addresses.

/ $ curl https://0:8501/v1/catalog/service/static-server-sidecar-proxy -ks | jq '.[].ServiceTaggedAddresses'
{
  "consul-virtual": {
    "Address": "240.0.0.1",
    "Port": 20000
  },
  "virtual": {
    "Address": "10.43.44.168",
    "Port": 80
  }
}
  • In the output above, the consul-virtual address is generated by Consul and does not exist anywhere in the cluster (it is purely a virtual address, and Consul generates one per service). We will see why this is required shortly.
  • The second tagged address is the ClusterIP of the K8S Service associated with the pod.

Virtual Address → FilterChain Matches

These tagged addresses are populated in the downstream Envoy proxy's filter-chain match rules, which Envoy uses to decide where to send outbound requests destined for specific IP addresses.

  {
   "@type": "type.googleapis.com/envoy.admin.v3.ListenersConfigDump",
   "dynamic_listeners": [
    {
     "name": "outbound_listener:127.0.0.1:15001",
     "active_state": {
      "version_info": "cdf62a5d0b2e0eeff7c58d6cab8ced053f5e78d7a0c7565e7ffbe5a6aa3a6873",
      "listener": {
       "@type": "type.googleapis.com/envoy.config.listener.v3.Listener",
       "name": "outbound_listener:127.0.0.1:15001",
       "address": {
        "socket_address": {
         "address": "127.0.0.1",
         "port_value": 15001
        }
       },
       "filter_chains": [
        {
         "filter_chain_match": {
          "prefix_ranges": [
           {
            "address_prefix": "10.43.44.168",
            "prefix_len": 32
           },
           {
            "address_prefix": "240.0.0.1",
            "prefix_len": 32
           }
          ]
         },
         "filters": [
          {
           "name": "envoy.filters.network.tcp_proxy",
           "typed_config": {
            "@type": "type.googleapis.com/envoy.extensions.filters.network.tcp_proxy.v3.TcpProxy",
            "stat_prefix": "upstream.static-server.default.default.dc1",
            "cluster": "static-server.default.dc1.internal.e0b94ddb-90df-3415-cba6-05321f352acf.consul"
           }
          }
         ]
        }
       ],

In the above Envoy config, you can see the service tagged addresses in the filter_chain_match rule. The rule says that if a request arrives at the outbound listener (port 15001 on localhost) and its destination IP is either of those IPs, it should be sent to the static-server cluster.

Now, how would the traffic flow to port 15001? That is where the transparent proxy kicks in. The transparent proxy will set up iptables rules to force all outbound traffic to be sent to the outbound listener of Envoy.

The Envoy proxy will have all the upstream services registered in Consul (from K8S and Nomad, in your case) populated as Envoy clusters and endpoints (you can see these by querying curl localhost:19000/clusters from inside the pod).
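
If you want to poke at this yourself, here is a rough sketch (run from any container in the downstream pod that has curl and jq; the exact dump structure may vary slightly between Envoy versions):

# Show the filter_chain_match rules on the outbound listener (the tagged
# addresses from the catalog should appear as prefix_ranges here).
curl -s localhost:19000/config_dump | \
  jq '.configs[]
      | select(."@type" | endswith("ListenersConfigDump"))
      | .dynamic_listeners[]?.active_state.listener.filter_chains[]?.filter_chain_match'

# List the clusters and endpoints Envoy knows about (K8S pod IPs, Nomad alloc IPs, etc.).
curl -s localhost:19000/clusters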

Consul Dataplane DNS Proxy

This is another important thing to understand when you enable transparent proxy. The consul-dataplane container (in connect-injected pods) runs a DNS proxy that connects to the Consul servers (over the gRPC port) to serve DNS requests. When you run transparent proxy and dns.enableRedirection is set to true, every pod gets 127.0.0.1 as the first nameserver in its /etc/resolv.conf.

❯ k get pods static-client-779f57cdbc-5kqsb -o jsonpath="{.spec.dnsConfig.nameservers}"
["127.0.0.1","10.43.0.10"]%

With this in place, the first nameserver to query is 127.0.0.1, and if it fails to return an answer, the DNS resolver then queries KubeDNS (the second IP in the above list).

So static-server.virtual.consul will be resolved by 127.0.0.1, while static-server will be resolved by KubeDNS (which returns the ClusterIP associated with the K8S Service).
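
You can see this split from inside the downstream pod with a couple of targeted queries (a sketch only; it assumes dig is available in the image and that the dataplane DNS proxy is on its default bind port 8600):

# Ask the consul-dataplane DNS proxy directly: returns the Consul virtual IP.
dig +short -p 8600 @127.0.0.1 static-server.virtual.consul
# e.g. 240.0.0.1

# Ask KubeDNS (the second nameserver above) for the Service name: returns the ClusterIP.
dig +short @10.43.0.10 static-server.default.svc.cluster.local
# e.g. 10.43.44.168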

Request Flow

Now, let us look at how a request is made and the entire workflow. We will use curl as the application that initiates the request to an upstream.

  • From the downstream pod, you run curl static-server.virtual.consul or curl static-server.
  • As explained in the previous section, the DNS resolver queries the DNS, and either the consul-virtual IP or the ClusterIP is returned (depending on the name queried).
  • The request is initiated towards the IP returned by the resolver. This is the exciting bit: the client tries to connect to the consul-virtual address (even though that IP doesn't exist anywhere).
  • Because of transparent proxy, the iptables rules inside the pod force the request to be redirected to the Envoy outbound listener (port 15001 on 127.0.0.1).
  • Envoy then looks at the destination IP of the connection, matches it against the filter_chain_match rules (explained above), and routes the traffic to one of the clusters (Envoy picks the upstream round-robin).

This is why the Kubernetes pod can talk to the Nomad allocation (and vice versa) even though neither Nomad nor K8S knows about the other.
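
You can watch the flow from the downstream pod with a verbose curl; the output below is only a sketch of what it roughly looks like (the virtual IP will differ per service):

curl -v static-server.virtual.consul
# *   Trying 240.0.0.1:80...
# * Connected to static-server.virtual.consul (240.0.0.1) port 80
# ...the response comes back from one of the upstream instances (K8S or Nomad),
# even though 240.0.0.1 is not assigned to anything: iptables handed the
# connection to Envoy, which proxied it over the mesh.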

I hope this makes it clear how a non-existent IP (240.0.0.x) is returned for the .virtual.consul DNS queries and how the request magically reaches the destination service, even crossing the boundaries of K8S (and vice-versa).

Consul orchestrates all of this.

I hope this helps.

Ref: Transparent Proxy on Consul Service Mesh


Dear @Ranjandas,

I am so thankful for your response.

I changed the k8s-test-pod container's Alpine image to an Ubuntu image, and the "connection refused" error disappeared.

Before change:

After change:

Now I am only getting these NXDOMAIN messages. I think that is because Kubernetes appends search suffixes like .svc.cluster.local to nginx-service.virtual.consul. But I don't think keeping it like that causes any issue.
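
As a side note, querying the name with a trailing dot should make the resolver treat it as fully qualified and skip those search suffixes (just a sketch, not something I necessarily need):

# Trailing dot = fully qualified name, so the resolver should skip the
# .default.svc.cluster.local / .svc.cluster.local / .cluster.local search suffixes.
k exec -it pod/k8s-test-pod -c k8s-test-pod-container -- curl nginx-service.virtual.consul.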

I am able to understand the request flow thanks to your nice explanation. Thank you so much for that. I have also completed the diagram as per my understanding. I am attaching it here and would sincerely appreciate it if you could kindly take a look and see if anything is wrong.

  1. When curl nginx-service.virtual.consul is run from k8s-test-pod, the query goes to the consul-dataplane's DNS resolver.
  2. It returns the consul-virtual IP.
  3. If curl nginx-service is used, consul-dataplane sends the request to CoreDNS.
  4. Consul-dataplane then returns the ClusterIP associated with the k8s Service.
  5. Then, when k8s-test-pod tries to reach either the consul-virtual IP or the ClusterIP of the Service, the request is intercepted by the transparent proxy's iptables rules.
  6. iptables forces the request to the Envoy proxy.
  7. Thereafter, based on the destination IP, Envoy routes the traffic to the Envoy proxy in the nginx pod's sidecar in k8s.
  8. That Envoy proxy forwards the request to the nginx container.
  9. Alternatively, the Envoy proxy in k8s-test-pod forwards the request to the connect-proxy-nginx-service allocation in Nomad.
  10. That proxy then forwards the request to nginx running in Nomad.

Notes:

  • Steps 7 and 9 occur in a round-robin fashion.
  • I added two arrows from k8s and Nomad to show that they are connected to DC1 in the Consul control plane. Other than that, I do not know whether the Consul control plane has to be integrated into the request flow.

The flow remains the same when started from the Nomad side (nomad-test-pod).

Please correct me if I have made any mistakes while creating the diagram.

Thank you once again!

A few corrections here:

  1. The consul-dataplane DNS proxy only talks to Consul DNS (running in the Consul servers; in your case, the external server) and not to CoreDNS.
  2. When nginx-service is queried, the dataplane DNS proxy won't return any response. It is the system resolver that queries the next DNS server from /etc/resolv.conf and gets back the ClusterIP.

The rest of it is accurate. :+1:


Dear @Ranjandas,

Thank you so much for the clarification.

I sincerely appreciate your time, advice, and support. They are absolutely invaluable to me.

Have a nice day!
