Dear All,
Good day to you!
The purpose of this question is to understand the end-to-end request/network flow when we run `curl nginx-service.virtual.consul`, and to get help with the DNS-related issues I am facing.
I have been working with K8s, Nomad, and Consul, and I was able to connect both clusters together through the Consul server. I am using transparent proxy on both ends, and it mostly works: I can curl the service name `nginx-service.virtual.consul` from both the K8s and Nomad sides and get responses from the workloads running on either cluster. However, I have some issues with the DNS integration, and I am struggling to understand the end-to-end flow that happens when we curl `nginx-service.virtual.consul` until we get the result. I kindly seek your expertise to understand and rectify this.
Scenario:
I have connected K8s to an external Consul server using a custom values.yaml file with Helm.
**These Helm values were updated during the process; the changes are described under the section "What I did" later in this post.**
```yaml
global:
  enabled: false
  logLevel: "debug"
  tls:
    enabled: false
externalServers:
  enabled: true
  hosts: ["192.168.60.10"]
  httpsPort: 8500
server:
  enabled: false
syncCatalog:
  enabled: true
  default: false
```
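As a quick sanity check on this setup, this is roughly what I used to confirm both sides see each other (a sketch; it assumes the chart was installed into a `consul` namespace and that the Consul CLI is available on the external server at 192.168.60.10):

```shell
# On the K8s side: check the pods created by the chart (injector, sync-catalog, webhook, etc.)
kubectl get pods -n consul

# On the external Consul server: confirm cluster members and the services
# registered/synced from both the K8s and Nomad sides
consul members
consul catalog services
```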
I can get responses from both the K8s and Nomad workloads, but only intermittently, with frequent failures.
K8s pods and Services: Default Namespace
K8s Pods and Services: Consul Namespace
I get the following results from the K8s pod when I run `nslookup kubernetes.default` and `cat /etc/resolv.conf`:
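(For completeness, these are the commands in question, run via kubectl; the expected output noted in the comments is my assumption for a standard cluster, not my actual output.)

```shell
# Show the pod's resolver config; normally this points at the cluster DNS ClusterIP
# and carries the cluster.local search domains with ndots:5
kubectl exec -it pod/k8s-test-pod -c k8s-test-pod-container -- cat /etc/resolv.conf

# Confirm that in-cluster DNS resolution works at all
kubectl exec -it pod/k8s-test-pod -c k8s-test-pod-container -- nslookup kubernetes.default
```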
I see the following logs in `k logs -f k8s-test-pod -c consul-dataplane`:

```
[debug] envoy.main(14) flushing stats
[debug] envoy.conn_handler(23) [Tags: "ConnectionId":"463"] new connection from 30.0.1.82:46850
[debug] envoy.connection(23) [Tags: "ConnectionId":"463"] closing socket: 0
[debug] envoy.conn_handler(23) [Tags: "ConnectionId":"463"] adding to cleanup list
[debug] envoy.main(14) flushing stats
[DEBUG] consul-dataplane.dns-proxy.udp: timeout waiting for read: error="read udp 127.0.0.1:8600: i/o timeout"
[debug] envoy.main(14) flushing stats
```
I do not see any pod running with IP 30.0.1.82, although the log says "new connection from" that address, and the `[DEBUG] consul-dataplane.dns-proxy.udp: timeout waiting for read: error="read udp 127.0.0.1:8600: i/o timeout"` error keeps repeating.
I also see that the consul-dataplane container is started with `-consul-dns-bind-port=8600`.
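To check whether that DNS proxy actually answers, I believe it can be queried directly from inside the pod (a sketch; this assumes `dig` is available in the app container image):

```shell
# Query the consul-dataplane DNS proxy bound to 127.0.0.1:8600 inside the pod's network namespace
kubectl exec -it pod/k8s-test-pod -c k8s-test-pod-container -- \
  dig @127.0.0.1 -p 8600 nginx-service.virtual.consul

# For comparison, the same name via the pod's default resolver
kubectl exec -it pod/k8s-test-pod -c k8s-test-pod-container -- \
  dig nginx-service.virtual.consul
```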
I have CoreDNS pods running in K8s, and I can see the following logs there when I run the curl:

```
$ k exec -it pod/k8s-test-pod -c k8s-test-pod-container -- curl nginx-service.virtual.consul
Hello, I am running on Nomad!
```

CoreDNS logs for that request:
```
[INFO] 30.0.1.70:57079 - 54434 "AAAA IN nginx-service.virtual.consul.cluster.local. udp 60 false 512" NXDOMAIN qr,aa,rd 153 0.000420553s
[INFO] 30.0.1.70:57079 - 54174 "A IN nginx-service.virtual.consul.cluster.local. udp 60 false 512" NXDOMAIN qr,aa,rd 153 0.000265927s
[INFO] 30.0.1.70:39432 - 35033 "A IN nginx-service.virtual.consul.default.svc.cluster.local. udp 72 false 512" NXDOMAIN qr,aa,rd 165 0.000224508s
[INFO] 30.0.1.70:39432 - 35303 "AAAA IN nginx-service.virtual.consul.default.svc.cluster.local. udp 72 false 512" NXDOMAIN qr,aa,rd 165 0.000090913s
[INFO] 30.0.1.70:53440 - 20961 "AAAA IN nginx-service.virtual.consul.svc.cluster.local. udp 64 false 512" NXDOMAIN qr,aa,rd 157 0.000257561s
[INFO] 30.0.1.70:53440 - 20712 "A IN nginx-service.virtual.consul.svc.cluster.local. udp 64 false 512" NXDOMAIN qr,aa,rd 157 0.000184247s
[INFO] 30.0.1.70:32838 - 11880 "A IN nginx-service.virtual.consul. udp 46 false 512" NXDOMAIN qr,rd,ra 121 0.006471083s
[INFO] 30.0.1.70:32838 - 12132 "AAAA IN nginx-service.virtual.consul. udp 46 false 512" NXDOMAIN qr,rd,ra 121 0.00661917s
```
Running the same curl again:

```
$ k exec -it pod/k8s-test-pod -c k8s-test-pod-container -- curl nginx-service.virtual.consul
Hello, I am running on Kubernetes!
```

CoreDNS logs for that request:
```
[INFO] 30.0.1.70:47717 - 3245 "A IN nginx-service.virtual.consul.default.svc.cluster.local. udp 72 false 512" NXDOMAIN qr,aa,rd 165 0.000243007s
[INFO] 30.0.1.70:47717 - 3553 "AAAA IN nginx-service.virtual.consul.default.svc.cluster.local. udp 72 false 512" NXDOMAIN qr,aa,rd 165 0.000565974s
[INFO] 30.0.1.70:60301 - 49101 "AAAA IN nginx-service.virtual.consul. udp 46 false 512" NXDOMAIN qr,rd,ra 121 0.006873433s
[INFO] 30.0.1.70:60301 - 48863 "A IN nginx-service.virtual.consul. udp 46 false 512" NXDOMAIN qr,rd,ra 121 0.057510109s
[INFO] 30.0.1.70:41297 - 15343 "AAAA IN nginx-service.virtual.consul.svc.cluster.local. udp 64 false 512" NXDOMAIN qr,aa,rd 157 0.000129459s
[INFO] 30.0.1.70:41297 - 15080 "A IN nginx-service.virtual.consul.svc.cluster.local. udp 64 false 512" NXDOMAIN qr,aa,rd 157 0.000063974s
[INFO] 30.0.1.70:42347 - 47519 "AAAA IN nginx-service.virtual.consul.cluster.local. udp 60 false 512" NXDOMAIN qr,aa,rd 153 0.000094362s
[INFO] 30.0.1.70:42347 - 47275 "A IN nginx-service.virtual.consul.cluster.local. udp 60 false 512" NXDOMAIN qr,aa,rd 153 0.000055806s
```
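If I read those CoreDNS lines correctly, the `...cluster.local.` lookups appear to be just the pod's search-domain expansion (ndots:5 in resolv.conf), so those NXDOMAINs are probably expected; what stands out to me is that even the absolute `nginx-service.virtual.consul.` query returned NXDOMAIN at this point (before the Corefile change described below). A rough way to separate the two cases (a sketch, again assuming `dig` is available in the container):

```shell
# With +search, dig walks the resolv.conf search list (default.svc.cluster.local, svc.cluster.local,
# cluster.local), producing the same kind of NXDOMAIN answers seen above before the absolute name is tried
kubectl exec -it pod/k8s-test-pod -c k8s-test-pod-container -- \
  dig +search nginx-service.virtual.consul

# With a trailing dot the name is absolute, so the search list is skipped and only one query is sent
kubectl exec -it pod/k8s-test-pod -c k8s-test-pod-container -- \
  dig nginx-service.virtual.consul.
```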
What I did
Added the following DNS block to the custom values.yaml file and re-applied it with Helm:
```yaml
dns:
  enabled: true
  enableRedirection: true
```
Updated the CoreDNS ConfigMap with the following stanza:
```
consul {
  errors
  cache 30
  forward . 10.97.111.170
}
```
10.97.111.170 is the ClusterIP of `service/consul-consul-dns`.
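To double-check that IP and that the service actually answers for `.consul` names, something like this should work (a sketch; it assumes the default `consul-consul-dns` service name in the `consul` namespace):

```shell
# Confirm the ClusterIP of the Consul DNS service
kubectl get svc consul-consul-dns -n consul -o jsonpath='{.spec.clusterIP}'

# Query it directly from the test pod to verify it resolves the virtual service name
kubectl exec -it pod/k8s-test-pod -c k8s-test-pod-container -- \
  dig @10.97.111.170 nginx-service.virtual.consul
```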
After that, I could curl continuously without the intermittent errors I saw before.
However, I then observed the following errors in the CoreDNS pod logs:
(30.0.1.118 is the IP of the CoreDNS pod.)
However, I still continuously get the error below when I check the logs with `k logs -f pod/k8s-test-pod -c consul-dataplane`.
I do not see IP 30.0.1.82 anywhere in K8s; I checked all namespaces.
I still get the following error as well:
But I get the result below when running `dig nginx-service.virtual.consul`:
I do not understand why this still happens even though the connection itself works fine.
My understanding of the flow: when we curl `nginx-service.virtual.consul` from a K8s pod, the query should first go to CoreDNS, and because the name is under the `.consul` domain, CoreDNS should forward it to the consul-dns service. From there the pod should get back the IP and port of the sidecar proxy container running alongside it, so the request goes to the sidecar, which then forwards it to the other (Nomad cluster's) sidecar. Please correct me if I am wrong.
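This is roughly how I have been trying to trace that flow step by step (a sketch; my assumption is that the name resolves to a Consul virtual IP, which the transparent-proxy iptables rules redirect into the local sidecar, but please correct that too):

```shell
# 1. DNS step: what address does the pod actually get back for the virtual name?
kubectl exec -it pod/k8s-test-pod -c k8s-test-pod-container -- \
  dig +short nginx-service.virtual.consul

# 2. Data path: run the request verbosely to see which address curl connects to
kubectl exec -it pod/k8s-test-pod -c k8s-test-pod-container -- \
  curl -v nginx-service.virtual.consul

# 3. Sidecar: watch the dataplane/Envoy logs for the corresponding outbound connection
kubectl logs -f pod/k8s-test-pod -c consul-dataplane
```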
I am a bit stuck on understanding how this flow works and why DNS keeps giving this error even though I can successfully get results from either cluster.
I am sincerely looking for any assistance.
Thank you!