I’m having an issue where clients fail to connect to a static port from outside my local networks. When I dig into the problem, conntrack doesn’t seem to understand what is happening in these cases and either drops or resets the connection, apparently depending on the topology between the client and the host. The issue is not sporadic: I can reproduce it 100% of the time, and I have a way to test both cases.
In the following dumps, app-proxy is the app’s Envoy sidecar, and app-host is veth0 as described in my interface config. I measured through the bridge described in my config.
❯ nomad --version
Nomad v1.3.3 (428b2cd8014c48ee9eae23f02712b7219da16d30)
❯ consul --version
Consul v1.13.1
Revision c6d0f9ec
Build Date 2022-08-11T19:07:00Z
Protocol 2 spoken by default, understands 2 to 3 (agent will automatically use protocol >2 when speaking to compatible agents)
routed.conntrack:
[NEW] tcp 6 120 SYN_SENT src=client dst=app-host sport=54616 dport=443 [UNREPLIED] src=app-proxy dst=client sport=443 dport=54616
[UPDATE] tcp 6 60 SYN_RECV src=client dst=app-host sport=54616 dport=443 src=app-proxy dst=client sport=443 dport=54616
[UPDATE] tcp 6 432000 ESTABLISHED src=client dst=app-host sport=54616 dport=443 src=app-proxy dst=client sport=443 dport=54616 [ASSURED]
[UPDATE] tcp 6 120 FIN_WAIT src=client dst=app-host sport=54616 dport=443 src=app-proxy dst=client sport=443 dport=54616 [ASSURED]
[UPDATE] tcp 6 30 LAST_ACK src=client dst=app-host sport=54616 dport=443 src=app-proxy dst=client sport=443 dport=54616 [ASSURED]
[UPDATE] tcp 6 120 TIME_WAIT src=client dst=app-host sport=54616 dport=443 src=app-proxy dst=client sport=443 dport=54616 [ASSURED]
tls.conntrack:
[NEW] tcp 6 120 SYN_SENT src=client dst=app-host sport=57600 dport=443 [UNREPLIED] src=app-proxy dst=client sport=443 dport=57600
[UPDATE] tcp 6 60 SYN_RECV src=client dst=app-host sport=57600 dport=443 src=app-proxy dst=client sport=443 dport=57600
[NEW] tcp 6 120 SYN_SENT src=client dst=app-host sport=57601 dport=443 [UNREPLIED] src=app-proxy dst=client sport=443 dport=57601
[UPDATE] tcp 6 60 SYN_RECV src=client dst=app-host sport=57601 dport=443 src=app-proxy dst=client sport=443 dport=57601
[UPDATE] tcp 6 432000 ESTABLISHED src=client dst=app-host sport=57600 dport=443 src=app-proxy dst=client sport=443 dport=57600 [ASSURED]
[NEW] tcp 6 300 ESTABLISHED src=app-host dst=client sport=443 dport=57600 [UNREPLIED] src=client dst=app-host sport=57600 dport=220
[UPDATE] tcp 6 432000 ESTABLISHED src=client dst=app-host sport=57601 dport=443 src=app-proxy dst=client sport=443 dport=57601 [ASSURED]
[NEW] tcp 6 300 ESTABLISHED src=app-host dst=client sport=443 dport=57601 [UNREPLIED] src=client dst=app-host sport=57601 dport=233
unrouted.conntrack:
[NEW] tcp 6 120 SYN_SENT src=client dst=app-host sport=54600 dport=443 [UNREPLIED] src=app-proxy dst=client sport=443 dport=54600
[UPDATE] tcp 6 60 SYN_RECV src=client dst=app-host sport=54600 dport=443 src=app-proxy dst=client sport=443 dport=54600
[UPDATE] tcp 6 432000 ESTABLISHED src=client dst=app-host sport=54600 dport=443 src=app-proxy dst=client sport=443 dport=54600 [ASSURED]
[UPDATE] tcp 6 120 FIN_WAIT src=client dst=app-host sport=54600 dport=443 src=app-proxy dst=client sport=443 dport=54600 [ASSURED]
[UPDATE] tcp 6 30 LAST_ACK src=client dst=app-host sport=54600 dport=443 src=app-proxy dst=client sport=443 dport=54600 [ASSURED]
[UPDATE] tcp 6 120 TIME_WAIT src=client dst=app-host sport=54600 dport=443 src=app-proxy dst=client sport=443 dport=54600 [ASSURED]
vpn.conntrack:
[NEW] tcp 6 120 SYN_SENT src=client dst=app-host sport=55264 dport=443 [UNREPLIED] src=app-proxy dst=client sport=443 dport=55264
[UPDATE] tcp 6 60 SYN_RECV src=client dst=app-host sport=55264 dport=443 src=app-proxy dst=client sport=443 dport=55264
[UPDATE] tcp 6 432000 ESTABLISHED src=client dst=app-host sport=55264 dport=443 src=app-proxy dst=client sport=443 dport=55264 [ASSURED]
[NEW] tcp 6 300 ESTABLISHED src=app-host dst=client sport=443 dport=55264 [UNREPLIED] src=client dst=app-host sport=55264 dport=407
[DESTROY] tcp 6 src=app-host dst=client sport=443 dport=55264 [UNREPLIED] src=client dst=app-host sport=55264 dport=407
[UPDATE] tcp 6 10 CLOSE src=client dst=app-host sport=55264 dport=443 src=app-proxy dst=client sport=443 dport=55264 [ASSURED]
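The odd entries in the tls and vpn dumps are the ones that show up as [NEW] while already ESTABLISHED and [UNREPLIED], with a reply-tuple dport (220, 233, 407) that does not match the original sport (443). A minimal Python sketch for picking those lines out of a dump, assuming the event format shown above; the function name `suspicious_entries` is hypothetical and not part of any tool:

```python
import re

# Matches key=value pairs such as src=client or dport=443.
PAIR = re.compile(r"(\w+)=(\S+)")

def suspicious_entries(lines):
    """Return NEW entries that start out ESTABLISHED and are still UNREPLIED.

    For each hit, also report whether the reply tuple's dport differs from
    the original tuple's sport, which suggests the source port was rewritten.
    """
    hits = []
    for line in lines:
        if not line.startswith("[NEW]"):
            continue
        if " ESTABLISHED " not in line or "[UNREPLIED]" not in line:
            continue
        pairs = PAIR.findall(line)
        # First four pairs: original direction. Next four: expected reply.
        orig, reply = dict(pairs[:4]), dict(pairs[4:8])
        hits.append({
            "orig": orig,
            "reply": reply,
            "port_rewritten": reply["dport"] != orig["sport"],
        })
    return hits
```

Feeding it the lines of tls.conntrack or vpn.conntrack (for example via `open("tls.conntrack").read().splitlines()`) should return only the entries whose original direction starts at app-host port 443.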
My network setup is kind of exotic, but I turned off my OpenStack setup that was running with multiple network namespaces, and now I’m down to a bridge with a bunch of physical NICs, one of which is connected to the internet, and a virtual NIC on the bridge that the host uses as its primary interface:
❯ uname -a
Linux core 5.10.0-15-sme-amd64 #1 SMP Debian 5.10.120-1 (2022-06-09) x86_64 GNU/Linux
❯ cat /etc/network/interfaces | tail -n 19
######################
# Bridge
######################
auto br-ext
iface br-ext inet static
bridge_ports enp1s0f0 enp1s0f1 enp33s0f0 enp33s0f1 enp34s0f0 enp34s0f1 enp34s0f2 enp34s0f3 enp97s0f0 enp97s0f1 enp97s0f2 enp97s0f3 enp98s0f0 enp98s0f1 enp98s0f2 enp98s0f3 enp99s0f0 enp99s0f1 veth0-p
address 192.168.1.250
netmask 255.255.255.0
pre-up ip link add veth0 type veth peer name veth0-p && ip link set veth0 address 01:01:01:01:01:01
up brctl stp $IFACE on
post-down ip link delete veth0
######################
# Primary Interface
######################
auto veth0
iface veth0 inet dhcp
This is connected to a router, which is behind another router that is built into my modem. I have forwarded the appropriate ports, and in dev mode things work fine. I’m now trying to configure Nomad, Consul and Vault for a staging environment. The big difference since I tried dev mode is that I started using Consul services rather than Nomad’s mesh. I imagine something about CNI, Nomad, Consul or Docker is misconfigured. I’ve turned off Docker’s iptables flag to debug this (for what it’s worth, I don’t believe I ever got it working in dev mode with Docker’s iptables enabled).
I am using nomad-pack to deploy, but that shouldn’t be relevant: I rendered the job and ran it without nomad-pack, with the same results. I have tcpdump and iptables data ready as well.
Here is the most basic job that fails (resources are over-provisioned but the server can definitely handle it):
job "front_end" {
  type = "service"
  region = "global"
  datacenters = ["dc1"]

  group "front_end" {
    network {
      mode = "bridge"
      port "https" {
        static = 443
        to = 443
      }
    }

    task "react" {
      driver = "docker"

      vault {
        policies = ["kv"]
      }

      config {
        force_pull = true
        image = "some-image"
      }

      template {
        data = <<EOF
{{- with secret "kv/data/api_nginx_private_key" -}}
{{ .Data.data.value }}
{{- end -}}
EOF
        destination = "secrets/nginx-private-key.pem"
        change_mode = "signal"
        change_signal = "SIGHUP"
      }

      template {
        data = <<EOF
{{- with secret "kv/data/api_nginx_certificate" -}}
{{ .Data.data.value }}
{{- end -}}
EOF
        destination = "secrets/nginx-certificate.pem"
        change_mode = "signal"
        change_signal = "SIGHUP"
      }

      resources {
        cpu = 2000
        memory = 3072
      }
    }
  }
}
/etc/docker/daemon.json
{
  "bip": "10.1.0.1/16",
  "dns": ["8.8.8.8", "8.8.4.4"],
  "iptables": false
}
❯ cat tls.host-external-bridge-view.tcpdump
21:54:10.205404 IP client.52676 > app-proxy.https: Flags [S], seq 419624508, win 65535, options [mss 1452,nop,wscale 6,nop,nop,TS val 32965802 ecr 0,sackOK,eol], length 0
21:54:10.205533 IP app-host.https > client.52676: Flags [S.], seq 180633708, ack 419624509, win 65160, options [mss 1460,sackOK,TS val 1118519366 ecr 32965802,nop,wscale 7], length 0
21:54:10.267621 IP client.52676 > app-proxy.https: Flags [.], ack 180633709, win 2070, options [nop,nop,TS val 32965870 ecr 1118519366], length 0
21:54:10.293478 IP client.52676 > app-proxy.https: Flags [P.], seq 0:323, ack 1, win 2070, options [nop,nop,TS val 32965879 ecr 1118519366], length 323
21:54:10.293548 IP app-host.https > client.52676: Flags [.], ack 324, win 507, options [nop,nop,TS val 1118519454 ecr 32965879], length 0
21:54:10.293912 IP app-host.https > client.52676: Flags [P.], seq 1:100, ack 324, win 507, options [nop,nop,TS val 1118519455 ecr 32965879], length 99
21:54:10.530491 IP client.52676 > app-proxy.https: Flags [P.], seq 0:323, ack 1, win 2070, options [nop,nop,TS val 32966116 ecr 1118519366], length 323
21:54:10.530543 IP app-host.https > client.52676: Flags [.], ack 324, win 507, options [nop,nop,TS val 1118519691 ecr 32966116,nop,nop,sack 1 {1:324}], length 0
21:54:10.569400 IP app-host.https > client.52676: Flags [P.], seq 1:100, ack 324, win 507, options [nop,nop,TS val 1118519730 ecr 32966116], length 99
21:54:10.857403 IP app-host.https > client.52676: Flags [P.], seq 1:100, ack 324, win 507, options [nop,nop,TS val 1118520018 ecr 32966116], length 99
21:54:10.922469 IP client.52676 > app-proxy.https: Flags [P.], seq 0:323, ack 1, win 2070, options [nop,nop,TS val 32966521 ecr 1118519366], length 323
21:54:10.922529 IP app-host.https > client.52676: Flags [.], ack 324, win 507, options [nop,nop,TS val 1118520083 ecr 32966521,nop,nop,sack 1 {1:324}], length 0
21:54:11.401433 IP app-host.https > client.52676: Flags [P.], seq 1:100, ack 324, win 507, options [nop,nop,TS val 1118520562 ecr 32966521], length 99
21:54:11.534554 IP client.52676 > app-proxy.https: Flags [P.], seq 0:323, ack 1, win 2070, options [nop,nop,TS val 32967130 ecr 1118519366], length 323
21:54:11.534613 IP app-host.https > client.52676: Flags [.], ack 324, win 507, options [nop,nop,TS val 1118520695 ecr 32967130,nop,nop,sack 1 {1:324}], length 0
21:54:11.612472 IP client.52676 > app-proxy.https: Flags [F.], seq 323, ack 1, win 2070, options [nop,nop,TS val 32967210 ecr 1118519366], length 0
21:54:11.612722 IP app-host.https > client.52676: Flags [F.], seq 100, ack 325, win 507, options [nop,nop,TS val 1118520773 ecr 32967210], length 0
21:54:12.489446 IP app-host.https > client.52676: Flags [FP.], seq 1:100, ack 325, win 507, options [nop,nop,TS val 1118521650 ecr 32967210], length 99
21:54:12.553046 IP client.52676 > app-proxy.https: Flags [FP.], seq 0:323, ack 1, win 2070, options [nop,nop,TS val 32968146 ecr 1118519366], length 323
21:54:12.553179 IP app-host.https > client.52676: Flags [.], ack 325, win 507, options [nop,nop,TS val 1118521714 ecr 32968146,nop,nop,sack 1 {1:325}], length 0
❯ cat tls.client-primary-interface-view.tcpdump
21:54:10.059007 IP client.52676 > app-host.https: Flags [S], seq 419624508, win 65535, options [mss 1460,nop,wscale 6,nop,nop,TS val 32965802 ecr 0,sackOK,eol], length 0
21:54:10.127439 IP app-host.https > client.52676: Flags [S.], seq 180633708, ack 419624509, win 65160, options [mss 1452,sackOK,TS val 1118519366 ecr 32965802,nop,wscale 7], length 0
21:54:10.127526 IP client.52676 > app-host.https: Flags [.], ack 1, win 2070, options [nop,nop,TS val 32965870 ecr 1118519366], length 0
21:54:10.135609 IP client.52676 > app-host.https: Flags [P.], seq 1:324, ack 1, win 2070, options [nop,nop,TS val 32965879 ecr 1118519366], length 323
21:54:10.372586 IP client.52676 > app-host.https: Flags [P.], seq 1:324, ack 1, win 2070, options [nop,nop,TS val 32966116 ecr 1118519366], length 323
21:54:10.777831 IP client.52676 > app-host.https: Flags [P.], seq 1:324, ack 1, win 2070, options [nop,nop,TS val 32966521 ecr 1118519366], length 323
21:54:11.387200 IP client.52676 > app-host.https: Flags [P.], seq 1:324, ack 1, win 2070, options [nop,nop,TS val 32967130 ecr 1118519366], length 323
21:54:11.466911 IP client.52676 > app-host.https: Flags [F.], seq 324, ack 1, win 2070, options [nop,nop,TS val 32967210 ecr 1118519366], length 0
21:54:12.402688 IP client.52676 > app-host.https: Flags [FP.], seq 1:324, ack 1, win 2070, options [nop,nop,TS val 32968146 ecr 1118519366], length 323
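A quick way to compare the two views is to count packets per source host in each capture. The following is a minimal Python sketch, assuming the text format printed by tcpdump above; `packets_by_source` and `compare_views` are hypothetical helper names:

```python
import re
from collections import Counter

# Captures the source and destination host, dropping the trailing ".port".
LINE = re.compile(r"IP (\S+)\.\w+ > (\S+)\.\w+: Flags")

def packets_by_source(lines):
    """Count how many packets each source host sent in one capture."""
    counts = Counter()
    for line in lines:
        match = LINE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts

def compare_views(host_lines, client_lines):
    """Return {source: (seen on host side, seen on client side)}."""
    host = packets_by_source(host_lines)
    client = packets_by_source(client_lines)
    return {src: (host[src], client[src]) for src in set(host) | set(client)}
```

Run over the two captures above, this should report 8 packets from client in both views, but 12 packets from app-host on the host side against only 1 (the SYN-ACK) on the client side.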
Help? It seems like the traffic is using the static port correctly, but conntrack is expecting to see something on a dynamic port.
I’m also now realizing that I may be able to run various combinations of Consul and Nomad in dev mode with my current config to try to figure this out. However, I don’t know enough about the software yet (I started using these tools just over a week ago) to know what might be going on, and I worry that I’d get incorrect results that would ultimately just slow me down.
I took quite a bit of care collecting this data, but it’s possible I made a mistake. If something seems incorrect, let me know and I’ll do another dump before debugging.
Thanks in advance!