Adding this for future me.
In the end the problem wasn’t really Nomad.
With Hetzner Cloud, when you set up a private network, you need to pay attention to the name of the interface when you declare a host network, along these lines:
```hcl
client {
  host_network "hetzner" {
    interface      = "my-interface-name"
    cidr           = "10.0.0.0/24"
    reserved_ports = "22,80"
  }
}
```
This is because the interface name isn’t consistent across all machine types. For the older VMs it was `ens10`, but the newer machines use a different naming scheme:
| Network | CX, CCX*1 | CPX, CAX, CCX2, CCX3 |
|---|---|---|
| First attached network | `ens10` | `enp7s0` |
| Additional interfaces (second) | `ens11` | `enp8s0` |
| Additional interfaces (third) | `ens12` | `enp9s0` |
This is documented on the corresponding page in Hetzner’s docs.
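If you’re not sure which name applies to a particular machine, the simplest check is to list the interfaces on the host itself. A quick sketch of what I mean, assuming the standard iproute2 tools are installed (they were on my Hetzner machines):

```shell
# Brief listing of all interfaces on the host - the private network
# interface shows up alongside eth0, lo, docker0, etc.
ip -br link

# Or find the interface that actually holds the private 10.0.0.0/24 address.
ip -br addr | grep '10\.0\.0\.'
```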
Looking back, what would have saved me lots of heartache
I’ve learned that Nomad lets you set up a host network with a named interface that doesn’t exist on the host machine, and it gives you no indication that you’re pointing at a non-existent network interface.
So this:
```hcl
client {
  host_network "hetzner" {
    interface      = "ens12345"
    cidr           = "10.0.0.0/24"
    reserved_ports = "22,80"
  }
}
```
will show up in the node info output when you call:

```shell
nomad node status -verbose <node_id>
```
And in the output you’ll see something like this:

```
Host Networks
Name     CIDR         Interface  ReservedPorts
hetzner  10.0.0.0/24  ens12345   22,80
```
Which can give the impression that everything is working - except that when you actually try to place a job, you’ll see output like this:
```
Scheduler dry-run:
- WARNING: Failed to place all allocations.
  Task Group "<MY_JOB_NAME>" (failed to place 1 allocation):
    * Constraint "missing host network \"hetzner\" for port \"http\"": 1 nodes excluded by filter
```
The host network isn’t missing - it’s clearly there! I think it’s more a case that the IP address allocated to the node can’t be determined, and as a result the node fails the test against the constraint.
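Something else that might have sidestepped the interface-name problem entirely: the Nomad docs describe `interface` as an optional filter, so you should be able to declare the host network by `cidr` alone and let Nomad pick whichever interface holds an address in that range. I haven’t gone back and verified this against the Hetzner setup, so treat it as a sketch rather than a confirmed fix:

```hcl
client {
  host_network "hetzner" {
    # No interface filter - match whichever interface has an
    # address inside the private network's CIDR on each machine.
    cidr           = "10.0.0.0/24"
    reserved_ports = "22,80"
  }
}
```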
Having some logs when a Nomad client starts up, warning that a host_network interface could not be determined, would really help, because a bunch of network output gets logged on startup anyway. Here are some of the start-up logs:
```
Sep 15 14:23:32 app3.my-org.org nomad[80209]: 2023-09-15T14:23:32.853Z [WARN] client.fingerprint_mgr.network: unable to parse speed: path=/usr/sbin/ethtool device=eth0
Sep 15 14:23:32 app3.my-org.org nomad[80209]: 2023-09-15T14:23:32.853Z [WARN] client.fingerprint_mgr.network: unable to parse speed: path=/usr/sbin/ethtool device=lo
Sep 15 14:23:32 app3.my-org.org nomad[80209]: 2023-09-15T14:23:32.856Z [WARN] client.fingerprint_mgr.network: unable to parse speed: path=/usr/sbin/ethtool device=eth0
Sep 15 14:23:32 app3.my-org.org nomad[80209]: 2023-09-15T14:23:32.861Z [WARN] client.fingerprint_mgr.network: unable to parse speed: path=/usr/sbin/ethtool device=enp7s0
Sep 15 14:23:32 app3.my-org.org nomad[80209]: 2023-09-15T14:23:32.864Z [WARN] client.fingerprint_mgr.network: unable to parse speed: path=/usr/sbin/ethtool device=docker0
```
Seeing the interface name is what led me to the fix in the end.
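If you’re digging through the journal for the same clue, the fingerprint lines are easy to pull out - assuming Nomad runs as a systemd unit called `nomad`, as it did for me:

```shell
# Show the network fingerprinting lines from the Nomad client's journal.
journalctl -u nomad --no-pager | grep 'fingerprint_mgr.network'
```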
I hope this helps someone else in future (or even future me, next time I’m troubleshooting network issues…)