Consul: v1.20.4
Nomad: v1.9.6
Cluster Setup: 5-node Nomad + Consul cluster (staging environment)
When I power off all servers and restart them, some allocations fail to start due to the following constraint failure:
Constraint ${attr.consul.version} semver >= 1.8.0
Upon investigating the Nomad attributes on affected nodes, I noticed that the consul-connect attribute is missing on some servers. This issue appears to occur randomly, with different servers being affected on different reboots.
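To compare affected and healthy nodes, the Consul-related attributes can be listed with something like the sketch below. It assumes the attribute keys start with `consul.` in the verbose node status output; the helper name `consul_attrs` is hypothetical.

```shell
#!/bin/sh
# consul_attrs (hypothetical helper): read node status output on stdin and
# keep only the lines whose attribute key starts with "consul.".
consul_attrs() {
  grep -E '^[[:space:]]*consul\.'
}

# Example, run on the node itself:
#   nomad node status -self -verbose | consul_attrs
```

Running this on each node after a reboot and comparing the output should show which attributes are absent on the affected nodes.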
What I’ve Tried:
My nomad.service systemd unit file includes the following lines:
Wants=consul.service
After=consul.service
This should ensure Nomad starts only after Consul, but the issue persists.
If I modify the service file to include a bash script that waits for all Consul clients to be ready before starting Nomad, or even just add a fixed sleep delay, then:
The consul-connect attribute is always correctly assigned.
All allocations run successfully.
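A minimal sketch of such a wait step follows. It is a simplified variant: rather than waiting for all Consul clients, it only polls the local agent's `/v1/status/leader` endpoint until a leader is reported. The function names `wait_for` and `consul_has_leader`, the `--wait` argument, and the default agent address are all assumptions for illustration, and whether a leader check alone is enough for the attribute to be detected is untested here.

```shell
#!/bin/sh
# Sketch only: names and defaults below are illustrative assumptions.

# wait_for ATTEMPTS CMD [ARGS...]: run CMD up to ATTEMPTS times, sleeping one
# second between tries. Returns 0 as soon as CMD succeeds, 1 otherwise.
wait_for() {
  attempts="$1"
  shift
  while ! "$@"; do
    attempts=$((attempts - 1))
    [ "$attempts" -le 0 ] && return 1
    sleep 1
  done
}

# consul_has_leader: succeed once the local agent reports a non-empty leader.
# /v1/status/leader returns a JSON string, which is "" while no leader is known.
consul_has_leader() {
  leader="$(curl -fsS "${CONSUL_HTTP_ADDR:-http://127.0.0.1:8500}/v1/status/leader" 2>/dev/null)" || return 1
  [ -n "$leader" ] && [ "$leader" != '""' ]
}

# Only poll when called as: script --wait [attempts]
if [ "${1:-}" = "--wait" ]; then
  wait_for "${2:-60}" consul_has_leader
fi
```

A script like this could be run from an `ExecStartPre=` line in nomad.service in place of a fixed sleep, so that Nomad starts as soon as Consul answers instead of after an arbitrary delay.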
I did not observe this behavior when I tested similar scenarios some months ago, so it seems to have started with these specific versions of Nomad and Consul.
Has anyone else experienced this issue with recent Nomad/Consul versions?
Could this be a race condition where Nomad starts before Consul is fully functional, even though systemd dependencies are correctly set?
Is there a better way to ensure Nomad properly detects Consul Connect without relying on manual delays?
Any insights would be greatly appreciated!