Possible bug: Constraint violation

I am running Nomad 1.0.4 and recently found a constraint violation for one of our jobs. The job has the following constraint stanza:

  constraint {
    operator  = "distinct_property"
    attribute = "${node.datacenter}"
    value     = "1"
  }

It’s a singleton job that should run only once per datacenter. Our datacenters are connected to the server cluster via a VPN tunnel. One day we had an issue with the VPN tunnel where it flapped constantly for about 6 hours, which caused the Nomad clients at the affected datacenter to repeatedly disconnect and reconnect. The jobs at that datacenter were rescheduled every time the clients reconnected. Once the VPN tunnel stabilized, we noticed that Nomad was running 2 allocations of this job, even though the constraint stanza specifies only 1 per datacenter. Has anyone seen this before? I wonder if we hit an edge-case bug with Nomad constraints; it’s likely hard to reproduce given that the VPN tunnel was flapping for 6 hours.

Architecture-wise, if we want our jobs at various datacenters (physical facilities) to be resilient to network outages between the datacenter and the Nomad server cluster, and to not restart continually when the connection flaps, should we run a Nomad server cluster at each datacenter?

Hi @johnnyplaydrums

Thanks for using Nomad. I’m sorry to hear that you are experiencing challenges. If I understand what you are reporting correctly, it should definitely be considered a bug. Feel free to log an issue on GitHub with as much data as you can provide. As you stated, it could be difficult to troubleshoot or reproduce, but that doesn’t mean it isn’t worth logging.

In terms of your architecture question, you could take that approach if you experience frequent connection loss or latency between your datacenters and the servers that results in nodes being marked down. Nomad uses a heartbeat mechanism to maintain a centralized view of which client nodes are healthy and operational. When workloads run over an unstable network connection, Nomad clients may fail to heartbeat within the configured threshold. If a client node fails its heartbeat check:

  • The client node status is set to down
  • All its allocations are marked as lost
  • The scheduler will queue replacement evaluations for the now-lost allocations
  • Allocations that were running on the disconnected client will, by default, continue to run there, but can optionally be configured to stop on that client after a user-specified timeout (see the sketch below)
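
The timeout in that last point is the group-level stop_after_client_disconnect parameter. Here is a minimal sketch of how it might look in a job like yours; the group, task, and image names and the 30-minute duration are placeholders you would tune for your environment:

  group "singleton" {
    count = 1

    # Keep running through short disconnects, but stop the allocation on the
    # disconnected client once it has been cut off longer than this
    # (placeholder value).
    stop_after_client_disconnect = "30m"

    task "app" {
      driver = "docker"
      config {
        image = "example/app:latest"   # placeholder image
      }
    }
  }

With something like this in place, a client that flaps briefly keeps its allocation, while a client that stays cut off eventually stops its local copy, which narrows the window where the original and the server-side replacement run at the same time.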

When the lost client reconnects, Nomad will either restart or remove the original allocations, depending on constraints and the state of the rest of the cluster. If that behavior is not what you want, you could consider running a separate server cluster in each datacenter and then federating them if they need to communicate.
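
If you do go the per-facility route, a rough sketch of the agent configuration for one of those servers is below; the region and datacenter names, the bootstrap_expect count, and the heartbeat_grace value are all placeholders. Separate regions can then be federated by pointing a server in one region at a server in the other with nomad server join.

  # Hypothetical agent config for a dedicated server cluster in one facility.
  region     = "east"       # placeholder region name for this facility
  datacenter = "east-1"     # placeholder datacenter name

  server {
    enabled          = true
    bootstrap_expect = 3    # placeholder server count for this facility

    # Optionally give clients more slack before a missed heartbeat marks them
    # down; the value here is a placeholder and the default is much shorter.
    heartbeat_grace = "5m"
  }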

I will also mention that this feature request issue exists, and we are using it to consider ways we can improve the experience for users running workloads with this sort of deployment profile. Feel free to comment on that issue as well if you would like your input considered. We’d love to hear from you!

Cheers,

Derek