Designated "canary" hosts/datacenter

I’m investigating using Nomad as a deployment scheduler for some of our services. Our service is global, with multiple geographically distributed datacenters. One of my goals is to be able to designate one datacenter as the “canary” datacenter. Whenever we do an update, I want the deployment to replace all the running tasks in that datacenter first. Then once it passes validation, it can go out everywhere else.

Is there a way to accomplish something like this in Nomad?
After scanning through the docs, I can’t find any way to control deployment ordering, or to ensure that all tasks in the datacenter get updated. I don’t want the documented Nomad canary behavior where it launches new tasks, leaving the old ones in that datacenter running. I want the whole datacenter running the updated version while validation is performed.

Can this be done?

Hi @phemmer,

Thanks for using Nomad.

I think what you want to do is feasible. Blue/Green deployments can be used to make sure only one version is running at a time, and the old allocations get shut down. The trick to a blue/green is that the canary value matches the count value, which results in a switch from blue to green all in one step.
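A rough sketch of what I mean, with placeholder names and counts (I haven't tested this exact snippet):

```hcl
group "api" {
  count = 3

  update {
    canary       = 3      # canary equals count, so a full second copy is stood up
    auto_promote = false  # wait for an explicit promote after validation
    auto_revert  = true   # roll back if the new set never becomes healthy
  }

  # ... task definition ...
}
```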

In terms of targeting just one datacenter, I think constraints are what you want to use to manage that.
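For example, a constraint on the node's datacenter attribute (the datacenter name here is made up):

```hcl
constraint {
  attribute = "${node.datacenter}"
  value     = "dc1-canary"
}
```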

Does that meet your use case?

  • Derek

I personally have no need for this, but I’m just wondering how the canaries will limit themselves to “a specific DC”? :thinking:

I can understand it if the datacenters parameter were templatized somehow: a “canary” launch is done with a single value, and then, when everything seems fine, a launch is done with all the values.

(still not able to process this as a thought experiment) :slight_smile:

That’s a great question. Thanks for keeping me honest :grinning:

I haven’t actually run this, but my thought was that if you were handling this manually, you’d have a jobspec with the test DC as a constraint based on ${node.datacenter}, and another jobspec with the inverse constraint. When it’s time to upgrade, you change the jobspec that targets the test DC first, run that spec, and test the deploy. If it works, you update the other jobspec and then run that.
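Roughly, the two jobspecs would differ only in the constraint (untested sketch; "dc-test" is a placeholder name):

```hcl
# jobspec A: target only the test datacenter
constraint {
  attribute = "${node.datacenter}"
  operator  = "="
  value     = "dc-test"
}

# jobspec B: target everything except the test datacenter
constraint {
  attribute = "${node.datacenter}"
  operator  = "!="
  value     = "dc-test"
}
```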

If you wanted to handle this less manually and templatize things like the datacenter, version, etc., I think you could achieve that with Nomad Pack and a few variables.
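I haven't sketched out a full pack, but just to illustrate the parameterization idea with plain HCL2 job variables (Nomad Pack has its own template syntax; every name and value here is a placeholder):

```hcl
variable "target_dc" {
  type    = string
  default = "dc-test"
}

variable "image_tag" {
  type    = string
  default = "1.0.0"
}

job "api" {
  datacenters = [var.target_dc]

  group "api" {
    count = 3

    task "api" {
      driver = "docker"

      config {
        image = "example/api:${var.image_tag}"
      }
    }
  }
}
```

You could then run it against the test datacenter first with something like `nomad job run -var="target_dc=dc-test" -var="image_tag=1.0.1" api.nomad.hcl`, and re-run with the other values once it checks out.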

Also, I forgot to mention the max_parallel setting, which is important for blue/greens.
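In the sketch above, that would mean something like this (values are illustrative):

```hcl
update {
  max_parallel = 1      # how many allocations within the group are updated at a time
  canary       = 3      # matches count for a blue/green
  auto_promote = false
}
```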


I don’t know that I’m quite following this. According to the job specification docs, if you set count = canary, then

the new version of the group is deployed along side the existing set. While this duplicates the resources required during the upgrade process

So this would result in the old version and the new version running at the same time, which isn’t what I want.

And then yeah, I can’t see a way to ensure the new tasks get sent only to one datacenter.
The only way I can see to accomplish this is to create 2 “groups”: one with a constraint on the datacenter, and the other with the inverse constraint. Unfortunately this means we can’t use the promote functionality, and would instead have to modify the job spec to “promote”.
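Something like this is what I mean, as a rough sketch (names, counts, and versions are made up):

```hcl
job "api" {
  datacenters = ["dc1", "dc2", "dc3"]

  group "canary-dc" {
    count = 3

    constraint {
      attribute = "${node.datacenter}"
      value     = "dc1"              # the "canary" datacenter
    }

    task "api" {
      driver = "docker"

      config {
        image = "example/api:1.0.1"  # updated first
      }
    }
  }

  group "other-dcs" {
    count = 6

    constraint {
      attribute = "${node.datacenter}"
      operator  = "!="
      value     = "dc1"
    }

    task "api" {
      driver = "docker"

      config {
        image = "example/api:1.0.0"  # bumped by editing the jobspec, i.e. the manual "promote"
      }
    }
  }
}
```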

Yeah, for safety reasons it does require you to explicitly promote, which would then shut down the old version’s allocations:

“Once the operator is satisfied that the new version of the group is stable, the group can be promoted which will result in all allocations for the old versions of the group to be shutdown. This completes the upgrade from blue to green, or old to new version.”

Only one version should be active at a time, though, in case you were concerned that some requests might go to one version while others go to another. This allows an all-or-nothing switchover, and a fast rollback if you decide the new version isn’t working out. If your concern is resource utilization, then yeah, that is kind of the trade-off of the blue/green deployment model.

Your group idea is definitely better than the multiple-jobspec approach. Good call on that. Again, though, it would require the trade-off of “promoting” manually when you decide to deploy to the non-testing datacenters.

I’m guessing you don’t want any downtime, even in the testing datacenter. Is that true, or is downtime in the testing datacenter ok? In other words, is that a non-production environment that is able to tolerate downtime so that you don’t have 2x the number of resources running?