Restart tasks one by one

Hello,

I have a job similar to the following:

job "test {
  group "main" {
    count = 2
    update {
      max_parallel = 1
    }
    service {
      name = "srv"
      check { ... }
    }
    task "a" {
      driver = "docker"
       ...
       template {
         env = true
         data = "X={{ key "test/x" }}"
       }
    }
  }
}

Task “a” takes significant time to start up (for example, 10 minutes). That isn’t an issue during deployment, because instance “a1” handles requests until “a2” starts and passes its checks, and vice versa. But if the value of the config key “test/x” changes, or the “srv” check becomes unhealthy for both task allocations (a soft failure where a restart is recommended but not strictly required), I get 10 minutes of downtime until both tasks have restarted.

I’m looking for ways to avoid parallel restarts (similar to what happens during a deployment). It is easy to do outside of Nomad (for example, with a restart command like consul lock restart-lock restart-service-and-wait-healthy), but I am unable to find a simple way to do it with Nomad.

One possible solution that I can imagine is to change the application’s stop process: handle SIGTERM, check whether instance “a2” is restarting, and if it is, block the stop for 10 minutes (plus configure kill_timeout to 10 minutes). But this solution looks complex and error-prone. Are there easier ways to achieve the same result?
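For reference, the kill_timeout part of that workaround would look roughly like this at the task level (a minimal sketch; the SIGTERM coordination itself would still have to live in the application):

task "a" {
  driver = "docker"

  # Give the application's stop handler up to 10 minutes to block
  # before Nomad escalates from SIGTERM to SIGKILL.
  kill_timeout = "10m"
  ...
}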

Hi @valodzka :wave:

It doesn’t quite do what you need, but maybe a splay value could help reduce the downtime. It doesn’t guarantee that allocations won’t be restarting at the same time, though.
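Something along these lines in the template block, for example (a sketch only; the 10m value is just an illustration):

template {
  env  = true
  data = "X={{ key \"test/x\" }}"

  # Wait a random amount of time between 0 and 10 minutes before the
  # re-render triggers the change_mode (restart by default), so the two
  # allocations are less likely to restart at exactly the same moment.
  splay = "10m"
}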

Could you open a feature request for this in our repo?

Thank you!

Splay might work well with fast restarts, but if a restart takes 5 or 10 minutes I think it’s unreasonable: the value would have to be a few hours to make overlapping downtime reasonably rare.

Okay, I opened a feature request: coordinate restarts across clients (template rerender, check restart, etc.) · Issue #10920 · hashicorp/nomad · GitHub


100% agree. I just mentioned it as a (bad) mitigation.

Thank you!

There are other existing issues about this. Let’s continue in the one with the most explanation: Support re-rendering template expressions with no service disruption · Issue #6151 · hashicorp/nomad · GitHub