Restart tasks one by one

Hello,

I have a job similar to the following:

job "test {
  group "main" {
    count = 2
    update {
      max_parallel = 1
    }
    service {
      name = "srv"
      check { ... }
    }
    task "a" {
      driver = "docker"
       ...
       template {
         env = true
         data = "X={{ key "test/x" }}"
       }
    }
  }
}

Task “a” takes significant time to start up (for example, 10 minutes). That isn’t an issue during deployment, because instance “a1” handles requests until “a2” starts and passes its checks, and vice versa. But if the value of the config key “test/x” changes, or the “srv” check becomes unhealthy for both task allocations (a soft failure where a restart is recommended but not strictly required), I get 10 minutes of downtime until both tasks have restarted.

I’m looking for ways to avoid parallel restarts (similar to what happens during a deployment). It is easy to do outside of Nomad (for example, with a restart command like consul lock restart-lock restart-service-and-wait-healthy), but I am unable to find a simple way to do it with Nomad.

One possible solution that I can imagine is to change the application’s stop process: handle SIGTERM, check whether instance “a2” is restarting, and if it is, block the stop for 10 minutes (plus configure kill_timeout to 10 minutes). But this solution looks complex and error-prone. Are there easier ways to achieve the same result?
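For reference, the kill_timeout part of that workaround would look roughly like this at the task level (a minimal sketch; the SIGTERM coordination itself would still have to live in the application):

task "a" {
  driver = "docker"

  # Give the application's stop handler up to 10 minutes to block
  # before Nomad escalates from SIGTERM to SIGKILL.
  kill_timeout = "10m"
  ...
}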

Hi @valodzka :wave:

It doesn’t quite do what you need, but maybe a splay value could help reduce the downtime. It doesn’t guarantee that allocations won’t be restarting at the same time, though.
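Something along these lines in the template block, for example (a sketch only; the 10m value is just an illustration):

template {
  env  = true
  data = "X={{ key \"test/x\" }}"

  # Wait a random amount of time between 0 and 10 minutes before the
  # re-render triggers the change_mode (restart by default), so the two
  # allocations are less likely to restart at exactly the same moment.
  splay = "10m"
}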

Could you open a feature request for this in our repo?

Thank you!

Splay might work well with fast restarts, but if a restart takes 5 or 10 minutes I think it’s unreasonable: the value would have to be a few hours to make overlapping downtime reasonably rare.

Okay, I opened a feature request: coordinate restarts across clients (template rerender, check restart, etc.) · Issue #10920 · hashicorp/nomad · GitHub


100% agree. I just mentioned it as a (bad) mitigation.

Thank you!

There are other existing issues about this. Let’s continue in the one with the most explanation: Support re-rendering template expressions with no service disruption · Issue #6151 · hashicorp/nomad · GitHub