Job restarts delay not working as expected

frank.wettstein · November 9, 2022, 7:46am

I have defined the following restart policy at the group level:

    restart {
      interval = "10m"
      attempts = 2
      delay    = "15s"
      mode     = "fail"
    }

But the 15-second delay is not applied; the task restarts about one second after the restart is announced:

Nov 09, '22 08:42:53 +0100	Started	Task started by client
Nov 09, '22 08:42:52 +0100	Restarting	Task restarting in 15.587610296s
Nov 09, '22 08:42:52 +0100	Terminated	Exit Code: 2, Exit Message: Docker container exited with non-zero exit code: 2
Nov 09, '22 08:42:51 +0100	Restart Signaled	healthcheck: check fail_service health using http endpoint '/health' unhealthy
Nov 09, '22 08:41:57 +0100	Started	Task started by client
Nov 09, '22 08:41:56 +0100	Restarting	Task restarting in 16.822710794s
Nov 09, '22 08:41:56 +0100	Terminated	Exit Code: 2, Exit Message: Docker container exited with non-zero exit code: 2

Is this a bug or is there something wrong in my configuration?

Tested with Nomad 1.4.1

Here is the whole job specification:

job "fail-service" {
  datacenters = ["isys_poc"]
 
  type = "service"
 
  group "fail-service" {
    count = 1
 
    network {
      port "http" {
        to = 8080
      }
    }
 
    task "fail-service" {
      driver = "docker"
      config {
        image = "thobe/fail_service:v0.0.12"
        ports = ["http"]
      }
 
      service {
        name = "${TASK}"
        port = "http"
        check {
          name     = "fail_service health using http endpoint '/health'"
          port     = "http"
          type     = "http"
          path     = "/health"
          method   = "GET"
          interval = "10s"
          timeout  = "2s"
        }
        tags = [
          "traefik.enable=true",
          "traefik.http.routers.fail-service.rule=Host(`fail-service.poc-nomad.intersys.internal`)",
        ]
      }
 
      env {
        HEALTHY_FOR = -1 # Stays healthy forever
      }
 
      resources {
        cpu    = 100 # MHz
        memory = 256 # MB
      }
    }
  }
}
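
For reference, a minimal sketch of how the restart block quoted at the top of the post would nest inside this job. Its placement directly under the group is an assumption based on the "group level" description, since the full specification above does not show it. This is a fragment for orientation, not a complete job file.

```hcl
job "fail-service" {
  group "fail-service" {
    # Restart policy from the top of the post. Placing it directly under
    # the group is an assumption; the full specification above omits it.
    restart {
      interval = "10m"
      attempts = 2
      delay    = "15s"
      mode     = "fail"
    }

    task "fail-service" {
      driver = "docker"
      # config, service, env, and resources as in the specification above
    }
  }
}
```

With this layout, the expectation described in the post is that each restart waits roughly the configured 15s, whereas the task events show the task starting again about one second after "Restarting".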

seth.hoenig · November 9, 2022, 8:14pm

Hi @frank.wettstein, at first glance it does seem like a bug. Do you mind opening a GitHub issue so we can track it?
