Unhealthy task is only restarted once despite restart policy

We’re new to Nomad and still figuring out how to set up jobs so that unhealthy tasks are properly restarted. We use the Docker image thobe/fail_service to simulate a service that starts up healthy, stays healthy for 20 seconds, and then turns unhealthy for the next 120 seconds.
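
To reproduce the flapping service outside of Nomad, something along these lines should work, assuming the image honors the HEALTHY_FOR/UNHEALTHY_FOR environment variables and serves /health on port 8080, as configured in the job below:

$ docker run --rm -e HEALTHY_FOR=20 -e UNHEALTHY_FOR=120 -p 8080:8080 thobe/fail_service:v0.1.0
$ curl -i http://localhost:8080/health    # healthy for ~20s after startup, then unhealthy for ~120s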

Nomad correctly detects that the service is unhealthy and restarts the task, and we expect it to keep restarting the task until restart > attempts is exhausted. However, it only ever restarts the task once and then ignores the fact that the service is still unhealthy.

As a workaround, setting restart > attempts to 0 at least gets the task rescheduled, and rescheduling continues to work.
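
Concretely, the workaround is just the restart stanza from the job below with attempts set to 0 (the unlimited, constant-delay reschedule stanza is kept as-is); roughly:

restart {
  # Workaround: don't restart the task in place at all; mark the
  # allocation as failed so the reschedule policy takes over.
  attempts = 0
  interval = "30s"
  delay = "5s"
  mode = "fail"
}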

This is our job definition:

job "fail-service" {
  datacenters = ["interxion"]
  type = "service"

  reschedule {
    delay = "15s"
    delay_function = "constant"
    unlimited = true
  }

  group "api" {
    count = 1

    restart {
      attempts = 3
      interval = "30s"
      delay = "5s"
      mode = "fail"
    }

    network {
      mode = "bridge"
      port "http" {
        to = 8080
      }
    }

    service {
      name = "fail-service-nomad"
      port = "http"

      check {
        type = "http"
        port = "http"
        path = "/health"
        interval = "10s"
        timeout = "2s"

        check_restart {
          limit = 1
          grace = "10s"
          ignore_warnings = false
        }
      }
    }

    task "main" {
      driver = "docker"

      config {
        image = "thobe/fail_service:v0.1.0"
        ports = ["http"]
      }

      env {
        HEALTHY_FOR = "20"
        UNHEALTHY_FOR = "120"
      }

      resources {
        cpu = 100
        memory = 128
      }
    }
  }
}

This is an example allocation status:

$ nomad alloc status 2a73772c
ID                  = 2a73772c-59bc-982f-1464-d5a7cf2efeaf
Eval ID             = 32cc154a
Name                = fail-service.api[0]
Node ID             = 77ae2919
Node Name           = nomad-client01
Job ID              = fail-service
Job Version         = 0
Client Status       = running
Client Description  = Tasks are running
Desired Status      = run
Desired Description = <none>
Created             = 10m47s ago
Modified            = 10m15s ago
Deployment ID       = 0d3f5d5f
Deployment Health   = healthy

Allocation Addresses (mode = "bridge")
Label  Dynamic  Address
*http  yes      10.0.0.10:23844 -> 8080

Task "main" is "running"
Task Resources
CPU        Memory           Disk     Addresses
0/100 MHz  1.4 MiB/128 MiB  300 MiB  

Task Events:
Started At     = 2020-10-23T11:32:18Z
Finished At    = N/A
Total Restarts = 1
Last Restart   = 2020-10-23T13:32:12+02:00

Recent Events:
Time                       Type              Description
2020-10-23T13:32:18+02:00  Started           Task started by client
2020-10-23T13:32:12+02:00  Restarting        Task restarting in 5.341617655s
2020-10-23T13:32:11+02:00  Terminated        Exit Code: 2, Exit Message: "Docker container exited with non-zero exit code: 2"
2020-10-23T13:32:11+02:00  Restart Signaled  healthcheck: check "service: \"fail-service-nomad\" check" unhealthy
2020-10-23T13:31:49+02:00  Started           Task started by client
2020-10-23T13:31:47+02:00  Task Setup        Building Task Directory
2020-10-23T13:31:46+02:00  Received          Task received by client

We’re running Nomad 0.12.6 and Consul 1.8.4 (both with TLS and ACLs enabled). Any help is appreciated!

After chatting with some folks over at Gitter, it seems that this is most likely a bug, so I opened an issue at https://github.com/hashicorp/nomad/issues/9176.

Please consider this topic closed here :slight_smile: