We’re new to Nomad and still figuring out how to set up jobs so that unhealthy tasks are properly restarted. We use the Docker image thobe/fail_service to simulate a service that starts up healthy, stays healthy for 20 seconds, and then turns unhealthy for the next 120 seconds.
Nomad correctly detects that the service is unhealthy and restarts the task. We expect Nomad to keep restarting the task until it reaches the limit set in restart > attempts. However, it only ever restarts the task once and then ignores the fact that the service is unhealthy.
As a workaround, setting restart > attempts to 0 at least causes the task to be rescheduled, and that continues to work.
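For reference, the workaround amounts to roughly the following restart block. This is only a sketch: apart from attempts, the values are simply the ones from our job below. We assume that mode = "fail" is what hands the failed allocation over to the reschedule settings.

```hcl
# Workaround sketch: no local restarts, so the first failed health check
# fails the allocation and the job-level reschedule block takes over.
restart {
  attempts = 0      # changed from 3
  interval = "30s"  # unchanged from our job definition
  delay    = "5s"   # unchanged from our job definition
  mode     = "fail" # assumption: needed so the allocation is marked failed and rescheduled
}
```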
This is our job definition:
job "fail-service" {
  datacenters = ["interxion"]
  type = "service"
  reschedule {
    delay = "15s"
    delay_function = "constant"
    unlimited = true
  }
  group "api" {
    count = 1
    restart {
      attempts = 3
      interval = "30s"
      delay = "5s"
      mode = "fail"
    }
    network {
      mode = "bridge"
      port "http" {
        to = 8080
      }
    }
    service = {
      name = "fail-service-nomad"
      port = "http"
      check {
        type = "http"
        port = "http"
        path = "/health"
        interval = "10s"
        timeout = "2s"
        check_restart {
          limit = 1
          grace = "10s"
          ignore_warnings = false
        }
      }
    }
    task "main" {
      driver = "docker"
      config {
        image = "thobe/fail_service:v0.1.0"
        ports = ["http"]
      }
      env = {
        HEALTHY_FOR = 20
        UNHEALTHY_FOR = 120
      }
      resources = {
        cpu = 100
        memory = 128
      }
    }
  }
}
This is an example allocation status:
$ nomad alloc status 2a73772c
ID                  = 2a73772c-59bc-982f-1464-d5a7cf2efeaf
Eval ID             = 32cc154a
Name                = fail-service.api[0]
Node ID             = 77ae2919
Node Name           = nomad-client01
Job ID              = fail-service
Job Version         = 0
Client Status       = running
Client Description  = Tasks are running
Desired Status      = run
Desired Description = <none>
Created             = 10m47s ago
Modified            = 10m15s ago
Deployment ID       = 0d3f5d5f
Deployment Health   = healthy

Allocation Addresses (mode = "bridge")
Label  Dynamic  Address
*http  yes      10.0.0.10:23844 -> 8080

Task "main" is "running"
Task Resources
CPU        Memory           Disk     Addresses
0/100 MHz  1.4 MiB/128 MiB  300 MiB

Task Events:
Started At     = 2020-10-23T11:32:18Z
Finished At    = N/A
Total Restarts = 1
Last Restart   = 2020-10-23T13:32:12+02:00

Recent Events:
Time                       Type              Description
2020-10-23T13:32:18+02:00  Started           Task started by client
2020-10-23T13:32:12+02:00  Restarting        Task restarting in 5.341617655s
2020-10-23T13:32:11+02:00  Terminated        Exit Code: 2, Exit Message: "Docker container exited with non-zero exit code: 2"
2020-10-23T13:32:11+02:00  Restart Signaled  healthcheck: check "service: \"fail-service-nomad\" check" unhealthy
2020-10-23T13:31:49+02:00  Started           Task started by client
2020-10-23T13:31:47+02:00  Task Setup        Building Task Directory
2020-10-23T13:31:46+02:00  Received          Task received by client
We’re running Nomad 0.12.6 and Consul 1.8.4 (both with TLS and ACLs enabled). Any help is appreciated!