What is the preferred signal to kill a zombie allocation dead and not have it rescheduled? My allocations attempt their 3 restarts before failing for real. Basing my curl calls on https://www.nomadproject.io/api/allocations.html#signal-allocation, firing off a SIGINT three times (once per restart attempt) appears to work, sort of.
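For reference, a sketch of what that curl workflow looks like against the signal-allocation endpoint, followed by the stop-allocation endpoint (available since Nomad 0.9.2). `NOMAD_ADDR` and `ALLOC_ID` are placeholder assumptions, and the actual curl calls are commented out so this only prints the requests it would send:

```shell
#!/bin/sh
# Sketch: signal, then stop, a stuck allocation via the HTTP API.
# NOMAD_ADDR and ALLOC_ID are placeholder assumptions.
NOMAD_ADDR="${NOMAD_ADDR:-http://127.0.0.1:4646}"
ALLOC_ID="${ALLOC_ID:-REPLACE_WITH_ALLOC_ID}"

# 1. Ask the allocation's tasks to exit gracefully.
SIGNAL_URL="${NOMAD_ADDR}/v1/client/allocation/${ALLOC_ID}/signal"
echo "POST ${SIGNAL_URL} {\"Signal\":\"SIGINT\"}"
# curl -s -X POST -d '{"Signal":"SIGINT"}' "${SIGNAL_URL}"

# 2. If the allocation lingers, stop it outright (Nomad >= 0.9.2).
STOP_URL="${NOMAD_ADDR}/v1/allocation/${ALLOC_ID}/stop"
echo "POST ${STOP_URL}"
# curl -s -X POST "${STOP_URL}"
```

Whether the stopped allocation is replaced afterwards depends on the job's reschedule settings, not on the signal used.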
this is in reference to this bug: https://github.com/hashicorp/nomad/issues/5363
restart {
  attempts = 3
  delay    = "10s"
  interval = "90s"
  mode     = "fail"
}

meta {
  version = "${version_label}"
  region  = "${aws_region}"
  service = "api"
}

task "api" {
  driver = "docker"

  config {
    image       = "amazon-account.dkr.ecr.${aws_region}.amazonaws.com/company/api:${version_label}"
    force_pull  = true
    dns_servers = ["$${NOMAD_IP_http}"]

    logging {
      type = "awslogs"
      config {
        awslogs-region       = "${aws_region}"
        awslogs-group        = "/nomad/jobs/${subdomain}-api-${datacenter}"
        awslogs-create-group = true
      }
    }
  }
}
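Since the goal is for the allocation to stay dead rather than be replaced, it may also be worth noting that rescheduling is controlled separately from the `restart` stanza. A sketch (not part of the original job file) of a group-level `reschedule` stanza that disables rescheduling entirely:

```hcl
# Assumption: placed at the group level alongside the restart stanza.
reschedule {
  attempts  = 0      # never reschedule a failed allocation
  unlimited = false  # required when attempts is set explicitly
}
```

With `restart { mode = "fail" }` plus this stanza, an allocation that exhausts its 3 restarts fails permanently instead of being placed again.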
Stumbled into this issue as well. Could it be an issue with the Docker driver?
Nomad v0.9.3 (c5e8b66c3789e4e7f9a83b4e188e9a937eea43ce)
Docker version 18.09.7, build 2d0083d
Ubuntu 18.04.2 LTS (GNU/Linux 4.15.0-1043-aws x86_64)
Can confirm that killing the job itself and re-applying works. Haven't tried rebooting the Nomad client itself, as that would cause a prod outage.
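A sketch of that "kill and re-apply" workaround with the CLI. The job name and job file are placeholder assumptions, and a `DRY_RUN` guard (on by default) prints the commands instead of touching a cluster:

```shell
#!/bin/sh
# Hedged sketch of the kill-and-re-apply workaround.
# JOB and JOBFILE are placeholder assumptions.
DRY_RUN="${DRY_RUN:-1}"
JOB="${JOB:-api}"
JOBFILE="${JOBFILE:-api.nomad}"

run() {
  echo "+ $*"                   # show the command
  [ "$DRY_RUN" = "1" ] || "$@"  # execute only when DRY_RUN=0
}

run nomad job stop -purge "$JOB"  # stop the job and purge it, zombie alloc included
run nomad job run "$JOBFILE"      # re-submit the same job file
```

`-purge` removes the job from Nomad's state entirely, which is what lets the re-submit start from a clean slate.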
After looking at the syslog on the affected Nomad client, I see this line throughout the log:
Jul 22 13:54:31 ip-10-132-35-14 dockerd[1048]: time="2019-07-22T13:54:31.955402151Z" leve