What is the preferred signal to kill a zombie allocation dead and not have it rescheduled? My allocations attempt their 3 restarts before failing for real. Basing my curl calls on https://www.nomadproject.io/api/allocations.html#signal-allocation, firing off a SIGINT three times (once per restart attempt) appears to work, sort of.
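For reference, a sketch of what that curl workflow looks like against the signal-allocation endpoint, followed by the stop-allocation endpoint (available since Nomad 0.9.2). `NOMAD_ADDR` and `ALLOC_ID` are placeholder assumptions, and the actual curl calls are commented out so this only prints the requests it would send:

```shell
#!/bin/sh
# Sketch: signal, then stop, a stuck allocation via the HTTP API.
# NOMAD_ADDR and ALLOC_ID are placeholder assumptions.
NOMAD_ADDR="${NOMAD_ADDR:-http://127.0.0.1:4646}"
ALLOC_ID="${ALLOC_ID:-REPLACE_WITH_ALLOC_ID}"

# 1. Ask the allocation's tasks to exit gracefully.
SIGNAL_URL="${NOMAD_ADDR}/v1/client/allocation/${ALLOC_ID}/signal"
echo "POST ${SIGNAL_URL} {\"Signal\":\"SIGINT\"}"
# curl -s -X POST -d '{"Signal":"SIGINT"}' "${SIGNAL_URL}"

# 2. If the allocation lingers, stop it outright (Nomad >= 0.9.2).
STOP_URL="${NOMAD_ADDR}/v1/allocation/${ALLOC_ID}/stop"
echo "POST ${STOP_URL}"
# curl -s -X POST "${STOP_URL}"
```

Whether the stopped allocation is replaced afterwards depends on the job's reschedule settings, not on the signal used.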
this is in reference to this bug: https://github.com/hashicorp/nomad/issues/5363
restart {
  attempts = 3
  delay    = "10s"
  interval = "90s"
  mode     = "fail"
}

meta {
  version = "${version_label}"
  region  = "${aws_region}"
  service = "api"
}

task "api" {
  driver = "docker"

  config {
    image       = "amazon-account.dkr.ecr.${aws_region}.amazonaws.com/company/api:${version_label}"
    force_pull  = true
    dns_servers = ["$${NOMAD_IP_http}"]

    logging {
      type = "awslogs"
      config {
        awslogs-region       = "${aws_region}"
        awslogs-group        = "/nomad/jobs/${subdomain}-api-${datacenter}"
        awslogs-create-group = true
      }
    }
  }
}
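Since the goal is for the allocation to stay dead rather than be replaced, it may also be worth noting that rescheduling is controlled separately from the `restart` stanza. A sketch (not part of the original job file) of a group-level `reschedule` stanza that disables rescheduling entirely:

```hcl
# Assumption: placed at the group level alongside the restart stanza.
reschedule {
  attempts  = 0      # never reschedule a failed allocation
  unlimited = false  # required when attempts is set explicitly
}
```

With `restart { mode = "fail" }` plus this stanza, an allocation that exhausts its 3 restarts fails permanently instead of being placed again.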
Stumbled into this issue as well. Could it be an issue with the Docker driver?
Nomad v0.9.3 (c5e8b66c3789e4e7f9a83b4e188e9a937eea43ce)
Docker version 18.09.7, build 2d0083d
Ubuntu 18.04.2 LTS (GNU/Linux 4.15.0-1043-aws x86_64)
Can confirm that killing the job itself and re-applying works. Haven't tried rebooting the Nomad client itself, as that would cause a prod outage.
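A sketch of that "kill and re-apply" workaround with the CLI. The job name and job file are placeholder assumptions, and a `DRY_RUN` guard (on by default) prints the commands instead of touching a cluster:

```shell
#!/bin/sh
# Hedged sketch of the kill-and-re-apply workaround.
# JOB and JOBFILE are placeholder assumptions.
DRY_RUN="${DRY_RUN:-1}"
JOB="${JOB:-api}"
JOBFILE="${JOBFILE:-api.nomad}"

run() {
  echo "+ $*"                   # show the command
  [ "$DRY_RUN" = "1" ] || "$@"  # execute only when DRY_RUN=0
}

run nomad job stop -purge "$JOB"  # stop the job and purge it, zombie alloc included
run nomad job run "$JOBFILE"      # re-submit the same job file
```

`-purge` removes the job from Nomad's state entirely, which is what lets the re-submit start from a clean slate.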
After looking at the syslog on the affected Nomad client, I see this line throughout the log:
Jul 22 13:54:31 ip-10-132-35-14 dockerd[1048]: time="2019-07-22T13:54:31.955402151Z" leve