Job stuck in limbo, how to prevent this from happening?

Hi, I have a job stuck in limbo

ID            = sidekiq
Name          = sidekiq
Submit Date   = 2022-06-17T21:32:21-07:00
Type          = system
Priority      = 50
Datacenters   = main1
Namespace     = default
Status        = running
Periodic      = false
Parameterized = false

Summary
Task Group  Queued  Starting  Running  Failed  Complete  Lost  Unknown
sidekiq     0       0         0        3       0         0     0

Allocations
ID        Node ID   Task Group  Version  Desired  Status  Created     Modified
c9df38b5  c0314718  sidekiq     8        run      failed  30m48s ago  26m47s ago
2646c3e2  c0314718  sidekiq     8        run      failed  43m41s ago  37m18s ago
fded9750  c0314718  sidekiq     8        run      failed  1d2h ago    43m42s ago

Unfortunately this doesn’t reveal much about why it has stopped retrying. The job file looks something like this:

job "sidekiq" {
  datacenters = ["main1"]

  # One on each node that meets constraints
  type = "system"

  update {
    max_parallel = 1
    min_healthy_time = "1m"
    stagger = "1m"
    auto_promote = true
    auto_revert = true
    canary = 1
  }

  group "sidekiq" {
    restart {
      attempts = 1
      delay = "1s"
      mode = "delay"
      interval = "5s"
    }
   
    ...
  }
}

From the config, I had assumed it would keep restarting with a delay. Can anyone provide insight into why it stopped, and how I can make it keep retrying indefinitely whenever there’s a failure?

Thanks!

Hi @axsuul, at first glance the restart block looks fine. For system jobs, only max_parallel and stagger are respected in the update block, but I don’t think the extra fields should affect anything.
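To make the “retry indefinitely” part explicit: the restart stanza’s mode = "delay" is documented as waiting until the next interval before restarting again once the attempts are used up, rather than failing the task (mode = "fail" is what stops retrying). A minimal sketch of how that could look for a system group, with example values only:

group "sidekiq" {
  restart {
    # a few restarts per interval, with some breathing room between them
    attempts = 3
    interval = "5m"
    delay    = "15s"

    # "delay": once attempts within the interval are exhausted, wait for the
    # interval to pass and keep restarting instead of failing the task
    mode = "delay"
  }

  # tasks omitted
}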

Can you get the output of nomad eval list to see if something is blocking placement?
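For example, assuming a Nomad version recent enough to have the eval list command, something like:

nomad eval list
nomad eval status <eval-id>

(replacing <eval-id> with one of the IDs from the list) should show whether an evaluation is blocked and why the last placement failed. Running nomad alloc status against one of the failed allocations may also surface the task events leading up to the failure.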

Thanks for the reply, I’ll have to get you this the next time it happens.