Job stuck in limbo, how to prevent this from happening?

Hi, I have a job stuck in limbo

ID            = sidekiq
Name          = sidekiq
Submit Date   = 2022-06-17T21:32:21-07:00
Type          = system
Priority      = 50
Datacenters   = main1
Namespace     = default
Status        = running
Periodic      = false
Parameterized = false

Summary
Task Group  Queued  Starting  Running  Failed  Complete  Lost  Unknown
sidekiq     0       0         0        3       0         0     0

Allocations
ID        Node ID   Task Group  Version  Desired  Status  Created     Modified
c9df38b5  c0314718  sidekiq     8        run      failed  30m48s ago  26m47s ago
2646c3e2  c0314718  sidekiq     8        run      failed  43m41s ago  37m18s ago
fded9750  c0314718  sidekiq     8        run      failed  1d2h ago    43m42s ago

Unfortunately this doesn’t reveal much about why it has stopped retrying. The job file looks something like this:

job "sidekiq" {
  datacenters = ["main1"]

  # One on each node that meets constraints
  type = "system"

  update {
    max_parallel = 1
    min_healthy_time = "1m"
    stagger = "1m"
    auto_promote = true
    auto_revert = true
    canary = 1
  }

  group "sidekiq" {
    restart {
      attempts = 1
      delay = "1s"
      mode = "delay"
      interval = "5s"
    }
   
    ...
  }
}

From the config, I had assumed it would keep restarting with a delay. Can anyone provide insight into why it stopped, and how I can make it keep retrying indefinitely whenever there’s a failure?

Thanks!

Hi @axsuul, at first glance the restart block looks fine. For system jobs, only max_parallel and stagger are respected in the update block, but I don’t think the extra fields should affect anything.
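To make the “retry indefinitely” part explicit: the restart stanza’s mode = "delay" is documented as waiting until the next interval before restarting again once the attempts are used up, rather than failing the task (mode = "fail" is what stops retrying). A minimal sketch of how that could look for a system group, with example values only:

group "sidekiq" {
  restart {
    # a few restarts per interval, with some breathing room between them
    attempts = 3
    interval = "5m"
    delay    = "15s"

    # "delay": once attempts within the interval are exhausted, wait for the
    # interval to pass and keep restarting instead of failing the task
    mode = "delay"
  }

  # tasks omitted
}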

Can you get the output of nomad eval list to see if something is blocking placement?
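For example, assuming a Nomad version recent enough to have the eval list command, something like:

nomad eval list
nomad eval status <eval-id>

(replacing <eval-id> with one of the IDs from the list) should show whether an evaluation is blocked and why the last placement failed. Running nomad alloc status against one of the failed allocations may also surface the task events leading up to the failure.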

Thanks for the reply, I’ll have to get you this the next time it happens.