Control Nomad job restart due to Vault key update

Hi team,

Our Nomad jobs are restarted whenever a linked Vault key is updated, which is the desired behaviour. I am looking for info on how we can better control the restarts. We have 8 instances in a job, and if all 8 restart at the same time we experience an outage of about 5-6 seconds (the time it takes for the allocs to restart).

Can the update or restart stanza be useful here? I'm asking as I could not find anything in the docs.

thanks.

Hi @vikas.saroha,

The template.splay job specification option should help avoid this thundering herd problem.
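For example, splay sits inside the template block; the values below are illustrative, not a recommendation:

```hcl
template {
  destination = "secrets/vault-env"
  env         = true

  # splay waits a random duration between 0 and the given value before
  # running the change_mode action, so allocations restart at different
  # times instead of all at once.
  splay = "60s"
}
```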

Thanks,
jrasell and the Nomad team

thanks @jrasell. Appreciate your help!

@jrasell we have noticed that while splay helped with spreading out the restarts, it does not solve the outage problem entirely. The concerned application is a webapp that uses Consul service discovery + Traefik ingress. When an allocation restarts it is not removed from the Consul catalog, and hence Traefik keeps sending requests to it while the alloc is restarting (this takes about 10-15 seconds, during which the app returns 502 errors).
I wonder if there's a way to remove the alloc from Consul before Nomad kills it.

The deployment process does exactly that hence we see no 502 responses during deployments.

I have noticed that the CLI command nomad alloc stop xxxx also removes the service from Consul, so that works too.

Just wanted to confirm with you whether this is possible with Nomad currently, or whether we need to build something custom to handle the template updates.

Hi @vikas.saroha,

Do the services have attached checks? If not, this would be an addition I would look into, as Traefik should remove unhealthy services from its routing table.

I just took a look into the code, and it seems there is a difference between services blocks at the task level or group level. I wonder if you could try moving your service blocks to the task level, and whether this helps your current situation?
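As a rough sketch of what that could look like (names and values here are illustrative), with the service and its check moved inside the task block:

```hcl
task "app" {
  driver = "docker"

  # Service registered at the task level rather than the group level.
  service {
    name = "app-web"
    port = "http"

    # A failing check marks the instance unhealthy in Consul, so a
    # Consul-aware router such as Traefik should stop sending it traffic.
    check {
      name     = "alive"
      type     = "http"
      path     = "/health"
      interval = "2s"
      timeout  = "1s"
    }
  }
}
```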

Thanks,
jrasell and the Nomad team

Thanks for looking into it @jrasell.

Yes, we are using a check block in the service.

check {
  name     = "alive"
  type     = "http"
  path     = "/health"
  interval = "2s"
  timeout  = "1s"
}

I moved the service block to the task level and the outcome is still the same.

There are 2 scenarios where the restarts are gracefully handled -

  1. Deployments
  2. Stopping the allocations by the cli

And in both of those the alloc receives the killing signal. It seems that helps with removing the alloc from the Consul catalog before the alloc is killed.

The events for restarts triggered by a template change, however, show a different sequence -

Perhaps a solution could be to add stop as an option to change_mode, which would shut down the alloc and allow for graceful handling of template changes.
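To illustrate the idea, the hypothetical option (not something Nomad currently supports) would look something like:

```hcl
vault {
  policies = ["app-policy"]

  # Hypothetical: "stop" is not an existing change_mode value; the
  # supported modes here are "noop", "restart" and "signal". A "stop"
  # mode would let Nomad deregister the alloc from Consul and shut it
  # down gracefully, as happens during deployments.
  change_mode = "stop"
}
```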

For reference, this is our job spec -

job "[[.job_name]]" {
  datacenters = ["[[.datacenter]]"]
  type = "service"

  group "[[.group_name]]" {
    count = 3
    network {
      port "http" {
        to = 3000
      }
    }

    update {
      max_parallel     = [[ or .max_parallel 1 ]]
      canary           = [[ or .canary 1 ]]
      auto_revert      = true
      auto_promote     = true
      min_healthy_time = "1s"
      healthy_deadline = "2m"
      health_check = "checks"
    }

    task "[[.task_name]]-[[.namespace]]" {
      driver = "docker"
      config {
        image = "[[.image_repo]]:[[.image_tag]]"
        ports = ["http"]
        entrypoint = ["sh", "-c"]
        command = "bin/puma -C config/puma.rb"
      }

      service {
        name = "[[.namespace]]-puma"
        tags = [
          "traefik.enable=true",
          "traefik.http.routers.[[.namespace]]-puma.rule=Host(`[[.app_url]]`)"
        ]
        port = "http"
        check {
          name     = "alive"
          type     = "http"
          path     = "/health"
          interval = "2s"
          timeout  = "1s"
        }
      }

      shutdown_delay = "10s"

      resources {
        cpu    = [[ or .cpu 500 ]]
        memory = [[ or .memory 600 ]]
        memory_max = [[ or .memory_max (or .memory 600) ]]
      }

      restart {
        attempts = 10
        delay    = "1s"
        interval = "1m"
        mode     = "fail"
      }

      vault {
        policies = ["[[.vault_policy]]"]
        change_mode   = "restart"
      }

      template {
        data = <<-EOH
          {{ with secret "[[.secrets_path]]" }}{{ range $key, $value := .Data.data }}
          {{ $key }} = {{ $value }}
          {{ end }}{{ end }}
        EOH

        destination = "secrets/vault-env"
        env         = true
      }
    }
  }
}

PS: using change_mode=signal results in a similar outcome to restart, so it doesn't fix the problem either.