Control Nomad job restart due to Vault key update

Hi team,

Our Nomad jobs are restarted whenever a linked Vault key is updated, which is the desired behaviour. I am looking for info on how we can better control the restarts. We have 8 instances in a job, and if all 8 restart at the same time we experience an outage of about 5-6 seconds (the time it takes for the allocs to restart).

Can the update or restart stanza be useful here? I'm asking as I could not find anything in the docs.

thanks.

Hi @vikas.saroha,

The template.splay job specification option should help avoid this thundering herd problem.
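For example, splay sits inside the template block; the values below are illustrative, not a recommendation:

```hcl
template {
  destination = "secrets/vault-env"
  env         = true

  # splay waits a random duration between 0 and the given value before
  # running the change_mode action, so allocations restart at different
  # times instead of all at once.
  splay = "60s"
}
```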

Thanks,
jrasell and the Nomad team

thanks @jrasell. Appreciate your help!

@jrasell we have noticed that while splay helped with spreading out the restarts, it does not solve the outage problem entirely. The concerned application is a webapp that uses Consul service discovery + Traefik ingress. When an allocation restarts it is not removed from the Consul catalog, and hence Traefik keeps sending requests to it while the alloc is restarting (this takes about 10-15 seconds, during which the app returns 502 errors).
I wonder if there's a way to remove the alloc from Consul before Nomad kills it.

The deployment process does exactly that hence we see no 502 responses during deployments.

I have noticed that the CLI command nomad alloc stop xxxx also removes the service from Consul, so that works too.

Just wanted to confirm with you whether this is possible with Nomad currently, or whether we need to build something custom to handle the template updates.

Hi @vikas.saroha,

Do the services have attached checks? If not, this would be an addition I would look into, as Traefik should remove unhealthy services from its routing table.

I just took a look into the code, and it seems there is a difference between services blocks at the task level or group level. I wonder if you could try moving your service blocks to the task level, and whether this helps your current situation?
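As a rough sketch of what that could look like (names and values here are illustrative), with the service and its check moved inside the task block:

```hcl
task "app" {
  driver = "docker"

  # Service registered at the task level rather than the group level.
  service {
    name = "app-web"
    port = "http"

    # A failing check marks the instance unhealthy in Consul, so a
    # Consul-aware router such as Traefik should stop sending it traffic.
    check {
      name     = "alive"
      type     = "http"
      path     = "/health"
      interval = "2s"
      timeout  = "1s"
    }
  }
}
```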

Thanks,
jrasell and the Nomad team

Thanks for looking into it @jrasell.

Yes, we are using a check block in the service.

check {
  name     = "alive"
  type     = "http"
  path     = "/health"
  interval = "2s"
  timeout  = "1s"
}

I moved the service block to the task level and the outcome is still the same.

There are 2 scenarios where the restarts are gracefully handled -

  1. Deployments
  2. Stopping the allocations by the cli

And in both of those the alloc receives the killing signal. It seems that helps with removing the alloc from the Consul catalog before the alloc is killed.

The events for restarts triggered by a template change, however, show a different sequence -

Perhaps a solution could be to add stop as an option to change_mode, which would shut down the alloc and allow for graceful handling of template changes.
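To illustrate the idea, the hypothetical option (not something Nomad currently supports) would look something like:

```hcl
vault {
  policies = ["app-policy"]

  # Hypothetical: "stop" is not an existing change_mode value; the
  # supported modes here are "noop", "restart" and "signal". A "stop"
  # mode would let Nomad deregister the alloc from Consul and shut it
  # down gracefully, as happens during deployments.
  change_mode = "stop"
}
```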

For reference, this is our job spec -

job "[[.job_name]]" {
  datacenters = ["[[.datacenter]]"]
  type = "service"

  group "[[.group_name]]" {
    count = 3
    network {
      port "http" {
        to = 3000
      }
    }

    update {
      max_parallel     = [[ or .max_parallel 1 ]]
      canary           = [[ or .canary 1 ]]
      auto_revert      = true
      auto_promote     = true
      min_healthy_time = "1s"
      healthy_deadline = "2m"
      health_check = "checks"
    }

    task "[[.task_name]]-[[.namespace]]" {
      driver = "docker"
      config {
        image = "[[.image_repo]]:[[.image_tag]]"
        ports = ["http"]
        entrypoint = ["sh", "-c"]
        command = "bin/puma -C config/puma.rb"
      }

      service {
        name = "[[.namespace]]-puma"
        tags = [
          "traefik.enable=true",
          "traefik.http.routers.[[.namespace]]-puma.rule=Host(`[[.app_url]]`)"
        ]
        port = "http"
        check {
          name     = "alive"
          type     = "http"
          path     = "/health"
          interval = "2s"
          timeout  = "1s"
        }
      }

      shutdown_delay = "10s"

      resources {
        cpu    = [[ or .cpu 500 ]]
        memory = [[ or .memory 600 ]]
        memory_max = [[ or .memory_max (or .memory 600) ]]
      }

      restart {
        attempts = 10
        delay    = "1s"
        interval = "1m"
        mode     = "fail"
      }

      vault {
        policies = ["[[.vault_policy]]"]
        change_mode   = "restart"
      }

      template {
        data = <<-EOH
          {{ with secret "[[.secrets_path]]" }}{{ range $key, $value := .Data.data }}
          {{ $key }} = {{ $value }}
          {{ end }}{{ end }}
        EOH

        destination = "secrets/vault-env"
        env         = true
      }
    }
  }
}

PS: using change_mode=signal results in a similar outcome to restart, so it doesn't fix the problem either.