Autoscaler question: what happens when the job is stopped in Nomad?

dpotapov · June 10, 2021, 7:21pm

I’m considering using the autoscaler and wasn’t able to find in the documentation how it behaves in the following situation:

1. Job submitted to Nomad with horizontal scaling enabled
2. Autoscaler scales up (new nodes are dynamically created)
3. Job stopped (and purged) by administrator

Will the autoscaler do a cleanup afterward?

lgfa29 · June 11, 2021, 12:44am

Hi @dpotapov

Yes, the Autoscaler will automatically scale in your cluster if necessary. If you want to see this in action, you can follow this hands-on tutorial: Horizontal Cluster Autoscaling | Nomad - HashiCorp Learn.

Let me know if that answers your question.

dpotapov · June 11, 2021, 2:58pm

I’ve figured out how to reproduce the behavior I’m asking about, and apparently the autoscaler does not perform a scale-in action when the job is stopped.

So I’ve created autoscaler-config.hcl with the following contents:

plugin_dir = "./plugins"

nomad {
  address = "http://localhost:4646"
}

target "noop-target" {
  driver = "noop-target"
}

strategy "fixed-value" {
  driver = "fixed-value"
}

I run the autoscaler like this:

./nomad-autoscaler agent -config autoscaler-config.hcl

My job file:

job "testjob" {
  datacenters = ["dc1"]
  type = "service"

  group "testjob" {

    scaling {
      enabled = true
      min = 1
      max = 3
      policy {
        evaluation_interval = "5s"
        cooldown = "30s"

        check "fixed-3" {
          strategy "fixed-value" {
            value = 3
          }
        }

        target "noop-target" {
          count = "3"
          ready = "true"
        }
      }
    }

    task "testtask" {
      driver = "raw_exec"

      config {
        command = "/bin/sh"
        args = ["-c", "while true; do date; sleep 10; done"]
      }
    }
  }
}

Autoscaler log:

2021-06-11T09:41:11.486-0500 [INFO]  agent: Nomad Autoscaler agent started! Log data will stream in below:
2021-06-11T09:41:11.486-0500 [INFO]  agent.http_server: server now listening for connections: address=127.0.0.1:8080
2021-06-11T09:41:11.668-0500 [INFO]  agent.plugin_manager: successfully launched and dispensed plugin: plugin_name=nomad-target
2021-06-11T09:41:12.191-0500 [INFO]  external_plugin.noop-target: set config: config=map[nomad_address:http://localhost:4646] timestamp=2021-06-11T09:41:12.191-0500
2021-06-11T09:41:12.191-0500 [INFO]  agent.plugin_manager: successfully launched and dispensed plugin: plugin_name=noop-target
2021-06-11T09:41:12.281-0500 [INFO]  agent.plugin_manager: successfully launched and dispensed plugin: plugin_name=nomad-apm
2021-06-11T09:41:12.281-0500 [INFO]  agent.plugin_manager: successfully launched and dispensed plugin: plugin_name=target-value
2021-06-11T09:41:12.281-0500 [INFO]  agent.plugin_manager: successfully launched and dispensed plugin: plugin_name=fixed-value
2021-06-11T09:41:12.282-0500 [INFO]  policy_eval: starting workers: cluster=10 horizontal=10
2021-06-11T09:41:26.800-0500 [INFO]  external_plugin.noop-target: received status request: count=3 ready=true timestamp=2021-06-11T09:41:26.800-0500
2021-06-11T09:41:26.801-0500 [INFO]  external_plugin.noop-target: received status request: count=3 ready=true timestamp=2021-06-11T09:41:26.801-0500
2021-06-11T09:41:31.800-0500 [INFO]  external_plugin.noop-target: received status request: count=3 ready=true timestamp=2021-06-11T09:41:31.800-0500
2021-06-11T09:41:31.800-0500 [INFO]  external_plugin.noop-target: received status request: count=3 ready=true timestamp=2021-06-11T09:41:31.800-0500
2021-06-11T09:41:36.802-0500 [INFO]  external_plugin.noop-target: received status request: count=3 ready=true timestamp=2021-06-11T09:41:36.802-0500

Now I run nomad job stop testjob and expect the autoscaler to scale down from 3 to 0, but it just keeps querying the status:

2021-06-11T09:41:41.802-0500 [INFO]  external_plugin.noop-target: received status request: ready=true count=3 timestamp=2021-06-11T09:41:41.802-0500
2021-06-11T09:41:41.802-0500 [INFO]  external_plugin.noop-target: received status request: count=3 ready=true timestamp=2021-06-11T09:41:41.802-0500
2021-06-11T09:41:46.801-0500 [INFO]  external_plugin.noop-target: received status request: count=3 ready=true timestamp=2021-06-11T09:41:46.801-0500
2021-06-11T09:41:46.801-0500 [INFO]  external_plugin.noop-target: received status request: count=3 ready=true timestamp=2021-06-11T09:41:46.801-0500
2021-06-11T09:41:51.802-0500 [INFO]  external_plugin.noop-target: received status request: count=3 ready=true timestamp=2021-06-11T09:41:51.802-0500
2021-06-11T09:41:51.802-0500 [INFO]  external_plugin.noop-target: received status request: count=3 ready=true timestamp=2021-06-11T09:41:51.802-0500
2021-06-11T09:41:56.802-0500 [INFO]  external_plugin.noop-target: received status request: count=3 ready=true timestamp=2021-06-11T09:41:56.802-0500

If I run nomad job stop -purge testjob, the autoscaler stops doing anything at all. That is not quite what I would expect.
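
One possible reading of the log above: the noop target keeps reporting the count and ready values it was configured with, and the fixed-value check asks for that same number, so there is never anything to change. The Python sketch below is a simplified model of that comparison, not the autoscaler's actual code; the function name fixed_value_direction is made up for illustration.

```python
def fixed_value_direction(current_count: int, fixed_value: int) -> str:
    """Simplified model: which direction a fixed-value check would request."""
    if current_count < fixed_value:
        return "up"
    if current_count > fixed_value:
        return "down"
    return "none"


# The noop target was configured with count = "3". In the log it reports
# count=3 both before and after "nomad job stop testjob".
reported_count = 3

# The check "fixed-3" asks for value = 3, so no scaling action is needed.
print(fixed_value_direction(reported_count, 3))
```

Under this model a scale down would only be requested if the target reported a count above the fixed value, which the noop target never does here.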

lgfa29 · June 11, 2021, 10:34pm

You need a scaling policy that targets your cluster clients. Take a look at the policy shown as an example in the “Run the Nomad Autoscaler job” section of the tutorial:

scaling "cluster_policy" {
  enabled = true
  min     = 1
  max     = 2

  policy {
    cooldown            = "2m"
    evaluation_interval = "1m"

    check "cpu_allocated_percentage" {
      source = "prometheus"
      query  = "sum(nomad_client_allocated_cpu{node_class=\"hashistack\"}*100/(nomad_client_unallocated_cpu{node_class=\"hashistack\"}+nomad_client_allocated_cpu{node_class=\"hashistack\"}))/count(nomad_client_allocated_cpu{node_class=\"hashistack\"})"

      strategy "target-value" {
        target = 70
      }
    }

...

    check "mem_allocated_percentage" {
      source = "prometheus"
      query  = "sum(nomad_client_allocated_memory{node_class=\"hashistack\"}*100/(nomad_client_unallocated_memory{node_class=\"hashistack\"}+nomad_client_allocated_memory{node_class=\"hashistack\"}))/count(nomad_client_allocated_memory{node_class=\"hashistack\"})"

      strategy "target-value" {
        target = 70
      }
    }

...

    target "aws-asg" {
      dry-run             = "false"
      aws_asg_name        = "hashistack-nomad_client"
      node_class          = "hashistack"
      node_drain_deadline = "5m"
    }
  }
}

This one, for example, will scale an AWS ASG based on the percentage of allocated memory and CPU.
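
To make the scale-in behaviour more concrete, here is a small Python sketch of how I understand the target-value strategy to work: it divides the observed metric by the target, multiplies the current node count by that factor (rounding up), and the result is then kept within the policy's min and max. Treat it as a simplified model under those assumptions rather than the plugin's real code; the name desired_count and the 0.01 threshold default are illustrative.

```python
import math


def desired_count(current: int, metric: float, target: float,
                  min_count: int, max_count: int,
                  threshold: float = 0.01) -> int:
    """Simplified target-value model: scale the count by metric / target."""
    factor = metric / target
    # Assumed behaviour: tiny deviations from the target trigger no action.
    if abs(factor - 1.0) <= threshold:
        return current
    wanted = math.ceil(current * factor)
    # The policy's min and max bound whatever the strategy asks for.
    return max(min_count, min(max_count, wanted))


# Policy values from the example above: target = 70, min = 1, max = 2.
# Jobs stopped, so only 20% of CPU is allocated across 2 clients:
print(desired_count(2, 20.0, 70.0, 1, 2))
# Heavy load, 95% allocated on a single client:
print(desired_count(1, 95.0, 70.0, 1, 2))
```

In this model, once jobs are stopped and the allocated percentage drops well below the target, the desired count falls toward min, but the cluster is never scaled below min = 1.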
