Autoscaler question: what happens when a job is stopped in Nomad?

I’m considering using the autoscaler and wasn’t able to find in the documentation how it behaves in the following situation:

  1. Job submitted to Nomad with horizontal scaling enabled
  2. Autoscaler scales up (new nodes are dynamically created)
  3. Job stopped (and purged) by administrator
  4. Will the autoscaler do a cleanup afterward?

Hi @dpotapov :wave:

Yes, the Autoscaler will automatically scale in your cluster if necessary. If you want to see this in action, you can follow this hands-on tutorial: Horizontal Cluster Autoscaling | Nomad - HashiCorp Learn.

Let me know if that answers your question :slightly_smiling_face:

I’ve figured out how to reproduce the scenario I had in mind, and apparently the autoscaler does not perform a scale-in action when the job is stopped.

So I’ve created autoscaler-config.hcl with the following contents:

plugin_dir = "./plugins"

nomad {
  address = "http://localhost:4646"
}

target "noop-target" {
  driver = "noop-target"
}

strategy "fixed-value" {
  driver = "fixed-value"
}

I run the autoscaler like this:

./nomad-autoscaler agent -config autoscaler-config.hcl

My job file:

job "testjob" {
  datacenters = ["dc1"]
  type = "service"

  group "testjob" {

    scaling {
      enabled = true
      min = 1
      max = 3
      policy {
        evaluation_interval = "5s"
        cooldown = "30s"

        check "fixed-3" {
          strategy "fixed-value" {
            value = 3
          }
        }

        target "noop-target" {
          count = "3"
          ready = "true"
        }
      }
    }

    task "testtask" {
      driver = "raw_exec"

      config {
        command = "/bin/sh"
        args = ["-c", "while true; do date; sleep 10; done"]
      }
    }
  }
}

Autoscaler log:

2021-06-11T09:41:11.486-0500 [INFO]  agent: Nomad Autoscaler agent started! Log data will stream in below:
2021-06-11T09:41:11.486-0500 [INFO]  agent.http_server: server now listening for connections: address=127.0.0.1:8080
2021-06-11T09:41:11.668-0500 [INFO]  agent.plugin_manager: successfully launched and dispensed plugin: plugin_name=nomad-target
2021-06-11T09:41:12.191-0500 [INFO]  external_plugin.noop-target: set config: config=map[nomad_address:http://localhost:4646] timestamp=2021-06-11T09:41:12.191-0500
2021-06-11T09:41:12.191-0500 [INFO]  agent.plugin_manager: successfully launched and dispensed plugin: plugin_name=noop-target
2021-06-11T09:41:12.281-0500 [INFO]  agent.plugin_manager: successfully launched and dispensed plugin: plugin_name=nomad-apm
2021-06-11T09:41:12.281-0500 [INFO]  agent.plugin_manager: successfully launched and dispensed plugin: plugin_name=target-value
2021-06-11T09:41:12.281-0500 [INFO]  agent.plugin_manager: successfully launched and dispensed plugin: plugin_name=fixed-value
2021-06-11T09:41:12.282-0500 [INFO]  policy_eval: starting workers: cluster=10 horizontal=10
2021-06-11T09:41:26.800-0500 [INFO]  external_plugin.noop-target: received status request: count=3 ready=true timestamp=2021-06-11T09:41:26.800-0500
2021-06-11T09:41:26.801-0500 [INFO]  external_plugin.noop-target: received status request: count=3 ready=true timestamp=2021-06-11T09:41:26.801-0500
2021-06-11T09:41:31.800-0500 [INFO]  external_plugin.noop-target: received status request: count=3 ready=true timestamp=2021-06-11T09:41:31.800-0500
2021-06-11T09:41:31.800-0500 [INFO]  external_plugin.noop-target: received status request: count=3 ready=true timestamp=2021-06-11T09:41:31.800-0500
2021-06-11T09:41:36.802-0500 [INFO]  external_plugin.noop-target: received status request: count=3 ready=true timestamp=2021-06-11T09:41:36.802-0500

Now I run nomad job stop testjob and expect the autoscaler to scale down from 3 to 0, but it just keeps querying the status:

2021-06-11T09:41:41.802-0500 [INFO]  external_plugin.noop-target: received status request: ready=true count=3 timestamp=2021-06-11T09:41:41.802-0500
2021-06-11T09:41:41.802-0500 [INFO]  external_plugin.noop-target: received status request: count=3 ready=true timestamp=2021-06-11T09:41:41.802-0500
2021-06-11T09:41:46.801-0500 [INFO]  external_plugin.noop-target: received status request: count=3 ready=true timestamp=2021-06-11T09:41:46.801-0500
2021-06-11T09:41:46.801-0500 [INFO]  external_plugin.noop-target: received status request: count=3 ready=true timestamp=2021-06-11T09:41:46.801-0500
2021-06-11T09:41:51.802-0500 [INFO]  external_plugin.noop-target: received status request: count=3 ready=true timestamp=2021-06-11T09:41:51.802-0500
2021-06-11T09:41:51.802-0500 [INFO]  external_plugin.noop-target: received status request: count=3 ready=true timestamp=2021-06-11T09:41:51.802-0500
2021-06-11T09:41:56.802-0500 [INFO]  external_plugin.noop-target: received status request: count=3 ready=true timestamp=2021-06-11T09:41:56.802-0500

If I run nomad job stop -purge testjob, the autoscaler stops doing anything at all. That's not quite what I would expect.

You need a scaling policy that targets your cluster clients. Take a look at the policy shown as an example in the “Run the Nomad Autoscaler job” section of the tutorial:

scaling "cluster_policy" {
  enabled = true
  min     = 1
  max     = 2

  policy {
    cooldown            = "2m"
    evaluation_interval = "1m"

    check "cpu_allocated_percentage" {
      source = "prometheus"
      query  = "sum(nomad_client_allocated_cpu{node_class=\"hashistack\"}*100/(nomad_client_unallocated_cpu{node_class=\"hashistack\"}+nomad_client_allocated_cpu{node_class=\"hashistack\"}))/count(nomad_client_allocated_cpu{node_class=\"hashistack\"})"

      strategy "target-value" {
        target = 70
      }
    }

...

    check "mem_allocated_percentage" {
      source = "prometheus"
      query  = "sum(nomad_client_allocated_memory{node_class=\"hashistack\"}*100/(nomad_client_unallocated_memory{node_class=\"hashistack\"}+nomad_client_allocated_memory{node_class=\"hashistack\"}))/count(nomad_client_allocated_memory{node_class=\"hashistack\"})"

      strategy "target-value" {
        target = 70
      }
    }

...

    target "aws-asg" {
      dry-run             = "false"
      aws_asg_name        = "hashistack-nomad_client"
      node_class          = "hashistack"
      node_drain_deadline = "5m"
    }
  }
}

This one, for example, will scale an AWS ASG based on the percentage of allocated memory and CPU.
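
For context, a cluster policy like this normally lives in its own file that the Autoscaler agent loads from a policy directory, and the agent needs the matching APM, target, and strategy plugins configured. Here is a minimal sketch of that agent-side config. The Prometheus address, the AWS region, and the policy directory path are placeholders I've assumed for illustration, not values from the tutorial:

```hcl
# Agent config sketch for a cluster scaling policy (assumed values marked below).

nomad {
  address = "http://localhost:4646"
}

# Metrics source referenced by `source = "prometheus"` in the policy checks.
apm "prometheus" {
  driver = "prometheus"
  config = {
    address = "http://localhost:9090" # assumption: local Prometheus
  }
}

# Target plugin referenced by `target "aws-asg"` in the policy.
target "aws-asg" {
  driver = "aws-asg"
  config = {
    aws_region = "us-east-1" # assumption: replace with your region
  }
}

# Strategy referenced by `strategy "target-value"` in the policy checks.
strategy "target-value" {
  driver = "target-value"
}

# Directory containing the cluster policy file(s).
policy {
  dir = "./policies" # assumption: hypothetical path
}
```

With something like this in place, stopping a job lowers the allocated CPU and memory percentages reported for the clients, and the target-value checks can then ask the ASG to scale in, subject to the policy's min value and cooldown.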