[nomad-autoscaler][prometheus] Use unallocated resource metrics as target-value strategy

I am successfully running nomad-autoscaler with prometheus as the APM, target-value as the strategy, and aws-asg as the target.

Autoscaler config:

  apm "prometheus" {
    driver = "prometheus"
    config = {
      address = "http://prometheus.example.com:9090"
    }
  }

  strategy "target-value" {
    driver = "target-value"
  }
  target "aws-asg-us-west-2" {
    driver = "aws-asg"
    config = {
      aws_region = "us-west-2"
    }
  }

Autoscaler policy:

scaling "cluster-policy-us-west-2" {
  enabled = true
  min     = 1
  max     = 100
  policy {
    cooldown            = "1m"
    evaluation_interval = "5m"

    check "cpu_allocated_percentage" {
      source = "prometheus"
      query  = "sum(nomad_client_allocated_cpu{region=\"us-west-2\"}*100/(nomad_client_unallocated_cpu{region=\"us-west-2\"}+nomad_client_allocated_cpu{region=\"us-west-2\"}))/count(nomad_client_allocated_cpu{region=\"us-west-2\"})"
      strategy "target-value" {
        target = 80
      }
    }

    check "mem_allocated_percentage" {
      source = "prometheus"
      query  = "sum(nomad_client_allocated_memory{region=\"us-west-2\"}*100/(nomad_client_unallocated_memory{region=\"us-west-2\"}+nomad_client_allocated_memory{region=\"us-west-2\"}))/count(nomad_client_allocated_memory{region=\"us-west-2\"})"
      strategy "target-value" {
        target = 80
      }
    }

    target "aws-asg-us-west-2" {
      dry-run                       = "false"
      aws_asg_name                  = "armada-nomad-accelbyte-dev-180"
      node_class                    = "us-west-2-aws-180"
      node_purge                    = "true"
      node_drain_deadline           = "15m"
      node_drain_ignore_system_jobs = "false"
      node_selector_strategy        = "empty_ignore_system"
    }
  }
}

But what we need is to scale on the unallocated CPU / memory out of the total resources in the ASG. With the policy above, if the ASG has 100 nodes, for example, about 20 of them just sit idle and do nothing.

For example:
if unallocated memory (nomad_client_unallocated_memory) drops below 5G, trigger a scale out. I am still struggling with how to achieve that with the Prometheus query and the target-value strategy in the policy config.

I already have this PromQL in mind:

sum(
  nomad_client_unallocated_memory{region="us-east-1"}
)

But what about the target config of the target-value strategy? I need a scale out when the result is lower than 5G (for example).

Does the target-value strategy support an inverse config, i.e. scale out when the value is lower and scale in when it is higher?

Can anyone help me achieve that?

Hi @petrukngantuk1,

The target-value strategy is not the best fit for this use case. It’s a known limitation of the Nomad Autoscaler and we’re working on a new strategy that would better suit your need.
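To illustrate why, here is a minimal Python sketch of proportional scaling, assuming target-value computes the desired count roughly as ceil(current_count * metric / target). The function name and the numbers are made up for illustration, and this is not the plugin's actual code. It shows that a metric that falls as load rises, such as unallocated memory, pushes the count in the wrong direction.

```python
import math


def target_value_count(current_count: int, metric: float, target: float) -> int:
    """Hypothetical helper: desired count = ceil(current_count * metric / target)."""
    if target == 0:
        raise ValueError("target must be non-zero")
    factor = metric / target
    return math.ceil(current_count * factor)


# Allocated percentage rises with load: 90% used against a target of 80
# asks for more nodes.
print(target_value_count(10, metric=90, target=80))      # 12

# Unallocated memory falls with load: 2500 free against a target of 5000
# asks for fewer nodes, the opposite of the scale out that is wanted.
print(target_value_count(10, metric=2500, target=5000))  # 5
```

Under that assumption, a "lower value means scale out" rule cannot be expressed by changing the target alone.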

Take a look at this issue for a preview of what we’re building: `threshold` strategy · Issue #438 · hashicorp/nomad-autoscaler · GitHub

Once this threshold strategy is done, you will be able to create a policy like this:

check "scale_in_high_threshold_zone" {
  source = "prometheus"
  query  = "sum(nomad_client_unallocated_memory{region=\"us-east-1\"})"

  strategy "threshold" {
    # While our metric value is below 5GB...
    upper_bound = 5000
    # ...remove one instance.
    delta = -1
  }
}

I know it’s hard to tell from a theoretical example, but do you think this is what you are looking for?

After looking at the GitHub issue, I think it is what I was looking for…

Thanks for addressing this.

By the way, is there any ETA for the threshold strategy?

No ETA yet, but we’re working on it right now, so it should be out soon :slightly_smiling_face: