Cluster Autoscaler: AWS ASG Targets hitting rate limit


We currently have a fairly basic cluster scaling setup that keeps our ASG instance counts in line with our workloads. In general it's behaving well and has scaled with our growth nearly perfectly.

We've recently run into an issue where we hit the DescribeScalingActivities rate limit, which occasionally causes the autoscaler to fail to adjust the ASG's desired capacity to handle the load.

Here's the error message:

2022-04-07T18:36:54.749Z [ERROR] policy_eval.worker: failed to evaluate policy: eval_id=09237e09-8c3e-9d62-bc9f-1b696da5d8f2 eval_token=b9007fc5-c4f6-e676-35f1-c816f6e93a88 id=3fd92140-b8ea-2b16-5c44-360d0835e825 policy_id=fa2ddbe7-cd4d-cbde-9e5e-a3dff5d7287a queue=cluster error="failed to fetch current count: failed to describe AWS Autoscaling Group activities: operation error Auto Scaling: DescribeScalingActivities, exceeded maximum number of attempts, 3, https response error StatusCode: 400, RequestID: 7df3bfac-f6ef-4ac9-915e-afda1792e0e1, api error Throttling: Rate exceeded"

Is there a configuration option I can tune to fix this? Or a policy structure that would result in fewer rate-limit issues?

Each ASG currently gets a policy file roughly equivalent to this one:

scaling "core-worker" {
  enabled = true
  min     = 1
  max     = 20

  policy {
    cooldown            = "2m"
    evaluation_interval = "5m"

    check "cpu_allocated_percentage" {
      source = "nomad-apm"
      query  = "percentage-allocated_cpu"

      strategy "target-value" {
        target = 60
      }
    }

    check "mem_allocated_percentage" {
      source = "nomad-apm"
      query  = "percentage-allocated_memory"

      strategy "target-value" {
        target = 80
      }
    }

    target "aws-asg" {
      dry_run                       = false
      aws_asg_name                  = "production-us-east-1-core-worker"
      node_class                    = "core-worker"
      node_drain_deadline           = "5m"
      node_drain_ignore_system_jobs = true
      node_purge                    = true
    }
  }
}
In total, we have two environments (stage and production) consisting of ~23 ASGs, with two autoscaling daemons, one per environment. The daemons use EC2 instance profiles with IAM policies. Would switching to unique IAM users and passing in the access keys/secrets be a potential fix?
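For what it's worth, here's a back-of-envelope estimate of our steady-state call volume, assuming (and this is an assumption on my part; the plugin may well make more than one call per evaluation) one DescribeScalingActivities call per ASG per evaluation cycle:

```go
package main

import "fmt"

// callsPerMinute estimates the steady-state API call volume given a number
// of ASGs, an assumed number of calls per policy evaluation, and the
// evaluation_interval from the policy, expressed in minutes.
func callsPerMinute(asgs, callsPerEval, evalIntervalMin float64) float64 {
	return asgs * callsPerEval / evalIntervalMin
}

func main() {
	// 23 ASGs, one assumed call each, evaluated every 5 minutes.
	fmt.Printf("~%.1f DescribeScalingActivities calls/min\n", callsPerMinute(23, 1, 5))
}
```

That steady-state number looks low, which makes me suspect the throttling comes from bursts (e.g. several policies evaluating at the same instant) rather than the average rate.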

Hey @Cbeck527

Welcome to the Nomad community! Are you still experiencing this? :eyes:

Hey @Amier! We were still seeing this, but we ended up getting our rate limit raised via an AWS support ticket. The built-in service limits console didn't expose the specific limit we needed raised, so we had to go through support.

I'm not sure where the code for the AWS ASG target plugin lives, but there might also be a way to harden the backoff logic for the underlying API calls. I've had to do something similar for internal projects: Retries and Timeouts | AWS SDK for Go V2

But that’s just me being an armchair open source developer, so feel free to ignore :slight_smile: