Hello!
We currently have a fairly basic cluster scaling setup configured to keep our ASG instance counts in line with our workloads. In general it's behaving well and has let us scale almost seamlessly.
We've run into an issue lately where we're hitting the DescribeScalingActivities rate limit, which occasionally causes the autoscaler to fail to change the ASG's desired capacity to handle the load. Here's the error message:
2022-04-07T18:36:54.749Z [ERROR] policy_eval.worker: failed to evaluate policy: eval_id=09237e09-8c3e-9d62-bc9f-1b696da5d8f2 eval_token=b9007fc5-c4f6-e676-35f1-c816f6e93a88 id=3fd92140-b8ea-2b16-5c44-360d0835e825 policy_id=fa2ddbe7-cd4d-cbde-9e5e-a3dff5d7287a queue=cluster error="failed to fetch current count: failed to describe AWS Autoscaling Group activities: operation error Auto Scaling: DescribeScalingActivities, exceeded maximum number of attempts, 3, https response error StatusCode: 400, RequestID: 7df3bfac-f6ef-4ac9-915e-afda1792e0e1, api error Throttling: Rate exceeded"
Is there some kind of configuration option I can tune to fix this, or some kind of policy structure that would result in fewer rate-limit issues?
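One knob I've been eyeing, though I'm not sure it's the right lever, is the autoscaler agent's policy_eval block. If I'm reading the docs correctly, lowering the number of cluster workers should reduce how many policies are evaluated (and therefore hit the AWS API) concurrently. A sketch of what I mean, with the worker counts being pure guesses on my part:

# Autoscaler agent config sketch (not something we run today).
# Theory: fewer cluster workers => fewer concurrent evaluations
# => fewer simultaneous DescribeScalingActivities calls.
policy_eval {
  workers = {
    cluster    = 1
    horizontal = 10
  }
}

Is that a sensible trade-off, or does it just push the problem around?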
We currently have each ASG getting a policy file roughly equivalent to this one:
scaling "core-worker" {
enabled = true
min = 1
max = 20
policy {
cooldown = "2m"
evaluation_interval = "5m"
check "cpu_allocated_percentage" {
source = "nomad-apm"
query = "percentage-allocated_cpu"
strategy "target-value" {
target = 60
}
}
check "mem_allocated_percentage" {
source = "nomad-apm"
query = "percentage-allocated_memory"
strategy "target-value" {
target = 80
}
}
target "aws-asg" {
dry_run = false
aws_asg_name = "production-us-east-1-core-worker"
node_class = "core-worker"
node_drain_deadline = "5m"
node_drain_ignore_system_jobs = true
node_purge = true
}
}
}
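The only policy-level idea I've come up with so far is lengthening (and maybe staggering) the evaluation_interval so each policy polls AWS less often, along the lines of the hypothetical tweak below, though I don't know whether that meaningfully reduces the DescribeScalingActivities call volume or just spreads it out:

policy {
  cooldown            = "2m"
  # Hypothetical tweak: evaluate half as often (the 10m value is a guess),
  # on the theory that fewer evaluations mean fewer AWS API calls per policy.
  evaluation_interval = "10m"

  # check and target blocks unchanged from the policy above
}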
In total, we have two environments (stage and production) consisting of ~23 ASGs, with two autoscaler daemons, one per environment. The daemons use EC2 instance profiles with IAM policies for their AWS credentials. Would switching to unique IAM users per daemon and passing in access keys/secrets be a potential fix?
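To be concrete about what I mean by that last option: I'd move the credentials into the aws-asg target plugin block of each agent's config, roughly like the sketch below (the key names are from my reading of the aws-asg plugin docs and the values are placeholders). Part of why I'm asking is that I'm not sure whether the throttling is applied per IAM principal or per account/region, in which case separate users wouldn't help at all.

# Agent config sketch — static credentials instead of the instance profile.
target "aws-asg" {
  driver = "aws-asg"
  config = {
    aws_region            = "us-east-1"
    aws_access_key_id     = "AKIA-EXAMPLE"          # placeholder
    aws_secret_access_key = "example-secret-value"  # placeholder
  }
}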