Hello!
We currently have a fairly basic cluster scaling setup configured to keep our ASG instance counts in line with our workloads. In general it's behaving well and has let us scale almost seamlessly.
We've run into an issue lately where we're hitting the DescribeScalingActivities rate limit, which occasionally causes the autoscaler to fail to change the ASG's desired capacity to handle the load. Here's the error message:
2022-04-07T18:36:54.749Z [ERROR] policy_eval.worker: failed to evaluate policy: eval_id=09237e09-8c3e-9d62-bc9f-1b696da5d8f2 eval_token=b9007fc5-c4f6-e676-35f1-c816f6e93a88 id=3fd92140-b8ea-2b16-5c44-360d0835e825 policy_id=fa2ddbe7-cd4d-cbde-9e5e-a3dff5d7287a queue=cluster error="failed to fetch current count: failed to describe AWS Autoscaling Group activities: operation error Auto Scaling: DescribeScalingActivities, exceeded maximum number of attempts, 3, https response error StatusCode: 400, RequestID: 7df3bfac-f6ef-4ac9-915e-afda1792e0e1, api error Throttling: Rate exceeded"
Is there some kind of configuration option I can tune to fix this, or some kind of policy structure that would result in fewer rate-limit issues?
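One knob I've been eyeing, though I'm not sure it's the right lever, is the autoscaler agent's policy_eval block. If I'm reading the docs correctly, lowering the number of cluster workers should reduce how many policies are evaluated (and therefore hit the AWS API) concurrently. A sketch of what I mean, with the worker counts being pure guesses on my part:

# Autoscaler agent config sketch (not something we run today).
# Theory: fewer cluster workers => fewer concurrent evaluations
# => fewer simultaneous DescribeScalingActivities calls.
policy_eval {
  workers = {
    cluster    = 1
    horizontal = 10
  }
}

Is that a sensible trade-off, or does it just push the problem around?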
We currently have each ASG getting a policy file roughly equivalent to this one:
scaling "core-worker" {
enabled = true
min = 1
max = 20
policy {
cooldown = "2m"
evaluation_interval = "5m"
check "cpu_allocated_percentage" {
source = "nomad-apm"
query = "percentage-allocated_cpu"
strategy "target-value" {
target = 60
}
}
check "mem_allocated_percentage" {
source = "nomad-apm"
query = "percentage-allocated_memory"
strategy "target-value" {
target = 80
}
}
target "aws-asg" {
dry_run = false
aws_asg_name = "production-us-east-1-core-worker"
node_class = "core-worker"
node_drain_deadline = "5m"
node_drain_ignore_system_jobs = true
node_purge = true
}
}
}
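The only policy-level idea I've come up with so far is lengthening (and maybe staggering) the evaluation_interval so each policy polls AWS less often, along the lines of the hypothetical tweak below, though I don't know whether that meaningfully reduces the DescribeScalingActivities call volume or just spreads it out:

policy {
  cooldown            = "2m"
  # Hypothetical tweak: evaluate half as often (the 10m value is a guess),
  # on the theory that fewer evaluations mean fewer AWS API calls per policy.
  evaluation_interval = "10m"

  # check and target blocks unchanged from the policy above
}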
In total, we have two environments (stage and production) consisting of ~23 ASGs, with two autoscaler daemons, one per environment. The daemons use EC2 instance profiles with IAM policies for their AWS credentials. Would switching to unique IAM users per daemon and passing in access keys/secrets be a potential fix?
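To be concrete about what I mean by that last option: I'd move the credentials into the aws-asg target plugin block of each agent's config, roughly like the sketch below (the key names are from my reading of the aws-asg plugin docs and the values are placeholders). Part of why I'm asking is that I'm not sure whether the throttling is applied per IAM principal or per account/region, in which case separate users wouldn't help at all.

# Agent config sketch — static credentials instead of the instance profile.
target "aws-asg" {
  driver = "aws-asg"
  config = {
    aws_region            = "us-east-1"
    aws_access_key_id     = "AKIA-EXAMPLE"          # placeholder
    aws_secret_access_key = "example-secret-value"  # placeholder
  }
}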