[aws][nomad-autoscaler] Multi-region AWS ASGs with one Nomad Autoscaler container

Hello, is it possible to use a single nomad-autoscaler task / container / job to manage multi-region Nomad client ASGs?

Config:

      log_level = "INFO"

      nomad {
        address = "http://nomad-server.example.com:4646"
        region  = "global"
      }

      http {
        bind_address = "0.0.0.0"
        bind_port    = 8080
      }

      telemetry {
        disable_hostname          = true
        enable_hostname_label     = true
        collection_interval       = "1s"
        prometheus_metrics        = true
        prometheus_retention_time = "6h"
      }

      apm "prometheus" {
        driver = "prometheus"
        config = {
          address = "http://prometheus.example.com:9090"
        }
      }

      strategy "target-value" {
        driver = "target-value"
      }

      target "aws-asg-us-west-2" {
        driver = "aws-asg"
        config = {
          aws_region = "us-west-2"
        }
      }

      target "aws-asg-us-east-1" {
        driver = "aws-asg"
        config = {
          aws_region = "us-east-1"
        }
      }

Policies:

      scaling "cluster-policy-us-west-2" {
        enabled = true
        min     = 1
        max     = 10
        policy {
          cooldown            = "1m"
          evaluation_interval = "1m"
          check "cpu_allocated_percentage" {
            source = "prometheus"
            query  = "sum(nomad_client_allocated_cpu{region=\"us-west-2\"}*100/(nomad_client_unallocated_cpu{region=\"us-west-2\"}+nomad_client_allocated_cpu{region=\"us-west-2\"}))/count(nomad_client_allocated_cpu{region=\"us-west-2\"})"
            strategy "target-value" {
              target = 70
            }
          }

          check "mem_allocated_percentage" {
            source = "prometheus"
            query  = "sum(nomad_client_allocated_memory{region=\"us-west-2\"}*100/(nomad_client_unallocated_memory{region=\"us-west-2\"}+nomad_client_allocated_memory{region=\"us-west-2\"}))/count(nomad_client_allocated_memory{region=\"us-west-2\"})"
            strategy "target-value" {
              target = 70
            }
          }

          target "aws-asg-us-west-2" {
            dry-run                       = "false"
            aws_asg_name                  = "nomad-clients-us-west-2-888"
            node_class                    = "us-west-2-aws"
            node_purge                    = "true"
            node_drain_deadline           = "15m"
            node_drain_ignore_system_jobs = "false"
            node_selector_strategy        = "empty_ignore_system"
          }
        }
      }

      scaling "cluster-policy-us-east-1" {
        enabled = true
        min     = 1
        max     = 10
        policy {
          cooldown            = "1m"
          evaluation_interval = "1m"
          check "cpu_allocated_percentage" {
            source = "prometheus"
            query  = "sum(nomad_client_allocated_cpu{region=\"us-east-1\"}*100/(nomad_client_unallocated_cpu{region=\"us-east-1\"}+nomad_client_allocated_cpu{region=\"us-east-1\"}))/count(nomad_client_allocated_cpu{region=\"us-east-1\"})"
            strategy "target-value" {
              target = 70
            }
          }

          check "mem_allocated_percentage" {
            source = "prometheus"
            query  = "sum(nomad_client_allocated_memory{region=\"us-east-1\"}*100/(nomad_client_unallocated_memory{region=\"us-east-1\"}+nomad_client_allocated_memory{region=\"us-east-1\"}))/count(nomad_client_allocated_memory{region=\"us-east-1\"})"
            strategy "target-value" {
              target = 70
            }
          }

          target "aws-asg-us-east-1" {
            dry-run                       = "false"
            aws_asg_name                  = "nomad-clients-us-east-1-999"
            node_class                    = "us-east-1-aws"
            node_purge                    = "true"
            node_drain_deadline           = "15m"
            node_drain_ignore_system_jobs = "false"
            node_selector_strategy        = "empty_ignore_system"
          }
        }
      }

I've already tried the config above, but no luck so far.
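In case it helps, cluster scaling policies like the ones above are loaded from files via the agent's `policy` block (or the `-policy-dir` flag). A minimal sketch of that block, assuming the two policy files are saved under /etc/nomad-autoscaler/policies (the path is just for illustration):

      # Load cluster scaling policies (one file per policy) from this directory.
      policy {
        dir                         = "/etc/nomad-autoscaler/policies"
        default_cooldown            = "1m"
        default_evaluation_interval = "1m"
      }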

Some logs:

2021-04-13T08:25:04.234Z [ERROR] policy_eval.worker: failed to evaluate policy: eval_id=ca908cdd-b966-2539-7875-e0fa5167f2c4 eval_token=50bb1578-9101-71ed-326c-8d4ba2a8e05b id=80ebb943-0b1c-c76e-f4bb-3e1ac59002ad policy_id=1c9a16e5-597a-eb07-3d63-8e816c3a63fa queue=cluster error="failed to scale target: failed to perform scaling action: 1 selected nodes are not found within ASG"
2021-04-13T08:25:04.234Z [WARN]  policy_eval.broker: eval delivery limit reached: eval_id=ca908cdd-b966-2539-7875-e0fa5167f2c4 policy_id=1c9a16e5-597a-eb07-3d63-8e816c3a63fa token=50bb1578-9101-71ed-326c-8d4ba2a8e05b count=1 limit=1
2021-04-13T08:26:00.760Z [INFO]  policy_eval.worker: scaling target: id=f67e05ce-2c38-60c1-8d84-50e4060c41e0 policy_id=1c9a16e5-597a-eb07-3d63-8e816c3a63fa queue=cluster target=aws-asg-us-west-2 from=6 to=1 reason="scaling down because factor is 0.051227" meta=map[nomad_policy_id:1c9a16e5-597a-eb07-3d63-8e816c3a63fa]
2021-04-13T08:26:04.434Z [ERROR] policy_eval.worker: failed to evaluate policy: eval_id=6987504c-7631-d0d8-4b41-fee71ce97ce9 eval_token=cccdea9f-0a46-9dcd-17aa-68881ea6c553 id=f67e05ce-2c38-60c1-8d84-50e4060c41e0 policy_id=1c9a16e5-597a-eb07-3d63-8e816c3a63fa queue=cluster error="failed to scale target: failed to perform scaling action: 1 selected nodes are not found within ASG"
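A minimal sketch of a dry-run variant of the us-west-2 policy target, in case it helps narrow things down (with `dry-run = "true"` the policy should still be evaluated and logged, but the ASG is left untouched):

      target "aws-asg-us-west-2" {
        # Evaluate and log the scaling decision without changing the ASG.
        dry-run      = "true"
        aws_asg_name = "nomad-clients-us-west-2-888"
        node_class   = "us-west-2-aws"
      }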

Is it because the ASG name differs from the Launch Configuration name?

Hi @petrukngantuk1 :wave:

Your configuration and policies look right to me. I think that you are experiencing a bug in the way the Autoscaler selects nodes for removal.
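In the meantime, it may also be worth double-checking that the `node_class` filter in each policy matches the `node_class` set on the Nomad clients in that region, since that filter is what the Autoscaler uses to select candidate nodes before mapping them to instances in the ASG. A minimal sketch of the relevant client stanza, assuming your us-west-2 clients are configured along these lines:

      client {
        enabled = true

        # Must match the node_class used in the "aws-asg-us-west-2" policy target.
        node_class = "us-west-2-aws"
      }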

I went ahead and filed an issue, and we'll look further into this. I think we have everything we need to get this fixed, but feel free to let us know if you run into any other problems.

Apologies for the trouble, and thank you for bringing this to our attention.

No problem, glad to help…