Nomad Autoscaler: how to delay scaling evaluation during allocation startup

We have configured a scaling policy for a job that uses more than 100% CPU during startup and stabilizes at 40% once the startup phase is over. The scaling policy is evaluated every 30s and scales when CPU usage reaches 50%.

Because of the high CPU usage during startup, the Autoscaler adds one allocation after another until the group reaches its max count. Once the app stabilizes, the allocations are scaled back down to the min count.

For a similar challenge, AWS Auto Scaling provides instanceWarmUp, which delays scaling evaluations during startup.

Is there a similar option in Nomad Autoscaler? If not, what is the best way to resolve this?

We are facing a similar problem in our setup, so I'm following this thread.

Can someone help with this challenge?

This is quite interesting. If you need help with this, the first step would be to give us a way to reproduce the problem. I understand that this may not be possible with complex architectures, though.

Could you share a sample version of your job file? (In .hcl please! lol)

I think this will do the trick.

Pay attention to cooldown and evaluation_interval attributes in the policy stanza.

job "example" {
  group "app" {
    scaling {
      min     = 2
      max     = 10
      enabled = true

      policy {
        evaluation_interval = "5s"
        cooldown            = "1m"

        check "active_connections" {
          source = "prometheus"
          query  = "scalar(open_connections_example_cache)"

          strategy "target-value" {
            target = 10
          }
        }
      }
    }
  }
}

From the docs:

  • cooldown - A time interval after a scaling action during which no additional scaling will be performed on the resource. It should be provided as a duration (e.g.: "5s", "1m"). If omitted the configuration value policy_default_cooldown from the agent will be used.
  • evaluation_interval - Defines how often the policy is evaluated by the Autoscaler. It should be provided as a duration (e.g.: "5s", "1m"). If omitted the configuration value default_evaluation_interval from the agent will be used.
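
Both entries mention agent-level fallbacks. As far as I know, these are set in the policy block of the Autoscaler agent configuration file, roughly as in the sketch below. The block layout is my assumption about the agent config, and the values are placeholders, so check it against the agent docs for your version.

```hcl
# Autoscaler agent configuration (sketch; values are placeholders).
policy {
  # Fallback used when a scaling policy omits `cooldown`.
  default_cooldown = "5m"

  # Fallback used when a scaling policy omits `evaluation_interval`.
  default_evaluation_interval = "30s"
}
```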

I think the cooldown will do the trick if you set it to a value that covers the time the job needs to stabilize.
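
As a rough sketch for the setup described in the question (30s evaluation, 50% CPU), the policy could look something like this. The job and group names, the min/max values, the nomad-apm source, the avg_cpu query and the 5m cooldown are all assumptions on my part, not taken from the actual job file. Pick a cooldown at least as long as the startup time you measure.

```hcl
job "example" {
  group "app" {
    scaling {
      min     = 2   # assumed
      max     = 10  # assumed
      enabled = true

      policy {
        # Evaluate every 30s, as in the question.
        evaluation_interval = "30s"

        # Assumed value: should cover the time a new allocation
        # needs to drop from its startup spike to steady state.
        cooldown = "5m"

        check "avg_cpu" {
          # Assumes the nomad-apm plugin is enabled on the agent.
          source = "nomad-apm"
          query  = "avg_cpu"

          strategy "target-value" {
            target = 50
          }
        }
      }
    }
  }
}
```

Keep in mind that the cooldown only starts after a scaling action, so this should limit the reaction to a startup spike to a single scaling step rather than prevent it entirely.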