Configuring the Nomad Autoscaler

If we are running a Nomad cluster on VM machines and need to autoscale Docker applications, should I configure the Autoscaler on the servers, or run it on the cluster as a separate job?

Hi @ctr0306,

Both options are possible, but running the Autoscaler as a Nomad job is usually easier. Here’s a sample job from our horizontal application scaling demo.
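A trimmed-down sketch of that demo job looks roughly like this (the Docker image tag, datacenter name, and address template are illustrative and will need adjusting for your cluster):

job "autoscaler" {
  datacenters = ["dc1"]

  group "autoscaler" {
    count = 1

    task "autoscaler" {
      driver = "docker"

      config {
        # Image tag is illustrative; pin whichever release you are deploying.
        image   = "hashicorp/nomad-autoscaler:0.3.3"
        command = "nomad-autoscaler"
        args    = ["agent", "-config", "${NOMAD_TASK_DIR}/config.hcl"]
      }

      template {
        data = <<EOF
nomad {
  # Talk to the Nomad agent on whichever node the task lands on.
  address = "http://{{env "attr.unique.network.ip-address" }}:4646"
}
EOF
        destination = "${NOMAD_TASK_DIR}/config.hcl"
      }
    }
  }
}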

Hi @lgfa29,

Thanks a lot for your reply, but I have a question here.
I have 10 VM machines as clients to the Nomad server.

So if I want to run autoscaling for a (Docker) application, do I need to run the Autoscaler on all 10 machines as a Docker container, or is it fine if I run a single Autoscaler alongside my 10 VM machines?

How can I manage autoscaling of my application across the 10 VM machines?

Thanks
ctr0306


You only need one Autoscaler, no matter how many VMs you have. You will run the Autoscaler as a Nomad job, so it will be scheduled on one of those VMs.

Once you have it running, you can update the Docker job that you want to autoscale with a scaling block to define its policy.
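As an illustrative sketch (the group name, metric, and thresholds here are placeholders, not from any real job):

group "app" {
  count = 3

  scaling {
    enabled = true
    min     = 1
    max     = 10

    policy {
      cooldown = "1m"

      check "active_sessions" {
        source = "prometheus"
        # Placeholder metric; use whatever signal reflects your app's load.
        query = "avg(my_app_active_sessions)"

        strategy "target-value" {
          target = 10
        }
      }
    }
  }
}

The Autoscaler will then adjust the group's count between min and max to keep the query result near the target.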


Hi @lgfa29

Thanks a lot. If that is the case, do I need to write any HCL file, for example autoscaling.hcl?
How do I bind the Autoscaler to the Nomad server?

@lgfa29

I configured autoscaler.hcl as below and ran it as ./nomad-autoscaler agent --config /etc/autoscaler.hcl.
I got this error: 2021-02-09T16:02:06.006Z [ERROR] agent: failed to setup HTTP getHealth server: error="could not setup HTTP listener: listen tcp nomad-server-ip:9999: bind: cannot assign requested address"

http {
  bind_address = "nomad-server-ip"
  bind_port    = 9999
}

nomad {
  address = "http://nomad-server-ip:4646"
}

apm "prometheus" {
  driver = "prometheus"

  config = {
    address = "http://prometheus-server-ip:9090"
  }
}

strategy "target-value" {
  driver = "target-value"
}

This error indicates that the Autoscaler can’t listen on port 9999 of nomad-server-ip. Are you using a real IP address as bind_address? And is port 9999 being used by some other process?

bind_address should be the IP of the host (it defaults to 127.0.0.1, so you normally wouldn't have to change it). bind_port should be a port that is not already in use on the host.
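For example, something like this should bind successfully (0.0.0.0 and 8080 are just illustrative choices):

http {
  bind_address = "0.0.0.0" # listen on all interfaces; the default is 127.0.0.1
  bind_port    = 8080      # any port that is free on the host
}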

Hi @lgfa29,

Sorry for not responding; I have been away for personal reasons.
Your suggestion worked, but now I am getting the errors below while autoscaling. Could you please suggest a fix?

Feb 18 15:42:51 f989b069-58cb-65f9-b212-ac251eb10eef nomad-autoscaler[913440]: 2021-02-18T15:42:51.075Z [INFO] policy_eval.broker: eval nack'd, retrying it: eval_id=098e43b3-4985-ca57-9039-66027a544941 policy_id=44d25ac2-9069-768d-2d9a-a87dc8202f20 token=eb4738b5-806a-3ac0-231f-977148901c54
Feb 18 15:42:51 f989b069-58cb-65f9-b212-ac251eb10eef nomad-autoscaler[913440]: 2021-02-18T15:42:51.080Z [INFO] policy_eval.worker.check_handler: scaling target: check=uptime id=4fc5b8e9-566d-713e-571d-0bd9c9480fff policy_id=44d25ac2-9069-768d-2d9a-a87dc8202f20 queue=horizontal source=prometheus strategy=target-value target=nomad-target from=3 to=2 reason="capped count from 1 to 2 to stay within limits" meta="map[nomad_autoscaler.count.capped:true nomad_autoscaler.count.original:1 nomad_autoscaler.reason_history:[scaling down because factor is 0.277778 scaling down because factor is 0.277778] nomad_policy_id:44d25ac2-9069-768d-2d9a-a87dc8202f20]"
Feb 18 15:42:51 f989b069-58cb-65f9-b212-ac251eb10eef nomad-autoscaler[913440]: 2021-02-18T15:42:51.085Z [ERROR] policy_eval.worker.check_handler: failed to submit scaling action to target: check=uptime id=4fc5b8e9-566d-713e-571d-0bd9c9480fff policy_id=44d25ac2-9069-768d-2d9a-a87dc8202f20 queue=horizontal source=prometheus strategy=target-value target=nomad-target error="failed to scale group /: Unexpected response code: 400 (job scaling blocked due to active deployment)"
Feb 18 15:42:51 f989b069-58cb-65f9-b212-ac251eb10eef nomad-autoscaler[913440]: 2021-02-18T15:42:51.085Z [ERROR] policy_eval.worker: failed to evaluate policy: eval_id=098e43b3-4985-ca57-9039-66027a544941 eval_token=3050cfab-fe60-ee1d-ef4c-0b1012154771 id=4fc5b8e9-566d-713e-571d-0bd9c9480fff policy_id=44d25ac2-9069-768d-2d9a-a87dc8202f20 queue=horizontal err="failed to scale target: failed to scale group /: Unexpected response code: 400 (job scaling blocked due to active deployment)"
Feb 18 15:42:51 f989b069-58cb-65f9-b212-ac251eb10eef nomad-autoscaler[913440]: 2021-02-18T15:42:51.085Z [WARN] policy_eval.broker: eval delivery limit reached: eval_id=098e43b3-4985-ca57-9039-66027a544941 policy_id=44d25ac2-9069-768d-2d9a-a87dc8202f20 token=3050cfab-fe60-ee1d-ef4c-0b1012154771 count=2 limit=2
Feb 18 15:43:01 f989b069-58cb-65f9-b212-ac251eb10eef nomad-autoscaler[913440]: 2021-02-18T15:43:01.066Z [WARN] policy_manager.policy_handler: failed to get target status: policy_id=44d25ac2-9069-768d-2d9a-a87dc8202f20 error="Unexpected response code: 500 (No path to region)"
Feb 18 15:48:11 f989b069-58cb-65f9-b212-ac251eb10eef nomad-autoscaler[913440]: 2021-02-18T15:48:11.051Z [WARN] policy_manager.policy_handler: failed to get target status: policy_id=44d25ac2-9069-768d-2d9a-a87dc8202f20 error="Unexpected response code: 500 (No path to region)"
Feb 18 15:48:21 f989b069-58cb-65f9-b212-ac251eb10eef nomad-autoscaler[913440]: 2021-02-18T15:48:21.050Z [WARN] policy_manager.policy_handler: failed to get target status: policy_id=44d25ac2-9069-768d-2d9a-a87dc8202f20 error="Unexpected response code: 500 (No path to region)"

No worries @ctr0306, we are here to help at any time :slightly_smiling_face:

This error message indicates that your jobs might be running in a different region, so you will need to configure the Autoscaler to connect to that specific region. You can do this in the Autoscaler configuration file, using the region parameter inside the nomad block.
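For example (the region name is a placeholder):

nomad {
  address = "http://nomad-server-ip:4646"
  region  = "some-region" # must match the region where the scaled jobs run
}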

Hi @lgfa29,

Thanks a lot for all your help. I am now able to autoscale based on the uptime query:

query = "avg(up{job="nomad_node_exporter"})"

But if I try to autoscale based on the number of allocations, it fails. That query is:

query = "avg(nomad_client_allocations_running{job="nomad"})"

Please correct me if I am doing anything wrong in using the nomad_client_allocations_running query.

group "test" {
  count = 3

  constraint {
    attribute = "${node.class}"
    value     = "CTR"
  }

  scaling {
    enabled = true
    min     = 2
    max     = 4

    policy {
      cooldown = "20s"

      check "uptime" {
        source = "prometheus"
        query  = "avg(nomad_client_allocations_running{job="nomad"})"

        query  = "avg(up{job="nomad_node_exporter"})"

        strategy "target-value" {
          target = 0.5

          target = 1.2
        }
      }
    }
  }
}

Thanks
ctr0306

Hmm… it's kind of hard to tell what's wrong without more details, like logs or error messages. What are you seeing as the failure?

From the job snippet I see a few things:

  • the query should escape the inner quotes, so something like query = "avg(nomad_client_allocations_running{job=\"nomad\"})", but I am not sure if this was just the HTML formatting
  • there are 2 query and 2 target values in the same check. That's not a valid policy, since each check should have only one of each. If you have 2 metrics you will need 2 check blocks; see the sketch after this list.
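Putting both fixes together, a corrected policy could look like this (the target values are carried over from your snippet and may still need tuning):

scaling {
  enabled = true
  min     = 2
  max     = 4

  policy {
    cooldown = "20s"

    check "uptime" {
      source = "prometheus"
      query  = "avg(up{job=\"nomad_node_exporter\"})"

      strategy "target-value" {
        target = 0.5
      }
    }

    check "running_allocs" {
      source = "prometheus"
      query  = "avg(nomad_client_allocations_running{job=\"nomad\"})"

      strategy "target-value" {
        target = 1.2
      }
    }
  }
}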

Hi @lgfa29,

I will get back to you with the error logs, but what general query should I use to scale up or scale down based on "nomad_client_allocations_running" in Prometheus?

Hi @lgfa29,

Can I know how to run the Autoscaler as a service?

You can run the Autoscaler as a normal service job in Nomad. It’s hard to provide any specific job file because it will depend on your infrastructure, but I would recommend looking at our demos for specific examples.

I too can relate to this confusion/doubt: “how do I run the autoscaler daemon?”
:slightly_smiling_face:

Answered in the autoscaler job example by @lgfa29 :+1:

@lgfa29 for a PROD scenario, would it be prudent to reserve the node(s) where the autoscaler daemon runs? (preferably something like an AWS ASG with a fixed count of 1 (or 2))

Though the eternal question of how to prevent other jobs from running on the autoscaler nodes remains (!)

Hi @shantanugadgil,

We avoid shutting down the node where the Autoscaler is running (source code), so you shouldn’t have to reserve any pool of nodes for the Autoscaler specifically.

If your use case requires scaling an ASG to 0 then you should avoid running the Autoscaler in that ASG. You can control this by adding a constraint in the Autoscaler job targeting the node_class for example:

job "autoscaler" {
  # Don't run in the ASG that is being scaled.
  constraint {
    attribute = "${node.class}"
    operator  = "!="
    value     = "hashistack"
  }
...
}

No, my thought process was for the scenario the "other way round": I would opt the Autoscaler job in to a node_class of value autoscaler. The usual problem I was referring to was "how to prevent actual compute jobs from landing onto the 'reserved for autoscaler' nodes" :slight_smile:

Ah got it.

Yes, that's a tricky one, since it would require every non-autoscaler job to have a constraint to not run in that ASG. Some kind of job templating would help, but it doesn't provide any guarantees if, for example, someone forgets to add the constraint.
