If we are running a Nomad cluster on VM machines and need to autoscale Docker applications, should I configure the Autoscaler on the servers, or should I run the Autoscaler on the cluster as a separate job?
Hi @ctr0306,
Both options are possible, but running the Autoscaler as a Nomad job is usually easier. Here's a sample job from our horizontal application scaling demo.
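For reference, a rough sketch of what such a job can look like, assuming the hashicorp/nomad-autoscaler Docker image and an agent config rendered from a template block (the actual demo job may differ in details; nomad-server-ip is a placeholder):

job "autoscaler" {
  datacenters = ["dc1"]
  type        = "service"

  group "autoscaler" {
    count = 1

    task "autoscaler" {
      driver = "docker"

      config {
        image   = "hashicorp/nomad-autoscaler:0.3.3"
        command = "nomad-autoscaler"
        args    = ["agent", "-config", "${NOMAD_TASK_DIR}/config.hcl"]
      }

      # Minimal agent configuration; point the address at your Nomad HTTP API.
      template {
        data = <<EOF
nomad {
  address = "http://nomad-server-ip:4646"
}
EOF
        destination = "${NOMAD_TASK_DIR}/config.hcl"
      }
    }
  }
}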
Hi @lgfa29
Thanks a lot for your reply, but I have a question here.
I have 10 VM machines as clients to the Nomad master.
So if I want to run autoscaling for a (Docker) application, do I need to run the Autoscaler on all 10 machines as a Docker container, or is it fine to run one Autoscaler alongside my 10 VM machines?
How can I manage autoscaling of my application across the 10 VM machines?
Thanks
ctr0306
You only need one Autoscaler, and it doesn't matter how many VMs you have. You will run the Autoscaler as a Nomad job, so it will be scheduled on one of those VMs.
Once you have it running, you can update the Docker job that you want to autoscale with a scaling block to define its policy.
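As a rough illustration, a scaling block with a single check sits inside the task group being scaled; the metric, source, and numbers below are placeholders, assuming a Prometheus APM plugin is configured in the Autoscaler:

group "app" {
  count = 3

  scaling {
    enabled = true
    min     = 1
    max     = 10

    policy {
      cooldown = "1m"

      check "example" {
        source = "prometheus"
        # Hypothetical query; use whatever metric should drive scaling.
        query  = "avg(up{job=\"my_app\"})"

        strategy "target-value" {
          target = 1
        }
      }
    }
  }
}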
Hi @lgfa29
Thanks a lot. If that is the case, do I need to write any HCL file, for example autoscaling.hcl?
How do I bind the Autoscaler to the Nomad server?
@lgfa29
I configured autoscaler.hcl as below and ran it with ./nomad-autoscaler agent --config /etc/autoscaler.hcl
I got this error: 2021-02-09T16:02:06.006Z [ERROR] agent: failed to setup HTTP getHealth server: error="could not setup HTTP listener: listen tcp nomad-server-ip:9999: bind: cannot assign requested address"
http {
  bind_address = "nomad-server-ip"
  bind_port    = 9999
}

nomad {
  address = "http://nomad-server-ip:4646"
}

apm "prometheus" {
  driver = "prometheus"
  config = {
    address = "http://prometheus-server-ip:9090"
  }
}

strategy "target-value" {
  driver = "target-value"
}
This error indicates that the Autoscaler can't listen on port 9999 of nomad-server-ip. Are you using a real IP address as bind_address? And is port 9999 being used by some other process?
bind_address should be the IP of the host (it will default to 127.0.0.1, so you normally wouldn't have to change it). The bind_port should be a port that is not being used on the host.
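For example, a minimal sketch of a corrected http block, assuming you want the agent reachable on any interface of the host it runs on (0.0.0.0 is a wildcard; use the host's real IP to restrict it):

http {
  bind_address = "0.0.0.0"
  bind_port    = 9999
}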
Hi @lgfa29
Sorry for not responding… I have been away for personal reasons.
Your suggestion worked, but now I am getting the errors below while autoscaling. Could you please advise?
Feb 18 15:42:51 f989b069-58cb-65f9-b212-ac251eb10eef nomad-autoscaler[913440]: 2021-02-18T15:42:51.075Z [INFO] policy_eval.broker: eval nack'd, retrying it: eval_id=098e43b3-4985-ca57-9039-66027a544941 policy_id=44d25ac2-9069-768d-2d9a-a87dc8202f20 token=eb4738b5-806a-3ac0-231f-977148901c54
Feb 18 15:42:51 f989b069-58cb-65f9-b212-ac251eb10eef nomad-autoscaler[913440]: 2021-02-18T15:42:51.080Z [INFO] policy_eval.worker.check_handler: scaling target: check=uptime id=4fc5b8e9-566d-713e-571d-0bd9c9480fff policy_id=44d25ac2-9069-768d-2d9a-a87dc8202f20 queue=horizontal source=prometheus strategy=target-value target=nomad-target from=3 to=2 reason="capped count from 1 to 2 to stay within limits" meta="map[nomad_autoscaler.count.capped:true nomad_autoscaler.count.original:1 nomad_autoscaler.reason_history:[scaling down because factor is 0.277778 scaling down because factor is 0.277778] nomad_policy_id:44d25ac2-9069-768d-2d9a-a87dc8202f20]"
Feb 18 15:42:51 f989b069-58cb-65f9-b212-ac251eb10eef nomad-autoscaler[913440]: 2021-02-18T15:42:51.085Z [ERROR] policy_eval.worker.check_handler: failed to submit scaling action to target: check=uptime id=4fc5b8e9-566d-713e-571d-0bd9c9480fff policy_id=44d25ac2-9069-768d-2d9a-a87dc8202f20 queue=horizontal source=prometheus strategy=target-value target=nomad-target error="failed to scale group /: Unexpected response code: 400 (job scaling blocked due to active deployment)"
Feb 18 15:42:51 f989b069-58cb-65f9-b212-ac251eb10eef nomad-autoscaler[913440]: 2021-02-18T15:42:51.085Z [ERROR] policy_eval.worker: failed to evaluate policy: eval_id=098e43b3-4985-ca57-9039-66027a544941 eval_token=3050cfab-fe60-ee1d-ef4c-0b1012154771 id=4fc5b8e9-566d-713e-571d-0bd9c9480fff policy_id=44d25ac2-9069-768d-2d9a-a87dc8202f20 queue=horizontal err="failed to scale target: failed to scale group /: Unexpected response code: 400 (job scaling blocked due to active deployment)"
Feb 18 15:42:51 f989b069-58cb-65f9-b212-ac251eb10eef nomad-autoscaler[913440]: 2021-02-18T15:42:51.085Z [WARN] policy_eval.broker: eval delivery limit reached: eval_id=098e43b3-4985-ca57-9039-66027a544941 policy_id=44d25ac2-9069-768d-2d9a-a87dc8202f20 token=3050cfab-fe60-ee1d-ef4c-0b1012154771 count=2 limit=2
Feb 18 15:43:01 f989b069-58cb-65f9-b212-ac251eb10eef nomad-autoscaler[913440]: 2021-02-18T15:43:01.066Z [WARN] policy_manager.policy_handler: failed to get target status: policy_id=44d25ac2-9069-768d-2d9a-a87dc8202f20 error="Unexpected response code: 500 (No path to region)"
Feb 18 15:48:11 f989b069-58cb-65f9-b212-ac251eb10eef nomad-autoscaler[913440]: 2021-02-18T15:48:11.051Z [WARN] policy_manager.policy_handler: failed to get target status: policy_id=44d25ac2-9069-768d-2d9a-a87dc8202f20 error="Unexpected response code: 500 (No path to region)"
Feb 18 15:48:21 f989b069-58cb-65f9-b212-ac251eb10eef nomad-autoscaler[913440]: 2021-02-18T15:48:21.050Z [WARN] policy_manager.policy_handler: failed to get target status: policy_id=44d25ac2-9069-768d-2d9a-a87dc8202f20 error="Unexpected response code: 500 (No path to region)"
No worries @ctr0306, we are here to help at any time.
This error message indicates that your jobs might be running in a different region, so you will need to configure the Autoscaler to connect to that specific region. You can do this in the Autoscaler configuration file, using the region parameter inside the nomad block.
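For instance, assuming the jobs run in a region named us-east-1 (a placeholder; replace it with the region your jobs actually run in), the nomad block would become:

nomad {
  address = "http://nomad-server-ip:4646"
  region  = "us-east-1"
}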
Hi @lgfa29,
Thanks a lot for all your help. Now I am able to autoscale based on the uptime query: query = "avg(up{job="nomad_node_exporter"})"
If I try to autoscale based on the number of allocations, it fails. My query is: query = "avg(nomad_client_allocations_running{job="nomad"})"
Please correct me if I am doing anything wrong in using the nomad_client_allocations_running query.
group "test" {
  count = 3

  constraint {
    attribute = "${node.class}"
    value     = "CTR"
  }

  scaling {
    enabled = true
    min     = 2
    max     = 4

    policy {
      cooldown = "20s"

      check "uptime" {
        source = "prometheus"
        query  = "avg(nomad_client_allocations_running{job="nomad"})"
        query  = "avg(up{job="nomad_node_exporter"})"

        strategy "target-value" {
          target = 0.5
          target = 1.2
        }
      }
    }
  }
}
Thanks
ctr0306
Hum… It's kind of hard to tell what's wrong without more details, like logs or error messages. What are you seeing as the failure?
From the job snippet I see a few things:
- the query should escape the inner quote, so something like this: query = "avg(nomad_client_allocations_running{job=\"nomad\"})", but I am not sure if this was just the HTML formatting
- there are 2 query and 2 target in the same check. That's not a valid policy, since each policy should only have one of each per check. If you have 2 metrics you will need 2 check blocks.
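To illustrate, a valid version of that policy could split the two metrics into two check blocks, each with a single query and a single target. The values below are simply the ones from the snippet, and the pairing of targets to checks is a guess, not a recommendation:

policy {
  cooldown = "20s"

  check "allocations" {
    source = "prometheus"
    query  = "avg(nomad_client_allocations_running{job=\"nomad\"})"

    strategy "target-value" {
      target = 1.2
    }
  }

  check "uptime" {
    source = "prometheus"
    query  = "avg(up{job=\"nomad_node_exporter\"})"

    strategy "target-value" {
      target = 0.5
    }
  }
}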
Hi @lgfa29,
I will get back to you with error logs, but what is the general query I should use to scale up or scale down based on nomad_client_allocations_running in Prometheus?
Hi @lgfa29,
Can I know how to run the Autoscaler as a service?
You can run the Autoscaler as a normal service job in Nomad. It's hard to provide any specific job file because it will depend on your infrastructure, but I would recommend looking at our demos for specific examples.
I too can relate to this confusion/doubt: "how do I run the autoscaler daemon?"
Answered in the autoscaler job example by @lgfa29
@lgfa29 for a PROD scenario, would it be prudent to reserve the node(s) where the autoscaler daemon runs? (preferably something like an AWS ASG with a fixed count of 1 (or 2))
Though the eternal question of how to prevent other jobs from running on the autoscaler nodes remains (!)
Hi @shantanugadgil,
We avoid shutting down the node where the Autoscaler is running (source code), so you shouldn't have to reserve any pool of nodes for the Autoscaler specifically.
If your use case requires scaling an ASG to 0, then you should avoid running the Autoscaler in that ASG. You can control this by adding a constraint in the Autoscaler job targeting the node_class, for example:
job "autoscaler" {
# Don't run in the ASG that is being scaled.
constraint {
attribute = "${node.class}"
operator = "!="
value = "hashistack"
}
...
}
No, my thought process was for the scenario the other way round: I would opt the autoscaler job in to a node_class of value autoscaler. The usual problem I was referring to was "how to prevent actual compute jobs from landing onto the nodes reserved for the autoscaler".
Ah got it.
Yes, that's a tricky one, since it would require every non-autoscaler job to have a constraint to not run in that ASG. Some kind of job templating would help, but it doesn't provide any guarantees if, for example, someone forgets to add the constraint.
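For completeness, a sketch of the constraint each non-autoscaler job would need, mirroring the earlier example (assuming the reserved nodes use node_class = "autoscaler"):

job "compute" {
  # Keep regular compute work off the nodes reserved for the Autoscaler.
  constraint {
    attribute = "${node.class}"
    operator  = "!="
    value     = "autoscaler"
  }
  ...
}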