I have configured the Nomad Autoscaler with an AWS ASG, with min 7 and max 200.
Version of nomad-autoscaler: v0.3.5.
The workload is CPU-bound, and the query to the Prometheus source is:
query = "ceil((sum(nomad_client_allocs_cpu_allocated{namespace=\"${namespace}\"}) + sum(nomad_nomad_blocked_evals_job_cpu{namespace=\"${namespace}\"} OR on() vector(0)))/6000)"
Problem
Sometimes the autoscaler marks a node ineligible even though there are blocked / unplaced evaluations, leaving the nodes in the ready and ineligible state. Ideally, the expected behaviour would be to either:
- not mark the node ineligible and place the workload on it, or
- remove the node from the ASG and scale the ASG out.
What are the possible reasons? Is this a known issue? Any pointers to a similar problem would be useful.
The nomad-autoscaler reacts to the response of the query, so to understand the behaviour better it would be useful to see log messages from the autoscaler at the time one of these situations occurs. If you have the full scaling policy, that would also help to better understand what strategy is being used.
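For reference, a cluster scaling policy for the aws-asg target with a Prometheus check generally looks something along these lines (the strategy, names, and deadlines below are illustrative guesses, not your actual configuration):

scaling "cluster_policy" {
  enabled = true
  min     = 7
  max     = 200

  policy {
    cooldown            = "2m"
    evaluation_interval = "1m"

    check "cpu_allocated" {
      source = "prometheus"
      query  = "..."   # the PromQL query shared above

      # If the query already returns the desired node count, the
      # pass-through strategy forwards that number to the target as-is.
      strategy "pass-through" {}
    }

    target "aws-asg" {
      dry-run             = "false"
      aws_asg_name        = "my-nomad-clients"   # illustrative ASG name
      node_class          = "compute"            # illustrative node class
      node_drain_deadline = "5m"
    }
  }
}

Knowing which strategy you use (target-value, pass-through, threshold, etc.) and the node_drain_deadline in particular would help explain when and why nodes get marked ineligible and drained.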
The query itself, if I am understanding it correctly, calculates the total allocated CPU for allocations running within “${namespace}”, adds the amount of CPU requested by evaluations that are blocked due to resource exhaustion, and divides the sum by 6000 to arrive at a node count.
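As a purely illustrative example: if allocations in the namespace were using 30,000 MHz of CPU and blocked evaluations were requesting another 6,500 MHz, the query would return ceil((30000 + 6500) / 6000) = ceil(6.08) = 7, i.e. a desired count of 7 client nodes (assuming the 6000 divisor corresponds to the allocatable CPU of a single node).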
“The nodes are in ready and ineligible state”
Is the autoscaler removing the nodes from the ASG, or just leaving them as ineligible? Again, a better understanding here would require logs and other context.
Thanks for the prompt response. I have observed two types of errors:
internal_plugin.aws-asg: node pool status readiness check failed: error="node 18a728a6-26dc-f161-bab4-0fce3e63fba1 is draining, node 5ac68c24-bb30-44f2-5829-581b1623895a is draining, node 76f875dd-8abf-d363-cfc4-7ff548f864b7 is draining, node 47e6591c-1d9b-0231-df05-089262080217 is draining, node 75fed26e-3dd5-235f-b78f-82ec4d7f89da is draining, node 05ceef1a-7bcf-7e7a-8794-88ba16dde885 is draining, node 7ae96fbf-6acf-58a6-a4fc-8217659265ef is draining, node 5e7e6984-7f6b-b383-d860-f6779cb6097a is draining, node f883811d-4da7-0c55-b59d-49dbb6059c38 is draining, node aa989f2d-52ba-86f5-954f-947b2a59bb06 is draining"
also
policy_eval.worker: failed to evaluate policy: eval_id=7777755e-7e27-09f8-80c0-dd017c3b4dee eval_token=5261ed5e-4f6d-2f87-ed53-d03d66b2b87c id=61f292fc-760c-e172-3095-e196982d4e84 policy_id=c4d837ab-f45c-303a-ea9a-8309a1b02590 queue=cluster error="failed to scale target: failed to perform scaling action: failed to perform pre-scale Nomad scale in tasks: context done while monitoring node drain: received error while draining node: Error monitoring node: Get "http://10.10.0.85:4646/v1/node/c9acf6e1-12ef-6236-269b-e662fc9b1a3a?index=12918&namespace=dev&region=global&stale=": dial tcp 10.10.0.85:4646: i/o timeout, context done while monitoring node drain: received error while draining node: Error monitoring allocations: Get "http://10.10.0.85:4646/v1/node/71d147c8-2031-02b5-feea-daee12267e19/allocations?index=12908&namespace=dev&region=global&stale=": dial tcp 10.10.0.85:4646: i/o timeout, context done while monitoring node drain: received error while draining node: Error monitoring allocations: Get "http://10.10.0.85:4646/v1/node/76f875dd-8abf-d363-cfc4-7ff548f864b7/allocations?index=13089&namespace=dev&region=global&stale=": dial tcp 10.10.0.85:4646: i/o timeout, context done while monitoring node drain: received error while draining node: Error monitoring allocations: Get "http://10.10.0.85:4646/v1/node/4592c999-9240-e62d-edd3-5c3faa37129e/allocations?index=13089&namespace=dev&region=global&stale=": dial tcp 10.10.0.85:4646: i/o timeout"
It seems the autoscaler is hitting a network problem while monitoring the node drain. Is there a proxy between the nomad-autoscaler and the Nomad agent it talks to, or any network instability?