I have configured the Nomad Autoscaler with an AWS ASG, with min 7 and max 200.
Version of nomad-autoscaler: v0.3.5.
The workload is CPU-bound, and the query to the Prometheus source is:
query = "ceil((sum(nomad_client_allocs_cpu_allocated{namespace=\"${namespace}\"}) + sum(nomad_nomad_blocked_evals_job_cpu{namespace=\"${namespace}\"} OR on() vector(0)))/6000)"
Problem
Sometimes the autoscaler marks a node ineligible even though there are blocked / unplaced evaluations, leaving the nodes in the ready and ineligible state. Ideally, the expected behaviour would be to either:
- not mark the node ineligible and place the workload on it, or
- remove the node from the ASG and scale the ASG out.
What are the possible reasons? Is this a known issue? Any pointers to a similar problem would be useful.
The nomad-autoscaler reacts to the response of the query, so to understand the behaviour better it would be useful to see log messages from the autoscaler at the time one of these situations occurs. If you have the full scaling policy, that would also help to better understand what strategy is being used.
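For reference, a cluster scaling policy for the aws-asg target with a Prometheus check generally looks something along these lines (the strategy, names, and deadlines below are illustrative guesses, not your actual configuration):

scaling "cluster_policy" {
  enabled = true
  min     = 7
  max     = 200

  policy {
    cooldown            = "2m"
    evaluation_interval = "1m"

    check "cpu_allocated" {
      source = "prometheus"
      query  = "..."   # the PromQL query shared above

      # If the query already returns the desired node count, the
      # pass-through strategy forwards that number to the target as-is.
      strategy "pass-through" {}
    }

    target "aws-asg" {
      dry-run             = "false"
      aws_asg_name        = "my-nomad-clients"   # illustrative ASG name
      node_class          = "compute"            # illustrative node class
      node_drain_deadline = "5m"
    }
  }
}

Knowing which strategy you use (target-value, pass-through, threshold, etc.) and the node_drain_deadline in particular would help explain when and why nodes get marked ineligible and drained.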
The query itself, if I am understanding it correctly, calculates the total allocated CPU for allocations running within “${namespace}”, adds the amount of CPU requested by evaluations that are blocked due to resource exhaustion, and divides the sum by 6000 to arrive at a node count.
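As a purely illustrative example: if allocations in the namespace were using 30,000 MHz of CPU and blocked evaluations were requesting another 6,500 MHz, the query would return ceil((30000 + 6500) / 6000) = ceil(6.08) = 7, i.e. a desired count of 7 client nodes (assuming the 6000 divisor corresponds to the allocatable CPU of a single node).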
“The nodes are in ready and ineligible state”
Is the autoscaler removing the nodes from the ASG, or just leaving them as ineligible? Again, a better understanding here would require logs and other context.
Thanks for the prompt response. I have observed two types of errors:
internal_plugin.aws-asg: node pool status readiness check failed: error="node 18a728a6-26dc-f161-bab4-0fce3e63fba1 is draining, node 5ac68c24-bb30-44f2-5829-581b1623895a is draining, node 76f875dd-8abf-d363-cfc4-7ff548f864b7 is draining, node 47e6591c-1d9b-0231-df05-089262080217 is draining, node 75fed26e-3dd5-235f-b78f-82ec4d7f89da is draining, node 05ceef1a-7bcf-7e7a-8794-88ba16dde885 is draining, node 7ae96fbf-6acf-58a6-a4fc-8217659265ef is draining, node 5e7e6984-7f6b-b383-d860-f6779cb6097a is draining, node f883811d-4da7-0c55-b59d-49dbb6059c38 is draining, node aa989f2d-52ba-86f5-954f-947b2a59bb06 is draining"
also
policy_eval.worker: failed to evaluate policy: eval_id=7777755e-7e27-09f8-80c0-dd017c3b4dee eval_token=5261ed5e-4f6d-2f87-ed53-d03d66b2b87c id=61f292fc-760c-e172-3095-e196982d4e84 policy_id=c4d837ab-f45c-303a-ea9a-8309a1b02590 queue=cluster error="failed to scale target: failed to perform scaling action: failed to perform pre-scale Nomad scale in tasks: context done while monitoring node drain: received error while draining node: Error monitoring node: Get "http://10.10.0.85:4646/v1/node/c9acf6e1-12ef-6236-269b-e662fc9b1a3a?index=12918&namespace=dev&region=global&stale=": dial tcp 10.10.0.85:4646: i/o timeout, context done while monitoring node drain: received error while draining node: Error monitoring allocations: Get "http://10.10.0.85:4646/v1/node/71d147c8-2031-02b5-feea-daee12267e19/allocations?index=12908&namespace=dev&region=global&stale=": dial tcp 10.10.0.85:4646: i/o timeout, context done while monitoring node drain: received error while draining node: Error monitoring allocations: Get "http://10.10.0.85:4646/v1/node/76f875dd-8abf-d363-cfc4-7ff548f864b7/allocations?index=13089&namespace=dev&region=global&stale=": dial tcp 10.10.0.85:4646: i/o timeout, context done while monitoring node drain: received error while draining node: Error monitoring allocations: Get "http://10.10.0.85:4646/v1/node/4592c999-9240-e62d-edd3-5c3faa37129e/allocations?index=13089&namespace=dev&region=global&stale=": dial tcp 10.10.0.85:4646: i/o timeout"
It seems the autoscaler is hitting a network problem while monitoring the node drain. Is there a proxy between the nomad-autoscaler and the Nomad agent it talks to, or any network instability?