Making sense of "failed to place allocation" logs

Hi there. I have a nomad job run invocation that submits one job with five task groups to a nomad agent -dev.
Sometimes I see logs like the following:

==> 2021-10-08T20:06:04Z: Monitoring evaluation "0e69db14-1c06-6e6b-cdd2-4553e9709dab"
    2021-10-08T20:06:04Z: Evaluation triggered by job "grapl-local-infra"
==> 2021-10-08T20:06:05Z: Monitoring evaluation "0e69db14-1c06-6e6b-cdd2-4553e9709dab"
    2021-10-08T20:06:05Z: Evaluation within deployment: "4e8448b1-a9c6-bb85-0624-aeb80ed3b2da"
    2021-10-08T20:06:05Z: Evaluation status changed: "pending" -> "complete"
==> 2021-10-08T20:06:05Z: Evaluation "0e69db14-1c06-6e6b-cdd2-4553e9709dab" finished with status "complete" but failed to place all allocations:
    2021-10-08T20:06:05Z: Task Group "localstack" (failed to place 1 allocation):
      * No nodes were eligible for evaluation
      * No nodes are available in datacenter "dc1"
    2021-10-08T20:06:05Z: Task Group "ratel" (failed to place 1 allocation):
      * No nodes were eligible for evaluation
      * No nodes are available in datacenter "dc1"
    2021-10-08T20:06:05Z: Task Group "kafka" (failed to place 1 allocation):
      * No nodes were eligible for evaluation
      * No nodes are available in datacenter "dc1"
    2021-10-08T20:06:05Z: Task Group "zookeeper" (failed to place 1 allocation):
      * No nodes were eligible for evaluation
      * No nodes are available in datacenter "dc1"
    2021-10-08T20:06:05Z: Task Group "redis" (failed to place 1 allocation):
      * No nodes were eligible for evaluation
      * No nodes are available in datacenter "dc1"
    2021-10-08T20:06:05Z: Evaluation "5b157950-dd6a-c9e4-44af-57f25edaa4a2" waiting for additional capacity to place remainder
==> 2021-10-08T20:06:05Z: Monitoring deployment "4e8448b1-a9c6-bb85-0624-aeb80ed3b2da"
 
2021-10-08T20:06:40Z
ID          = 4e8448b1-a9c6-bb85-0624-aeb80ed3b2da
Job ID      = grapl-local-infra
Job Version = 0
Status      = successful
Description = Deployment completed successfully
 
Deployed
Task Group  Desired  Placed  Healthy  Unhealthy  Progress Deadline
kafka       1        1       1        0          2021-10-08T20:16:38Z
localstack  1        1       1        0          2021-10-08T20:16:39Z
ratel       1        1       1        0          2021-10-08T20:16:18Z
redis       1        1       1        0          2021-10-08T20:16:18Z
zookeeper   1        1       1        0          2021-10-08T20:16:38Z
 
Allocations
ID                                    Eval ID                               Node ID                               Node Name                   Task Group  Version  Desired  Status   Created               Modified
32ee55c1-80cd-b299-6757-e78432ca94e1  5b157950-dd6a-c9e4-44af-57f25edaa4a2  75168830-24d1-6da6-7652-bb6606aea354  ip-10-0-5-206.ec2.internal  redis       0        run      running  2021-10-08T20:06:04Z  2021-10-08T20:06:18Z
7b674228-17be-932f-f999-afb881b8ccb7  5b157950-dd6a-c9e4-44af-57f25edaa4a2  75168830-24d1-6da6-7652-bb6606aea354  ip-10-0-5-206.ec2.internal  zookeeper   0        run      running  2021-10-08T20:06:04Z  2021-10-08T20:06:38Z
88077d7b-2dd1-d17c-f795-ac093893c381  5b157950-dd6a-c9e4-44af-57f25edaa4a2  75168830-24d1-6da6-7652-bb6606aea354  ip-10-0-5-206.ec2.internal  ratel       0        run      running  2021-10-08T20:06:04Z  2021-10-08T20:06:18Z
b0c9cfad-c2aa-673a-41fe-019f24ae329b  5b157950-dd6a-c9e4-44af-57f25edaa4a2  75168830-24d1-6da6-7652-bb6606aea354  ip-10-0-5-206.ec2.internal  kafka       0        run      running  2021-10-08T20:06:04Z  2021-10-08T20:06:38Z
e5654b8c-1fcc-8842-1598-e3c8a19645c0  5b157950-dd6a-c9e4-44af-57f25edaa4a2  75168830-24d1-6da6-7652-bb6606aea354  ip-10-0-5-206.ec2.internal  localstack  0        run      running  2021-10-08T20:06:04Z  2021-10-08T20:06:39Z

<and then an exit code 2>

So, my takeaways from these logs are:

  • The first evaluation 0e69db14 had problems
  • A guess: Nomad automatically scheduled a follow-up evaluation 5b157950, which succeeded (see the nomad eval status sketch just below)
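
If that guess is right, I assume I could confirm it by inspecting both evaluations (a rough sketch using the IDs from the logs above; I haven't verified the exact output fields):

# Original evaluation: should show the placement failures and point at a follow-up/blocked evaluation
nomad eval status 0e69db14-1c06-6e6b-cdd2-4553e9709dab
# Follow-up evaluation: the one that (I think) actually placed the allocations
nomad eval status 5b157950-dd6a-c9e4-44af-57f25edaa4a2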

And then, even after this successful retry, the command still exits with code 2, presumably because, per the documentation:

On successful job submission and scheduling, exit code 0 will be returned. If there are job placement issues encountered (unsatisfiable constraints, resource exhaustion, etc), then the exit code will be 2. Any other errors, including client connection issues or internal errors, are indicated by exit code 1.
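
If that is what's happening, the exit code reflects the initial placement failure rather than the eventual deployment outcome. One option I'm considering is treating exit code 2 as a soft failure and inspecting the deployment instead of aborting; a rough, untested sketch (grapl-local-infra.nomad is just a guess at my job file name):

# grapl-local-infra.nomad is a placeholder for my actual job file
nomad job run grapl-local-infra.nomad
rc=$?
case "$rc" in
  0) echo "job submitted and scheduled" ;;
  2) echo "placement issues were reported; checking deployment status before deciding" ;;
  *) echo "nomad job run failed with exit code $rc" >&2; exit "$rc" ;;
esac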

So, 2 questions here:

  • Is my read of the situation correct, i.e. am I getting an exit code 2 despite tasks that are seemingly running successfully?
  • I suspect the logic that waits for my backgrounded nomad agent -dev & to be ready may be broken; it is currently the following:
timeout 120 bash -c -- 'while [[ -z $(nomad status 2>&1 | grep running) ]]; do printf "Waiting for nomad-agent\n";sleep 1;done'

Perhaps checking for "ready" in the output of nomad node status would be preferable instead?
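
My suspicion is that nomad status prints "No running jobs" as soon as the server API is up, which already matches grep running before any client node has registered, whereas nomad node status should only report "ready" once a client node can actually accept placements. Roughly what I have in mind (an untested sketch):

# Wait until at least one client node reports "ready" instead of grepping nomad status
timeout 120 bash -c -- 'until nomad node status 2>/dev/null | grep -q ready; do
  printf "Waiting for a ready Nomad client node\n"
  sleep 1
done'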