Low-latency task allocation issue

Our test setup consists of 5 nodes: 2 clients and 3 servers. Our system is highly distributed by nature; however, from a resilience perspective it must be able to run on a single (client) node. Fast rescheduling of a large number of tasks onto a single (client) node is therefore paramount. Our target is to reschedule 100 tasks in under a second.

Concretely, we want to reschedule 135 tasks after a node failure. All 135 tasks are running on one client node, because the other client node was manually drained; when that node fails, all 135 tasks should move to the remaining node. We understand that it would be better to spread these tasks across a larger number of nodes.
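For reference, the knobs we have found so far are at the job level, not the client level. Nomad's `reschedule` stanza controls how quickly failed or lost allocations are replaced; a sketch of what we tried (the values are illustrative, not a tested recommendation):

```hcl
job "example" {
  # Job-level reschedule policy for failed/lost allocations.
  # Defaults for service jobs use a 30s exponential delay; here we
  # shorten it to reduce time-to-reschedule after a node failure.
  reschedule {
    delay          = "5s"       # illustrative value, shorter than the default
    delay_function = "constant" # avoid exponential back-off between attempts
    unlimited      = true       # keep retrying until placement succeeds
  }
}
```

This helps with when the scheduler creates replacement allocations, but it does not appear to affect the batching we observe on the client itself.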

We observed that the client processes these tasks in batches (taking tasks from a queue?), see picture below. We are wondering what causes the delay between these batches. It seems to be internal behavior of the Nomad client and cannot be configured in the client configuration.

Our system has ample resources (CPU, memory, and bandwidth), so we would like to increase the parallelism or even remove the client's batching behavior entirely.
Are there any possibilities for doing so?
