Low-latency task allocation issue

Our test setup consists of 5 nodes: 2 clients and 3 servers. Our system is highly distributed by nature; however, from a resilience perspective it must be able to run on a single (client) node. Fast rescheduling of a large number of tasks onto a single (client) node is therefore paramount. Our target is to reschedule 100 tasks in under a second.

Concretely, we want to reschedule 135 tasks after a node failure. All 135 tasks are running on one client node, because the other client node was manually drained; when that node fails, all 135 tasks should move to the remaining node. We understand that it would be better to spread these tasks across a larger number of nodes.
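For reference, the knobs we have found so far are at the job level, not the client level. Nomad's `reschedule` stanza controls how quickly failed or lost allocations are replaced; a sketch of what we tried (the values are illustrative, not a tested recommendation):

```hcl
job "example" {
  # Job-level reschedule policy for failed/lost allocations.
  # Defaults for service jobs use a 30s exponential delay; here we
  # shorten it to reduce time-to-reschedule after a node failure.
  reschedule {
    delay          = "5s"       # illustrative value, shorter than the default
    delay_function = "constant" # avoid exponential back-off between attempts
    unlimited      = true       # keep retrying until placement succeeds
  }
}
```

This helps with when the scheduler creates replacement allocations, but it does not appear to affect the batching we observe on the client itself.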

We observed that the client processes these tasks in batches (taking tasks from a queue?), see picture below. We are wondering what causes the delay between these batches. It seems to be internal behavior of the Nomad client and cannot be configured in the client configuration.

Our system has ample resources (CPU, memory, and bandwidth), so we would like to increase the parallelism or even remove the client's batching behavior entirely.
Are there any possibilities for doing so?
