Batch jobs: improve performances and number of concurrent executions

zalken · February 25, 2022, 9:27pm

Context

I’m already using Nomad for some light loads of batch jobs, for which it works great.
I now evaluate Nomad potential to schedule and dispatch heavier loads (1k-10k) of batch jobs.

I’m doing some load tests and benchmarks, executing simple jobs cmd.exe /c echo 'Hello World!' on Windows nodes. My goal is to see how much time Nomad server needs to schedule the jobs, and how fast Nomad client can execute the actions.

Main question

My main question is: what is the recommanded configuration to make it go fast, so that nodes are able to execute many (20, 50, or even 100) jobs concurrently?

I searched the documentation, online forums and issues, but couldn’t find much info regarding concurrent allocations on nodes. Nor performance expectations for batch jobs.

Observations

From my observations, with a job containing 1000 task groups (or 1000 jobs of 1 task group, results are the same), allocations are set to started status in less than 10 seconds.

As far as I can tell, at this stage the scheduling part is over, as the client is already known: the client ID which will execute the action is displayed on each task group in the UI. Besides, my job specifically defines constraint to target one single node.
But then, tasks are only executed/dequeued in small batch, from 0 to 5 concurrently. I’d expect such a simple command line to be much more paralellisable. In the end, it takes approximately 3min to run 1000 echo "Hello World!"
On the client node, data/client/state.db reaches 2,097,152 KB quite fast, and doesn’t shrink until it’s restarted/removed manually. nomad.exe process grows to 7GB of RAM, and doesn’t release it until several hours after the last execution (even after jobs are GCed from servers and allocs are cleaned locally)
For 10k jobs, it takes 30-40 minutes to complete

Thoughts

Is the throughput controlled by servers, or clients? After upgrading servers from 1.1.1 to 1.2.6 the client was able to run 2x more jobs concurrently, but sadly next day it got the same performances as before, even after restarting the client.
Could it be affected by GC? Should I fine tune the GC on servers/clients for this throughput? What would be appropriate values? On client, I tried to tweak gc_max_allocs with standard (50) as well as huge (50,000) values, didn’t feel a real difference. At the same time I played with gc_parallel_destroys without much success either. I regularly issued nomad system gc commands, and it seems to help for the overall stability of the cluster, when it’s filled with thousands of jobs.
Is raw_exec resilient enough for massive parallel execution? Although tasks seem to execute correctly (couldn’t check all 1000 however), client logs show weird errors about OpenProcess: The parameter is incorrect:

2022-02-25T13:02:58.516Z [WARN]  client.alloc_runner.task_runner: some environment variables not available for rendering: alloc_id=8fbee585-dd08-e393-aac8-0cac23b4217e task=ping keys=""
2022-02-25T13:02:58.516Z [INFO]  client.driver_mgr.raw_exec: starting task: driver=raw_exec driver_cfg="{Command:C:\\Windows\\System32\\cmd.exe Args:[/c echo 'Hello World!']}"
2022-02-25T13:02:58.522Z [INFO]  client.alloc_runner.task_runner.task_hook.logmon.nomad.exe: opening fifo: alloc_id=530735ef-5a4e-89e3-87d7-36c5cb50475b task=ping @module=logmon path=//./pipe/ping-b49bc9e9.stdout timestamp=2022-02-25T13:02:58.522Z
2022-02-25T13:02:58.522Z [INFO]  client.alloc_runner.task_runner.task_hook.logmon.nomad.exe: opening fifo: alloc_id=530735ef-5a4e-89e3-87d7-36c5cb50475b task=ping @module=logmon path=//./pipe/ping-b49bc9e9.stderr timestamp=2022-02-25T13:02:58.522Z
2022-02-25T13:02:58.592Z [ERROR] client.driver_mgr.raw_exec: destroying executor failed: driver=raw_exec err="rpc error: code = Unknown desc = executor failed to find process: OpenProcess: The parameter is incorrect."
2022-02-25T13:02:58.603Z [INFO]  client.alloc_runner.task_runner: not restarting task: alloc_id=51dd60be-cac6-94cf-e27f-6091069ad858 task=ping reason="Restart unnecessary as task terminated successfully"

On the C2M client configuration I could see that some options are set on the client, but I couldn’t find anything releated when searching through the Go code. Are those options of any interest?

  options {
    "alloc.rate_limit" = "50"
    "alloc.rate_burst" = "2"
  }

Should the number of schedulers on servers be adjusted? Recommandation is of one scheduler per core, what would be the impact of configuring more?

Environment

Datacenter is comprised of:

Cluster is running on Nomad 1.2.6 (also tested in 1.1.1, more or less same results)
3 servers with 8 cores CPU, 30GB RAM
1 client with 8 cores CPU, 32GB RAM, SSD disk

Client configuration:

    #--- General ---
    bind_addr = "0.0.0.0"
    datacenter = "NE1"
    data_dir = "C:\\PROGRA~1\\nomad\\data"
    
    #--- Updates ---
    disable_anonymous_signature = true
    disable_update_check = true
    
    #--- Logging ---
    log_file = "C:\\Program Files\\logs\\nomad-output.log"
    log_rotate_duration = "24h"
    log_rotate_max_files = 90
    
    #--- Client networking ---
    client {
      enabled = true
    
      servers = ["xxx:4847"]
    }
    
    ports {
      http = 4846
      rpc  = 4847
      serf = 4848
    }
    
    #--- Metrics ---
    telemetry {
      publish_allocation_metrics = true
      publish_node_metrics       = true
      prometheus_metrics         = true
    }
    
    #--- mTLS ---
    tls {
      http = true
      rpc  = true
    
      ca_file   = "C:\\Program Files\\nomad\\certs\\nomad-ca.pem"
      cert_file = "C:\\Program Files\\nomad\\certs\\nomad-client.pem"
      key_file  = "C:\\Program Files\\nomad\\certs\\nomad-client-key.pem"
    
      verify_server_hostname = true
      verify_https_client    = true
    }
    
    plugin "raw_exec" {
      config {
        enabled = true
      }
    }

Job specs

Here is the levant template used:

    job "hello-world-[[ timeNowUTC ]]" {
        datacenters = ["NE1"]
        type = "batch"
    
        constraint {
            attribute = "${node.unique.name}"
            operator  = "="
            value     = "ne1-z03-htes-p1"
        }
    
        # Adapt loop counter to spawn more or less task groups
        [[ range $i := loop 1000 ]]
        group "group-[[ $i ]]" {
            count = 1
    		restart {
                attempts = 0
    		    mode = "fail"
            }
            reschedule {
                attempts  = 0
                unlimited = false
            }
    		
            task "hello-world" {
                driver = "raw_exec"
                config {
                    command = "C:\\Windows\\System32\\cmd.exe"
                    args = ["/c echo 'Hello World!'"]
                }
    
                # Giving more resources doesn't change results
                resources {
                    cpu    = 10
                    memory = 10
                }
    			
    			logs {
    				max_files     = 1
    				max_file_size = 5
    			}
            }
    		
    		ephemeral_disk {
    			size    = 10
    		}
        }
        [[ end ]]
    }

Thanks a lot for your help! Love the product, and would love even more to understand how to take full advantage of it

zalken · April 29, 2022, 6:27pm

Bump, in hope to get traction

Topic		Replies	Views
Nomad v0.12.6 repeatedly killed by oom_reaper after a few thousand completed batch jobs Nomad	3	1099	October 27, 2020
Is it better one job with 1000 tasks or 1000 jobs with one task? Nomad	6	876	February 1, 2023
Tips for running nomad in resource-constrained environments? Nomad	8	2274	September 30, 2021
[nomad][server] leader change per minute when high workload Nomad	4	1281	April 27, 2021
Multi-node batch MPI jobs (Slurm style) Nomad	4	2389	October 25, 2023