My team and I run ETL jobs in Python on Dask. I was wondering what the Nomad way of running an ETL job would be? I think we'd still need a Dask scheduler and Dask workers running on Nomad. I especially like the idea of submitting ETL workflows like so, 'workflow run workflow_name', and defining ETL workflows with HCL text files.
Answering my own question: the Nomad way would be to use dispatch jobs. A dispatch job is essentially a function, i.e. a job with parameters. If you are doing ETL with Python, you can have one parameterized job template that runs a given Python script, and then dispatch it over the API (or with `nomad job dispatch`), passing the parameters for that unit of work.
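As a minimal sketch of such a template (job name, image, and meta keys are all made-up placeholders), the `parameterized` block marks the job as dispatchable and `meta_required` lists the parameters every dispatch must supply:

```hcl
# Hypothetical parameterized batch job; "etl-chunk", the image, and the
# meta keys are assumptions, not names from the original post.
job "etl-chunk" {
  type = "batch"

  parameterized {
    payload       = "optional"
    meta_required = ["start_id", "end_id"]
  }

  group "etl" {
    task "run" {
      driver = "docker"
      config {
        image   = "my-etl:latest"
        command = "python"
        # Dispatch meta is exposed as NOMAD_META_<key> and can be interpolated.
        args    = ["/app/etl.py", "${NOMAD_META_start_id}", "${NOMAD_META_end_id}"]
      }
    }
  }
}
```

Each invocation would then look like `nomad job dispatch -meta start_id=0 -meta end_id=9999 etl-chunk`.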
In this approach you have to parallelize Python functions manually: split the work yourself and send each piece as a separate dispatch job over the API. Ultimately this is the main difference from Dask, which handles that partitioning for you.
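The manual split-and-dispatch could be sketched like this, using only Nomad's real HTTP dispatch endpoint (`POST /v1/job/:job_id/dispatch`); the address, job name, and meta keys are assumptions:

```python
import json
import urllib.request

NOMAD_ADDR = "http://localhost:4646"  # assumption: default local Nomad address


def split_range(total_rows, n_chunks):
    """Split [0, total_rows) into up to n_chunks contiguous (start, end) ranges."""
    step = -(-total_rows // n_chunks)  # ceiling division
    return [(lo, min(lo + step, total_rows)) for lo in range(0, total_rows, step)]


def dispatch_chunk(job_id, start, end):
    """POST one chunk to Nomad's dispatch endpoint for a parameterized job."""
    body = json.dumps({"Meta": {"start_id": str(start), "end_id": str(end)}}).encode()
    req = urllib.request.Request(
        f"{NOMAD_ADDR}/v1/job/{job_id}/dispatch",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)  # response includes DispatchedJobID


# Fan out: one dispatched job per chunk; Nomad schedules them in parallel.
# for start, end in split_range(1_000_000, 10):
#     dispatch_chunk("etl-chunk", start, end)
```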
For the Extract part of ETL, Nomad could still be used, and it might be simpler than Dask. Say you are extracting data from an RDBMS table: you could split the data into several queries and send each query to a Nomad job with dispatch. These run in parallel (the only limitation being the RDBMS, not Nomad).
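A small sketch of that query splitting, assuming a numeric key column and a known max id (table and column names are placeholders); the payload helper reflects that the Nomad HTTP API expects dispatch payloads base64-encoded:

```python
import base64
import json


def extract_queries(table, key_col, max_id, n_chunks):
    """Build one SELECT per contiguous id range; each becomes one dispatch."""
    step = -(-max_id // n_chunks)  # ceiling division
    return [
        f"SELECT * FROM {table} WHERE {key_col} >= {lo} AND {key_col} < {min(lo + step, max_id)}"
        for lo in range(0, max_id, step)
    ]


def dispatch_body(sql):
    """Nomad dispatch payloads are base64-encoded in the HTTP API body."""
    return json.dumps({"Payload": base64.b64encode(sql.encode()).decode()})
```

Each query would then be sent to the parameterized job via `POST /v1/job/<job>/dispatch`, one dispatch per range.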