I am trying to restart jobs on the same node after draining it. I was able to restart the service and system type jobs, but I have a problem with batch type jobs. Is there a way to restart them?
After the (forced) drain has finished, the batch job goes into the dead state and I am not able to restart it.
We use batch type jobs to train different models, but from time to time one GPU throws an error. To be able to use all GPUs on that server again, we need to (force) drain the node, restart the server, and, once Nomad is up and running again, disable the drain. All job types except batch start again afterwards.
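
For reference, the drain cycle described above corresponds roughly to the sequence below. This is only a sketch: it prints the Nomad CLI calls instead of running them, the helper name `drain_cycle_commands` and the `<node-id>` placeholder are made up for illustration, and the server reboot in between is left as a comment.

```python
def drain_cycle_commands(node_id: str) -> list[list[str]]:
    """Return the Nomad CLI calls for the drain / reboot / re-enable cycle.

    `node_id` is a placeholder for the ID of the node being drained.
    """
    return [
        # Force-drain: remove all allocations from the node immediately.
        ["nomad", "node", "drain", "-enable", "-force", "-yes", node_id],
        # (reboot the server here and wait until the Nomad client is back up)
        # Disable the drain so the node is eligible for scheduling again.
        ["nomad", "node", "drain", "-disable", "-yes", node_id],
    ]


if __name__ == "__main__":
    # Print the commands rather than executing them.
    for argv in drain_cycle_commands("<node-id>"):
        print(" ".join(argv))
```

After the second command the service and system jobs come back on the node, while the batch jobs stay dead, which is the part I am asking about.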