How do nomad task driver plugins actually work?

I’ve written a custom plugin based on the skeleton example, and it all works fine.

However, I still don’t really understand how they actually work, what the lifecycle is, and I can’t really find any good description. I’ve tried reading the code, but so far I’ve failed to gain any real insight.

For example, it’s not clear to me how many plugin processes get instantiated. Do I end up with one plugin executable running for each task that is launched? It seems that multiple copies of my plugin are running, but I’m unsure exactly how they are managed and what’s responsible for what. If I knew there was going to be a one-to-one relationship between task driver executables and tasks, there are things I could do in the plugin directly that I’m otherwise having to do in a separate wrapper executable (the task driver launches a second executable that does some stuff and then launches the actual executable that runs the task).

Is there a good description somewhere?

Thanks.

Hi @tomqwpl :wave:

That’s awesome! I would love to hear more about it if you don’t mind :grinning_face_with_smiling_eyes:

That’s a very good point, and we certainly need to improve our docs on this. Maybe this presentation can help you understand things better?

No. Nomad will start one plugin process on each client that has the plugin installed. That process is then responsible for managing all the tasks for that driver.
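For reference, this is roughly what the skeleton’s entry point looks like: one binary that serves a single driver instance, which Nomad keeps running on the client (the driver package name and import path below are placeholders for your own code):

// main.go of the plugin binary, following the skeleton driver's layout.
package main

import (
    log "github.com/hashicorp/go-hclog"

    "github.com/hashicorp/nomad/plugins"

    "example.com/mydriver" // placeholder import path for your own driver package
)

func main() {
    // Nomad launches this binary once on the client and keeps it running;
    // Serve blocks and handles the go-plugin handshake.
    plugins.Serve(factory)
}

// factory returns a single driver instance. That one instance receives the
// StartTask/StopTask/etc. calls for every task of this driver type.
func factory(l log.Logger) interface{} {
    return mydriver.NewPlugin(l) // placeholder constructor
}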

That’s odd :thinking:

Maybe Nomad was force-quit and didn’t have time to stop the plugin process, so they accumulated over time? And are they actually copies of your plugin process, or perhaps child processes that your plugin creates for each task?

Check out the video I linked and see if it helps. If your plugin is open source, it would be nice to have a link as well (if possible).

Thanks for the link to the video, very useful. It confirms how I expected the plugins to work: there is one instance of the plugin, and it manages all the tasks of that given type.
I’ll have to delve further into what I’m seeing. Perhaps I just got confused by the code while trying to follow how the “exec” stuff works and how it tracks and reconnects to processes when necessary. There are also Nomad log messages where it appears to start the plugin multiple times, but again, I could be misinterpreting things.

I’ll have another play around and see if I can gain any further insight.

Thanks


I think my confusion actually comes from the “executor” framework, in that I think it starts up another copy of my plugin. At the moment I’m unclear what value the executor framework gives me if all I want to do is launch a local executable (I’m not interested in containers and so on; I really just want a raw Go os/exec Cmd interface). We want to make some changes to the way the processes are launched, and so far I can’t work out how it all hangs together.

So I think my question is really about how a task driver plugin and the executor framework used by the skeleton driver work together.

It looks like whenever the task driver launches an executable, it does so by creating an “executor” plugin, and this launches another copy of the plugin executable; the “executor” plugin then ultimately launches the real “workload” executable. That instance of the plugin executable manages only one “workload” executable. If the task driver plugin has to be restarted, it reconnects to the “executor” plugins. I had originally envisaged that this reconnection would be to the “workload” executable itself, if you see what I mean.

At the moment I don’t understand the purpose behind all of this, and I was hoping to find some design docs around it. I’m thinking it would be easier to do what we need by just using os/exec Cmd directly, but I’m sure there must be some reason why it’s done this way that I’m not seeing.
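To make that concrete, here is a rough sketch of what launching a workload with os/exec directly could look like (all names and paths here are placeholders, not anything from Nomad):

// Rough sketch: launch a workload with os/exec directly instead of going
// through the executor framework.
package main

import (
    "fmt"
    "os"
    "os/exec"
)

func launchWorkload(command string, args []string, stdoutPath, stderrPath string) (*exec.Cmd, error) {
    stdout, err := os.OpenFile(stdoutPath, os.O_CREATE|os.O_WRONLY|os.O_APPEND, 0o644)
    if err != nil {
        return nil, fmt.Errorf("opening stdout: %w", err)
    }
    stderr, err := os.OpenFile(stderrPath, os.O_CREATE|os.O_WRONLY|os.O_APPEND, 0o644)
    if err != nil {
        stdout.Close()
        return nil, fmt.Errorf("opening stderr: %w", err)
    }

    cmd := exec.Command(command, args...)
    cmd.Stdout = stdout
    cmd.Stderr = stderr

    if err := cmd.Start(); err != nil {
        stdout.Close()
        stderr.Close()
        return nil, fmt.Errorf("starting workload: %w", err)
    }

    // The caller now owns this process: waiting on it, recording the exit
    // code, and killing it when the task is stopped.
    return cmd, nil
}

func main() {
    cmd, err := launchWorkload("/bin/sleep", []string{"5"}, "/tmp/task.stdout", "/tmp/task.stderr")
    if err != nil {
        fmt.Fprintln(os.Stderr, err)
        os.Exit(1)
    }
    if err := cmd.Wait(); err != nil {
        fmt.Println("workload exited with error:", err)
    }
}

The catch, as far as I can tell, is that these workloads are then direct children of the plugin process itself, so if that process is restarted you need your own way of finding and re-adopting them, which seems to be part of what the separate executor process is for.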

Any further suggestions on this?
I have just run a garbage collection on my nomad client. I have no jobs. Yet I have 9 copies of my task driver executable running.
This feels like it ought not to be the case to me.

That’s strange. The executor doesn’t start any plugin instances; your plugin can call it if you need it, but it doesn’t start any other instances of the plugin. Here’s an example running the default skeleton driver:

The executor provides some Nomad capabilities to you, such as network and filesystem isolation (when supported by the OS), exec functionality, subprocess management, etc.

If you are calling os/exec directly, that might be your problem. You will need to make sure you stop any process your plugin forks.

Not sure if I follow…there should be one process, which is your plugin. It can use the executor library to help you start and manage subprocesses for each allocation that is launched, but it wouldn’t start the plugin itself again.

If the plugin crashes, Nomad will relaunch it and submit the known state back to it, and the plugin will be responsible for reconciling the actual state (for example, the processes that are actually running) with Nomad’s known state.
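As a minimal illustration of that reconciliation (standard library only, not Nomad’s actual reattach machinery), assuming the handle’s driver state recorded the workload PID, the plugin can at least check whether that process still exists before trusting the handed-back state:

// Check whether a process recorded in the task handle's driver state is
// still alive. Illustration only; the skeleton and exec drivers instead
// reattach to their executor process over go-plugin.
package main

import (
    "errors"
    "fmt"
    "os"
    "syscall"
)

// processAlive reports whether a process with the given PID still exists.
// On Unix, sending signal 0 checks existence without delivering a signal.
func processAlive(pid int) bool {
    proc, err := os.FindProcess(pid)
    if err != nil {
        // On Unix FindProcess never fails; handle the error for portability.
        return false
    }
    err = proc.Signal(syscall.Signal(0))
    // nil means the process exists; EPERM means it exists but is owned by
    // another user, which still counts as alive for our purposes.
    return err == nil || errors.Is(err, syscall.EPERM)
}

func main() {
    fmt.Println(processAlive(os.Getpid())) // our own process: true
}

If the recorded process is gone, the plugin would report that task as dead rather than assume the handed-back state is still accurate.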

Would you be able to provide minimal reproduction source code? It’s very hard to see what could be going wrong without looking at the code, unfortunately.

Yes, actually it does.
Since the original post I believe I now have a complete understanding of how this all works, and have indeed rewritten the executor part for our purposes.

In the skeleton there is a line:

exec, pluginClient, err := executor.CreateExecutor(d.logger, d.nomadConfig, executorConfig)

This starts another copy of your task driver plugin to act as a host for an instance of the Executor (follow the code).
Then when you do:

ps, err := exec.Launch(execCmd)

that’s sending a gRPC message to that second process. This second copy of the task driver plugin is responsible for managing that one launched “workload” process. It waits for it to finish, captures the exit code and so on.

So you end up with one copy of your task driver process hosting your actual task driver plugin, then another copy of the task driver process for each “workload” executable that the task driver then launches.

My issue with getting lots of leftover processes is that you need to make sure you call

pluginClient.Kill()

on all error return paths from StartTask, otherwise these other instances of the task driver executable never get terminated.
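As an illustration, a pattern like the following covers every error path with a single deferred cleanup. This is only a sketch: the driver struct, imports and handle bookkeeping are elided, and the types are assumed to match the executor package that the skeleton uses.

// Sketch: a named error return plus one deferred Kill covers every error path.
func (d *Driver) launchWithExecutor(executorConfig *executor.ExecutorConfig, execCmd *executor.ExecCommand) (err error) {
    // Same call as in the skeleton: this starts the second copy of the plugin
    // binary acting as the executor host and returns a client handle to it.
    exec, pluginClient, err := executor.CreateExecutor(d.logger, d.nomadConfig, executorConfig)
    if err != nil {
        return fmt.Errorf("failed to create executor: %w", err)
    }

    // Tear the executor host process down on every error return from this
    // function, not just the Launch failure immediately below.
    defer func() {
        if err != nil {
            pluginClient.Kill()
        }
    }()

    ps, err := exec.Launch(execCmd)
    if err != nil {
        return fmt.Errorf("failed to launch command with executor: %w", err)
    }

    // ... store the task handle, record ps.Pid and the reattach config, etc.
    // Any error returned from here onwards also triggers the deferred Kill.
    _ = ps

    return nil
}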
