Fault injection/chaos tools for Nomad?

Hey folks.

Does anyone know of any work on chaos engineering tools that leverage Nomad? I’m currently building out a Nomad-based platform and want to set up some automated testing of our service availability. I’m interested in using this both in dev/integration testing and potentially in production (but with some serious safeguards, like a human in the loop for all prod event generation).

If it doesn’t exist, is anyone interested in collaborating on building it in an open source project?

JC


:raised_hand:

I have already thought about this. The approach I came up with doesn’t use Nomad’s API but Docker directly (since we only run Docker/container workloads).

A very minimalistic way would be:

docker kill $(docker ps --filter "label=chaos-opt-in" -q | sort --random-sort | head -n 1)

Run it as a periodic batch job with the raw_exec driver.

Any thoughts?

Good idea re labeling things that it’s OK to kill.

I’m currently running exec workloads (we’ll be moving to containers later in the year), so Docker is not yet in the picture. I’m thinking of querying Nomad itself to discover the running jobs and allocations, and writing the whole thing as a Python script that uses python-nomad to access the cluster metadata.
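A rough sketch of that discovery step, for illustration only. It talks to Nomad's HTTP API (`/v1/allocations`) with the standard library instead of python-nomad, so no client method names are guessed. The agent address, the `opted_in_jobs` set, and both helper names are assumptions, not part of any existing tool.

```python
import json
import random
import urllib.request

NOMAD_ADDR = "http://127.0.0.1:4646"  # assumption: local agent, no ACL token or TLS


def fetch_allocations(addr=NOMAD_ADDR):
    """Return the allocation stubs listed by GET /v1/allocations."""
    with urllib.request.urlopen(f"{addr}/v1/allocations") as resp:
        return json.load(resp)


def pick_victim(allocs, opted_in_jobs, rng=random):
    """Pick one running allocation whose job has opted in, or None."""
    candidates = [
        a for a in allocs
        if a.get("ClientStatus") == "running" and a.get("JobID") in opted_in_jobs
    ]
    return rng.choice(candidates) if candidates else None
```

Against a live cluster this would be called as `pick_victim(fetch_allocations(), {"web"})`, where `"web"` stands in for whatever jobs have opted in to chaos testing.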

This still leaves the question of how to terminate the processes running within an alloc once it is picked. Once we have containers, a variation on your approach should work. But there should probably be a plugin interface that allows one or more termination mechanisms per task driver type.
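One possible shape for such a plugin interface, as a sketch rather than a finished design: a registry that maps a task driver name to a function returning the command that would terminate the target. The `target` dict and its `container_id` and `pid` keys are hypothetical, and how they get populated is left open. This version holds a single mechanism per driver; storing a list per driver would allow several.

```python
from typing import Callable, Dict, List

# A terminator takes a description of the target and returns an argv list.
Terminator = Callable[[dict], List[str]]
TERMINATORS: Dict[str, Terminator] = {}


def terminator(driver: str):
    """Decorator that registers a termination mechanism for a task driver."""
    def register(fn: Terminator) -> Terminator:
        TERMINATORS[driver] = fn
        return fn
    return register


@terminator("docker")
def kill_container(target: dict) -> List[str]:
    # Same mechanism as the docker kill one-liner earlier in the thread.
    return ["docker", "kill", target["container_id"]]


@terminator("exec")
def kill_process(target: dict) -> List[str]:
    # Assumes the tool runs on the client node and already knows the PID.
    return ["kill", "-KILL", str(target["pid"])]


def termination_command(driver: str, target: dict) -> List[str]:
    """Look up the mechanism for a driver and fail loudly if none exists."""
    if driver not in TERMINATORS:
        raise ValueError(f"no termination mechanism for driver {driver!r}")
    return TERMINATORS[driver](target)
```

Returning the command instead of running it keeps the human-in-the-loop option open: the tool can print what it intends to do and only pass the list to `subprocess.run` after approval.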

I’m going to prototype the above and will share a link to the repo here once I have it going.

I am actually one of the maintainers of python-nomad, even though I haven’t had much time for it lately. That said, Python is totally fine for me as well.

Hi @j.c, nice to hear from you!

This is something that we have discussed internally. It would be great to see whatever you prototype and we would love to chat and discuss this further.

I am curious whether the alloc exec API would help with the open question of terminating processes running within an allocation?

Thanks,
jrasell and the Nomad team


Haven’t tried it myself, but there’s nomad alloc signal (requires Nomad 0.9.2 or later).

Sending a SIGKILL should probably do the trick.
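For a script, the same thing can probably be done over HTTP. A minimal sketch, assuming the allocation signal endpoint (`POST /v1/client/allocation/:alloc_id/signal` with `Signal` and `Task` in the JSON body) behaves as described in the Nomad API docs for your version. The helper name and the example values are made up, and ACL tokens and TLS are ignored.

```python
import json
import urllib.request


def build_signal_request(addr, alloc_id, task, signal="SIGKILL"):
    """Build (but do not send) a POST to the allocation signal endpoint."""
    body = json.dumps({"Signal": signal, "Task": task}).encode()
    return urllib.request.Request(
        f"{addr}/v1/client/allocation/{alloc_id}/signal",
        data=body,
        method="POST",
        headers={"Content-Type": "application/json"},
    )
```

Passing the result to `urllib.request.urlopen` would actually send the signal, so that call is the natural place to put a confirmation prompt.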

I built a minimal fault injection tool called “chaotic” and implemented Nomad support: ngine-io/chaotic on GitHub (“Chaos for Clouds”).


Find an example nomad periodic job in ngine / docker-images / chaotic · GitLab