Fault injection/chaos tools for Nomad?

j.c · May 19, 2021, 7:22pm

Hey folks.

Does anyone know of any work on chaos engineering tools that leverage Nomad? I’m currently building out a Nomad-based platform and want to set up some automated testing of our service availability. I’m interested in using this both in dev/integration testing and potentially in production (but with some serious safeguards, like a human in the loop for all prod event generation).

If it doesn’t exist, is anyone interested in collaborating on building it in an open source project?

JC

resmo · May 21, 2021, 12:49pm

I have already thought about this. The approach I came up with doesn’t use Nomad’s API but Docker directly (since we only run Docker container workloads).

A very minimalistic way would be:

docker kill $(docker ps --filter "label=chaos-opt-in" -q | sort --random-sort | head -n 1)

Running it as a periodic batch job with the raw_exec driver.
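For anyone who prefers a script over the one-liner, here is a minimal Python sketch of the same idea. It assumes the `docker` CLI is on the PATH and reuses the `chaos-opt-in` label from above; the function names are made up for illustration. Unlike the shell version, it does nothing when no container has opted in, instead of calling `docker kill` without arguments.

```python
import random
import subprocess

OPT_IN_LABEL = "chaos-opt-in"  # label name taken from the one-liner above


def list_opt_in_containers():
    """Return the IDs of running containers that carry the opt-in label."""
    result = subprocess.run(
        ["docker", "ps", "--filter", f"label={OPT_IN_LABEL}", "-q"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.split()


def pick_victim(container_ids, rng=random):
    """Pick one container ID at random, or None if nothing opted in."""
    if not container_ids:
        return None
    return rng.choice(list(container_ids))


def kill_random_container():
    """Kill one random opted-in container; do nothing if there is none."""
    victim = pick_victim(list_opt_in_containers())
    if victim is not None:
        subprocess.run(["docker", "kill", victim], check=True)
    return victim
```

Calling `kill_random_container()` from a periodic batch job would then replace the shell pipeline.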

Any thoughts?

j.c · May 21, 2021, 2:22pm

Good idea re labeling things that it’s OK to kill.

I’m currently running exec workloads - we’ll be moving to containers later in the year - so Docker is not yet in the picture. I’m thinking of querying Nomad itself to discover the running jobs and allocations, and writing the whole thing as a Python script that uses python-nomad to access the cluster metadata.
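A rough sketch of the discovery step, under a few assumptions: it talks to the Nomad HTTP API (`GET /v1/allocations`) with the standard library instead of python-nomad to stay self-contained, it assumes a local agent on the default address with no ACL token, and the opt-in mechanism (an allow-list of job IDs) is purely hypothetical.

```python
import json
import random
import urllib.request

NOMAD_ADDR = "http://127.0.0.1:4646"  # assumption: default local agent address


def fetch_allocations(addr=NOMAD_ADDR):
    """Fetch the allocation list from the Nomad HTTP API."""
    with urllib.request.urlopen(f"{addr}/v1/allocations") as resp:
        return json.load(resp)


def eligible_allocations(allocs, opted_in_jobs):
    """Keep running allocations whose job is on the (hypothetical) opt-in list."""
    return [
        a for a in allocs
        if a.get("ClientStatus") == "running" and a.get("JobID") in opted_in_jobs
    ]


def pick_allocation(allocs, opted_in_jobs, rng=random):
    """Pick one eligible allocation at random, or None if there is none."""
    candidates = eligible_allocations(allocs, opted_in_jobs)
    return rng.choice(candidates) if candidates else None
```

Usage would look like `pick_allocation(fetch_allocations(), {"my-job"})`, where `my-job` is a placeholder job ID.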

This still leaves the question of how to terminate the processes running within an alloc once it is picked. Once we have containers, a variation on your approach should work. But there should probably be a plugin interface that allows one or more termination mechanisms per task driver type.
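One possible shape for such a plugin interface, as a sketch only: a registry that maps a task driver name to one or more terminator callables. All names here (`register`, `terminate`, the `log_only` plugin) are hypothetical and not part of Nomad or python-nomad.

```python
import random
from typing import Callable, Dict, List

# A terminator receives the allocation ID and the task name.
Terminator = Callable[[str, str], None]

_TERMINATORS: Dict[str, List[Terminator]] = {}


def register(driver: str):
    """Decorator that registers a termination mechanism for a task driver."""
    def decorator(func: Terminator) -> Terminator:
        _TERMINATORS.setdefault(driver, []).append(func)
        return func
    return decorator


def terminators_for(driver: str) -> List[Terminator]:
    """Return a copy of the mechanisms registered for a driver."""
    return list(_TERMINATORS.get(driver, []))


def terminate(driver: str, alloc_id: str, task: str, rng=random) -> None:
    """Run one randomly chosen mechanism registered for the driver."""
    options = terminators_for(driver)
    if not options:
        raise LookupError(f"no termination mechanism for driver {driver!r}")
    rng.choice(options)(alloc_id, task)


@register("exec")
def log_only(alloc_id: str, task: str) -> None:
    """Placeholder plugin: only reports what it would terminate."""
    print(f"would terminate task {task} in allocation {alloc_id}")
```

A Docker-specific mechanism could later be added with `@register("docker")` without touching the selection logic.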

I’m going to prototype the above and will share a link to the repo here once I have it going.

resmo · May 21, 2021, 2:36pm

I am actually one of the maintainers of python-nomad, even though I haven’t had much time for it lately. That said, Python is totally fine for me as well.

jrasell · May 21, 2021, 2:41pm

Hi @j.c, nice to hear from you!

This is something that we have discussed internally. It would be great to see whatever you prototype and we would love to chat and discuss this further.

I am curious if the alloc exec API would be useful to the open question of terminating processes running within an allocation?

Thanks,
jrasell and the Nomad team

fhemberger · May 21, 2021, 2:42pm

Haven’t tried it myself, but there’s nomad alloc signal (requires Nomad 0.9.2 or later).

Sending a SIGKILL should probably do the trick.
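An untested sketch of how a Python script might wrap that command. It assumes the `nomad` binary is on the PATH and that `-s` selects the signal; the helper names and the `dry_run` default are my own additions, the latter as a small nod to the human-in-the-loop safeguard mentioned earlier.

```python
import subprocess


def build_signal_command(alloc_id, signal="SIGKILL", task=None):
    """Build the argv for `nomad alloc signal`; the task name is optional."""
    cmd = ["nomad", "alloc", "signal", "-s", signal, alloc_id]
    if task:
        cmd.append(task)
    return cmd


def send_signal(alloc_id, signal="SIGKILL", task=None, dry_run=True):
    """Send the signal, or only print the command when dry_run is set."""
    cmd = build_signal_command(alloc_id, signal, task)
    if dry_run:
        print("would run:", " ".join(cmd))
    else:
        subprocess.run(cmd, check=True)
    return cmd
```

With `dry_run=True` nothing is executed, so the command can be reviewed before anything is actually signalled.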

resmo · June 9, 2021, 7:30am

I built a minimal fault injection tool called “chaotic” and implemented Nomad support: GitHub - ngine-io/chaotic: Chaos for Clouds

resmo · June 16, 2021, 7:14am

You can find an example Nomad periodic job in ngine / docker-images / chaotic · GitLab
