Does anyone know of any chaos engineering tools that leverage Nomad? I’m currently building out a Nomad-based platform and want to set up some automated testing of our service availability. I’m interested in using this both in dev/integration testing and potentially in production (but with some serious safeguards, such as requiring a human in the loop for every prod event that gets generated).
If it doesn’t exist, is anyone interested in collaborating on building it in an open source project?
Good idea re labeling things that it’s OK to kill.
I’m currently running exec workloads - we’ll be moving to containers later in the year - so Docker is not yet in the picture. I’m thinking of interrogating Nomad itself to discover the running jobs and allocations, and writing the whole thing as a Python script that uses python-nomad to access the cluster metadata.
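A rough sketch of what that discovery step might look like, assuming python-nomad's `allocations.get_allocations()` endpoint and the `ID`, `JobID`, and `ClientStatus` fields that Nomad returns in its allocation list. The function names, the sample data, and the idea of passing in a set of opted-in job IDs are made up for illustration; how that set gets populated (for example from a label on the job) is left open here.

```python
import random
from typing import Iterable, Optional


def fetch_allocations(host: str = "127.0.0.1") -> list:
    """Return allocation stubs from the cluster.

    Needs python-nomad installed and a reachable Nomad agent.
    """
    import nomad  # imported lazily so the selection logic runs without the library

    client = nomad.Nomad(host=host)
    return client.allocations.get_allocations()


def pick_victim(
    allocs: Iterable[dict],
    opted_in_jobs: set,
    rng: Optional[random.Random] = None,
) -> Optional[dict]:
    """Pick one running allocation at random from jobs that have opted in.

    `opted_in_jobs` is a hypothetical allow-list of job IDs; nothing outside
    it is ever returned.
    """
    rng = rng or random.Random()
    candidates = [
        a for a in allocs
        if a.get("ClientStatus") == "running" and a.get("JobID") in opted_in_jobs
    ]
    return rng.choice(candidates) if candidates else None


# Hand-written sample data shaped like Nomad allocation stubs (not real output).
SAMPLE_ALLOCS = [
    {"ID": "a1", "JobID": "web", "ClientStatus": "running"},
    {"ID": "a2", "JobID": "web", "ClientStatus": "complete"},
    {"ID": "a3", "JobID": "db", "ClientStatus": "running"},
]

victim = pick_victim(SAMPLE_ALLOCS, opted_in_jobs={"web"})
```

Against a real cluster the last line would become something like `pick_victim(fetch_allocations(), opted_in_jobs={...})`. Keeping the selection separate from the API call means it can be tested without a cluster.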
This still leaves the question of how to terminate the processes running within an alloc once it is picked. Once we have containers, a variation on your approach should work. But there should probably be a plugin interface that allows one or more termination mechanisms per task driver type.
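One possible shape for that plugin interface, as a minimal sketch: a registry keyed by task driver name, where each plugin returns the command it would run rather than running it. Everything here (`register`, `terminators_for`, `signal_kill`) is a hypothetical name, and the one example plugin assumes the `nomad alloc signal` CLI command with its `-s` flag is available on the machine running the script.

```python
from typing import Callable, Dict, List

# A terminator takes (alloc_id, task_name) and returns the argv it would run.
Terminator = Callable[[str, str], List[str]]

# Hypothetical registry: task driver name -> terminators registered for it.
_TERMINATORS: Dict[str, List[Terminator]] = {}


def register(driver: str) -> Callable[[Terminator], Terminator]:
    """Decorator that registers a terminator for one task driver type."""
    def decorator(fn: Terminator) -> Terminator:
        _TERMINATORS.setdefault(driver, []).append(fn)
        return fn
    return decorator


def terminators_for(driver: str) -> List[Terminator]:
    """Return the terminators registered for a driver (empty list if none)."""
    return list(_TERMINATORS.get(driver, []))


@register("exec")
def signal_kill(alloc_id: str, task: str) -> List[str]:
    # Assumption: the `nomad` CLI is on PATH and can reach the cluster.
    return ["nomad", "alloc", "signal", "-s", "SIGKILL", alloc_id, task]


# Build (but do not run) the command for a made-up allocation and task.
planned = [t("abc123", "web") for t in terminators_for("exec")]
```

Returning the command instead of executing it leaves room for a dry-run mode or a human approval step before anything is actually killed, for example by passing each approved entry to `subprocess.run`.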
I’m going to prototype the above and will share a link to the repo here once I have it going.
This is something that we have discussed internally. It would be great to see whatever you prototype and we would love to chat and discuss this further.
I am curious whether the alloc exec API would help with the open question of terminating processes running within an allocation.