The Nomad security model may help you figure out your requirements.
ACLs and mTLS encryption are definitely a must for any production environment, especially so in your case since job operators should be considered a potential threat vector.
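As a rough sketch, enabling both in the agent configuration could look something like this. The certificate paths are placeholders I made up, and you will likely need more than this depending on your setup, so treat it as a starting point rather than a complete config:

```hcl
# Agent configuration fragment (applies to servers and clients).
# The file paths below are placeholders, not real defaults.

acl {
  enabled = true
}

tls {
  # Encrypt both the HTTP API and the RPC traffic between agents.
  http = true
  rpc  = true

  ca_file   = "/etc/nomad.d/tls/nomad-ca.pem"
  cert_file = "/etc/nomad.d/tls/agent.pem"
  key_file  = "/etc/nomad.d/tls/agent-key.pem"

  # Require valid certificates in both directions (the "m" in mTLS).
  verify_server_hostname = true
  verify_https_client    = true
}
```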
In a nutshell, clients forward job registration requests to the leader server, which then schedules the job in its internal state. Clients then poll the servers for information on what they should run.
So the servers never run any user-provided code; all allocations run on clients. But jobs can still interfere with other jobs running on the same client. For example, a malicious alloc could run a Docker privilege escalation exploit, or even just a crypto miner that uses up all computing resources.
To prevent this, you should use namespaces to isolate jobs and permissions. This means that each of your customers will get their own namespace, and so will only be able to interact with jobs in that namespace.
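For illustration, a per-customer ACL policy could be as small as the sketch below. The namespace name `customer-a` is hypothetical, and you may prefer a narrower set of capabilities than the blanket `write` policy:

```hcl
# ACL policy for a single customer.
# "customer-a" is a hypothetical namespace name.
namespace "customer-a" {
  policy = "write"
}
```

A token that only has this policy attached has no rules for any other namespace, so it should not be able to read or submit jobs outside of `customer-a`.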
But namespaces don’t restrict where jobs actually run. For this, you will need to configure the agent running at the customer site with metadata and restrict their jobs to only run there.
You can do this with datacenters, but these are usually human-readable values that can be easily guessed (if their assigned datacenter is `dc1`, it’s likely that `dc2` also exists). I think Sentinel, as @DerekStrickland mentioned, could be used to restrict this, but that doesn’t work for you as an OSS user.
Another option would be to generate a random value, like an API key of sorts, and configure it as a client `meta` value in their agent. Then use a `constraint` with a client meta attribute to isolate clients. The problem with this approach is that `constraint`s are optional fields, so customers could easily bypass them by leaving the constraint out. Again, Sentinel would be helpful here.
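To make the meta approach more concrete, here is a minimal sketch. The key name `tenant_key` and the placeholder value are my own invention, and the two parts live in different files:

```hcl
# --- Client agent configuration at the customer site ---
client {
  enabled = true

  meta {
    # Hypothetical key name; replace the value with your own random string.
    tenant_key = "REPLACE-WITH-RANDOM-VALUE"
  }
}

# --- Jobspec fragment (separate file) ---
job "example" {
  # Only place this job on clients whose meta value matches.
  constraint {
    attribute = "${meta.tenant_key}"
    value     = "REPLACE-WITH-RANDOM-VALUE"
  }

  # ... groups and tasks go here ...
}
```

Nothing forces the customer to include that `constraint` block, which is exactly the weakness described above.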
So a hybrid approach would probably be the best OSS option: a `datacenter` configuration in the clients with a randomly generated key. `datacenters` is a required field in the jobspec, so they wouldn’t be able to bypass it. They could try to guess other keys, but if the keys are long enough, that could take a long time. Placing the Nomad API behind a proxy with rate-limiting could also help mitigate this kind of abuse.
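A sketch of what that hybrid setup could look like is below. The datacenter value is a placeholder, and this assumes you are running a Nomad version where `datacenters` is mandatory in the jobspec, so it’s worth verifying that against the version you actually run:

```hcl
# --- Client agent configuration at the customer site ---
# Placeholder value; use a long, randomly generated string per customer.
datacenter = "dc-REPLACE-WITH-RANDOM-VALUE"

# --- Jobspec fragment (separate file) ---
job "example" {
  # The job is only placed on clients in the listed datacenters.
  datacenters = ["dc-REPLACE-WITH-RANDOM-VALUE"]

  # ... groups and tasks go here ...
}
```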
For complete isolation, @resmo’s idea of using multiple federated clusters could be a good one. Each customer would be assigned a different `region`, and you could join the regions together for easier management. For more details, check out this tutorial: Multi-Region Federation | Nomad - HashiCorp Learn
Each Nomad region is completely isolated from the others. If you join them, only the servers will communicate with each other to forward messages around. In this topology you would also want to make sure ACL tokens are not propagated, so that each customer token only works in their region.
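As an illustration, the server configuration for one customer’s cluster might start out like this. The region and datacenter names are hypothetical, and the server count is just a common choice:

```hcl
# Server agent configuration for one customer's cluster.
# "customer-a" and "site-1" are hypothetical names.
region     = "customer-a"
datacenter = "site-1"

server {
  enabled          = true
  bootstrap_expect = 3
}
```

As far as I know, tokens stay local to the region they were created in unless you pass the `-global` flag to `nomad acl token create`, but please double-check that against the ACL docs for your version.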
The downside of this approach is the additional overhead of maintaining multiple micro-clusters.
I hope this gives you some insights. Let me know if you have any questions left.