Deploy agent on customer site

I’m working with heavy computational workloads (ML/AI), and our clients have an interest in keeping their own data on their servers. So I’m wondering whether it would be possible to deploy a Nomad agent on the customer’s side without opening up any form of vulnerability in terms of “malicious use” from the client side.

I have looked through Secure Nomad with Access Control | Nomad - HashiCorp Learn, but it seems to be more about restrictions on the user than on the node, and I’m not familiar enough with Nomad to say exactly what an agent can see and do.

Hi @joh4n

So I asked around and from what I have gathered we do have customers doing it. So it is possible.

The key takeaway I have for you is that limiting access is going to have to be done very thoughtfully. It’s going to be a combination of Namespaces, ACLs, and token management. You will need to think through your threat model pretty deeply and define what malicious activity means. If you have additional details around that you want to think through, I’m happy to help.

Out of curiosity, are you an open source user or an enterprise customer? Sentinel is another layer that might be able to help, but is enterprise only.

@DerekStrickland
At the moment we’re an open source user; we are currently trying to figure out how to use it for our machine learning workloads. This would be great functionality for us, but it is something we would look into implementing in a few months, after we get the basic setup done first. Still, I find it important to understand how to use Nomad correctly so we can support this in the future, and not make bad design decisions in the initial setup but instead have a plan for how it should look.

I totally understand, and kudos to you for thinking ahead. I was just thinking Sentinel might have some value, and also that if you were already an enterprise customer, you might be able to leverage our SE team to help in real time rather than having to depend on async discuss forums.

That said, I’m confident users are running isolated on-prem nodes that can connect back to the primary DC servers. I do think there are a lot of case specific questions to be answered, and having a deep understanding of the ACL system is going to be a requirement.

Feel free to reach out if or when you have specific questions. This kind of edge computing should be where Nomad shines, and we’d really like you to be successful.

Cheers

  • Derek

I wonder if the concept of federation would be a good fit.

Hi @joh4n :wave:

The Nomad security model may help you figure out your requirements.

ACLs and mTLS encryption are definitely a must for any production environment, especially so in your case since job operators should be considered a potential threat vector.
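As a rough sketch (the certificate file paths are just placeholders for certs you generate yourself), enabling both on the agents looks something like this:

```hcl
# Agent configuration sketch enabling ACLs and mTLS.
acl {
  enabled = true
}

tls {
  http = true
  rpc  = true

  ca_file   = "/etc/nomad.d/tls/nomad-ca.pem"
  cert_file = "/etc/nomad.d/tls/agent.pem"
  key_file  = "/etc/nomad.d/tls/agent-key.pem"

  # Require properly named server certificates and client certificates
  # on the HTTPS API, so only trusted parties can talk to the agent.
  verify_server_hostname = true
  verify_https_client    = true
}
```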

In a nutshell, clients forward job registration requests to the leader server, which then schedules the job in its internal state. Clients then poll the servers for information on what they should run.

So the servers will never run any user-provided code; all allocations always run on clients. But you can have jobs interfering with other jobs running on the same client. For example, a malicious alloc that runs a Docker privilege escalation exploit, or even just a crypto miner that uses up all the computing resources.

To prevent this, you should use namespaces to isolate jobs and permissions. This means that each of your customers will get their own namespace, and so will only be able to interact with jobs in that namespace.
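As a sketch (the namespace name `customer-a` is just a placeholder), the ACL policy attached to a customer’s token could grant access only to their namespace and deny node and agent access:

```hcl
# customer-a-policy.hcl -- namespace name is a placeholder
namespace "customer-a" {
  policy = "write"
}

# No other namespaces are granted; node and agent access are denied so the
# token cannot inspect other clients or the agents themselves.
node {
  policy = "deny"
}

agent {
  policy = "deny"
}
```

You would wire this up with something like `nomad namespace apply customer-a`, `nomad acl policy apply customer-a customer-a-policy.hcl`, and `nomad acl token create -policy=customer-a`.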

But namespaces don’t restrict where jobs actually run. For this, you will need to configure the agent running at the customer site with metadata and restrict their jobs to only run there.

You can do this with datacenters, but these are usually human-readable values that can be easily guessed (if their assigned datacenter is dc1 it’s likely that dc2 also exists :sweat_smile:). I think Sentinel, as @DerekStrickland mentioned, could be used to restrict this, but that doesn’t work for you as an OSS user.

Another option would be to generate a random value, like an API key of sorts, and configure it as a client meta value in their agent. Then use a constraint on that client meta attribute to isolate clients. The problem with this approach is that constraints are optional fields, so they could easily be bypassed. Again, Sentinel would be helpful here :sweat_smile:
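To illustrate (the meta key, the random value, and the job details are all made up), the customer’s client agent would carry the value and the job would constrain on it:

```hcl
# client.hcl on the customer's node -- key name and value are placeholders
client {
  enabled = true

  meta {
    "customer_key" = "7f9c2ba4e88f827d616045507605853e"
  }
}
```

```hcl
# Jobspec sketch -- placement only matches nodes carrying the same meta value.
job "train-model" {
  datacenters = ["dc1"]

  # Optional block: omitting it lets the job land on any eligible node,
  # which is exactly the bypass problem described above.
  constraint {
    attribute = "${meta.customer_key}"
    value     = "7f9c2ba4e88f827d616045507605853e"
  }

  group "train" {
    task "train" {
      driver = "docker"

      config {
        image = "registry.example.com/customer-a/train:latest"
      }
    }
  }
}
```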

So a hybrid approach would probably be the best OSS option: a datacenter configuration in the clients with a randomly generated key. datacenters is a required field in the jobspec, so they wouldn’t be able to bypass it. They could try to guess other keys, but if the keys are long enough it would take a long time. Placing the Nomad API behind a proxy with rate-limiting could also help mitigate this kind of abuse.
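As a rough example of this hybrid approach (the random datacenter name is made up), the customer’s agent would be configured with the opaque datacenter name, which effectively acts as a shared secret:

```hcl
# client.hcl on the customer's node -- the datacenter name is a
# randomly generated placeholder value.
datacenter = "cust-7f9c2ba4e88f827d"

client {
  enabled = true
}
```

Their jobspecs would then have to set `datacenters = ["cust-7f9c2ba4e88f827d"]`, and since the field is required they can’t simply leave it out.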

For complete isolation, @resmo’s idea of using multiple federated clusters could be a good one. Each customer would be assigned a different region, and you could join the regions together for easier management. For more details, check out this tutorial: Multi-Region Federation | Nomad - HashiCorp Learn

Each Nomad region is completely isolated from the others. If you join them, only the servers will communicate among themselves to forward messages around. In this topology you would also want to make sure ACL tokens are not propagated, so that each customer token only works in their region.
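For example (the region name and server address are placeholders), each customer’s servers would run with their own region, and customer tokens would be created as local tokens (the default) so they aren’t replicated to other regions:

```hcl
# server.hcl for a customer's micro-cluster -- region name is a placeholder
region     = "customer-a"
datacenter = "dc1"

server {
  enabled          = true
  bootstrap_expect = 1
}
```

Once it’s up, something like `nomad server join <primary-region-server>:4648` federates it with your main region, and `nomad acl token create -policy=customer-a` (without `-global`) issues a token that only works in that customer’s region.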

The downside of this approach is the additional overhead of maintaining multiple micro-clusters.

I hope this gives you some insights. Let me know if you have any questions left :slightly_smiling_face: