Theorycrafting: is it better to have one large cluster, or many smaller ones?

benvanstaveren · November 9, 2022, 1:59pm

So. I need the collective wisdom of the community because this question has arisen. Following a company reorganisation (we split off from a monolith into a holding with multiple separate groups) it’s time to clear out some technical debt, and that has led to, well, the question above.

What is considered “best practice”? The situation is as follows: we currently run 4 clusters that are based not so much on “region” as on the jobs they perform. While this works, it doesn’t translate well to going actual physical multi-region. I’m edging towards the case where we run 1 cluster tied to the actual region we operate in, and use some node meta keys to separate out nodes dedicated to each daughter company. Our CTO is on board with that but also suggested perhaps giving each daughter company their own cluster.

I’m also okay with that; separation of concerns and such. But then again there are also namespaces. And now we’ve reached an impasse where both the CTO and I are like “well, they both work”. And each has its own pros and cons.

So help a brother out here and spin me a yarn about what you think would be the way to do this.
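
For readers following along, here is a minimal sketch of the "one cluster, node meta keys per daughter company" option described above. It builds the constraint in the shape Nomad's JSON job API uses (`LTarget`, `RTarget`, `Operand`). The meta key name `company`, the value `daughter-a`, and the job ID are assumptions made up for illustration; nothing in the thread says which keys are actually in use.

```python
import json


def company_constraint(company, meta_key="company"):
    """Build a constraint that matches nodes whose client meta key
    (hypothetical name: "company") equals the given daughter company."""
    return {
        "LTarget": "${meta." + meta_key + "}",
        "RTarget": company,
        "Operand": "=",
    }


def pin_job_to_company(job, company):
    """Return a copy of the job dict with the company constraint appended.
    The input dict is left untouched."""
    pinned = dict(job)
    existing = list(job.get("Constraints") or [])
    pinned["Constraints"] = existing + [company_constraint(company)]
    return pinned


if __name__ == "__main__":
    # Hypothetical job stub; a real job needs task groups, tasks, etc.
    job = {"ID": "billing-api", "Datacenters": ["dc1"]}
    print(json.dumps(pin_job_to_company(job, "daughter-a"), indent=2))
```

A constraint like this only steers placement. It does not stop one company's operators from submitting jobs aimed at another company's nodes, which is where namespaces and ACLs would come in.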

nathan · November 9, 2022, 2:48pm

Howdy.

I suspect we’ll get a comment from the Nomad team so I’ll leave a lot of the real technical analysis aside for the time being.

I’m not a huge fan of the term “best practices” for precisely these sorts of scenarios, where there’s a fair bit of nuance to take into account relating to the immediate and long-term relationship between the parent org and subsidiaries. SLAs/SLOs/etc. for each of the in-scope workloads also figure in. I much prefer “field patterns”. Hope that doesn’t come off as pedantic, but I’m trying to emphasize that the designs are usually tuned to the specific business environment.

In any case, two things:

1. Have you considered federating? (As opposed to MR.) The way I read your question I think you have/are. Which from where I sit right now is probably the best way forward.
2. You are doing some pretty advanced thinking about your infra design which might imply you are an Enterprise customer? The question you are asking is probably a bit higher level than a straight Support ticket but if you are a customer your Customer Success Manager (CSM) may be able to hook you up with a Nomad field specialist (technically the title is Customer Success Architect).

Ok, with that out of the way, I’m going to monitor this discussion and see if someone from the Nomad team chimes in.

benvanstaveren · November 9, 2022, 3:36pm

I’m stealing “field patterns” for sure. And really there is no such thing as pedantic when it comes to things like this.

So, yeah, we are federating 4 clusters right now. It’s not an ideal setup since it sort of grew organically over time because we were in that good old “we’ll do it right later” mode. Everyone knows how that ends.

Currently not enterprise customers, on account of our company and its associated budget being smallish enough that the outlay isn’t something I can convince the beancounters of. Maybe some time, but not just yet.

The reason I’m thinking about the infra right now is that when I joined the company almost a decade ago nobody had really thought about it, and we had a lot of technical debt that we couldn’t clear because there were only 2 of us devops types around, and for legacy reasons we couldn’t just burn it all to the ground and start fresh.

Fortunately (well, from a certain point of view), right now is the right time to burn it all to the ground and start fresh because I’ve got about a month and a half to re-arrange infra to the point it’s a good foundation for the next few years. Yeah, lofty goal, but… keeps the job entertaining.

I’ll have another think and some doodling, perhaps a beer or two. Guess I succeeded in asking a complicated question.

nathan · November 9, 2022, 4:00pm

Understood on the budget thing. Given the nuance of your question, I was hoping we could move the discussion to direct engagement to reduce the info loss inherent to forums, and also because I thought it might let us talk more openly about the business plans which will inform the design.

Thanks for confirming on the federation side of things.

If you are comfortable sharing details as to how you currently automate deploying your workloads that might be helpful information (ex: direct access to Nomad CLI/API by teams, via SNOW, via CI/CD / GitOps, etc).

I’ll be back tomorrow to check on things.

benvanstaveren · November 9, 2022, 4:34pm

We have a frontend that I put together that lets people basically point-and-click their deployments; the initial setup consists of manually setting up various bits and pieces in Vault (db creds, policies, KV access, etc.) and either using a pre-cooked job template or customising something if someone needs a sidecar. It’s dumb in the sense that it just talks to 1 cluster and courtesy of federation, things just magically work out.

After that people can see the status of the allocation, and they can update “their” task with a new image; the image list gets pulled out of our docker repository. There’s also the option for them to generate a token that can be used by a CI/CD job (gitlab, on prem) to trigger an update.

It also automatically sets up a prod/dev environment, as in, CI/CD can only trigger image updates on the dev job; once that’s all considered tested and done with, the project owner (or whoever they have given the permissions to) can then “promote” the dev job to prod.

Done that way because, while our developers are very much capable of developing, when I explained Nomad jobs to them there were lots of glazed eyes and deer-in-headlights looks, even some wailing and gnashing of teeth. So I figured okay, I’ll provide the easiest/simplest UI that lets them do their thing, and I keep my grubby fingers in the Nomad pie all the way because eh, why not.
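
The dev/prod rule described above (CI/CD may only touch the dev job, and only the owner or a delegate may promote) can be modelled roughly like this. It is a sketch of the rule as stated in the post, not the actual frontend: the `Project` class, its field names, and the image tags are all hypothetical, and a real implementation would submit Nomad jobs rather than mutate a dict.

```python
from dataclasses import dataclass, field


@dataclass
class Project:
    name: str
    # Users allowed to promote dev to prod (owner plus delegates).
    promoters: set = field(default_factory=set)
    # Currently deployed image per environment.
    images: dict = field(default_factory=lambda: {"dev": None, "prod": None})


def ci_update(project, env, image):
    """Image update triggered with a CI/CD token: only dev is allowed."""
    if env != "dev":
        raise PermissionError("CI/CD tokens may only update the dev job")
    project.images["dev"] = image


def promote(project, user):
    """Copy the tested dev image to prod, if the user may promote."""
    if user not in project.promoters:
        raise PermissionError(f"{user} may not promote {project.name}")
    if project.images["dev"] is None:
        raise ValueError("nothing deployed to dev yet")
    project.images["prod"] = project.images["dev"]
    return project.images["prod"]


if __name__ == "__main__":
    p = Project("billing-api", promoters={"alice"})
    ci_update(p, "dev", "registry.example/billing-api:1.4.2")
    print(promote(p, "alice"))
```

Keeping promotion as a copy of whatever dev currently runs means prod can only ever receive an image that has already been deployed to dev.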

One thing that does need saying: currently there are no ACLs in place. That’s one reason to perhaps set it up “from scratch” as it were, because bootstrapping ACLs on a running cluster (let alone 4) is looking like it could go wrong in very interesting ways.
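
If the fresh setup ends up with ACLs plus one namespace per daughter company, the per-company policies could be generated rather than hand-written. The sketch below renders a Nomad ACL policy document that grants a coarse policy level on a single namespace. The one-namespace-per-company layout and the names used are assumptions for illustration only; the thread has not settled on namespaces, separate clusters, or something else.

```python
# Subset of Nomad's coarse-grained namespace policy levels used in this sketch.
ALLOWED_LEVELS = {"deny", "read", "write"}


def namespace_policy(namespace, level="write"):
    """Render an ACL policy document granting `level` on one namespace."""
    if level not in ALLOWED_LEVELS:
        raise ValueError(f"unsupported policy level: {level}")
    return f'namespace "{namespace}" {{\n  policy = "{level}"\n}}\n'


if __name__ == "__main__":
    # Hypothetical daughter-company namespaces.
    for company in ("daughter-a", "daughter-b"):
        print(namespace_policy(company))
```

Each rendered document could be saved to a file and loaded with `nomad acl policy apply` once ACLs have been bootstrapped on the new cluster.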
