How big is too big? Advice wanted for large environments

I’m looking for feedback on a large production environment and the quantity of resources my Terraform configuration will provision.

Imagine an N-tier application, with a large number of VMs and related resources. 300 to 500 VMs is a reasonable estimate in terms of scale, and with them come supporting resources like disk encryption sets, data disks, ASGs, NSGs, availability sets, PPGs, etc. The plan output is around 50k lines currently for a ground-up build. This footprint needs to be deployed at this scale in each region we operate in. Ground-up reprovisioning is not something we’d do frequently, only to accommodate major migrations. Otherwise, the plan/apply will be done to manage configuration and to scale up/down as required, and won’t be provisioning thousands of resources each run.

Currently, the Terraform for this consists of about 15 different modules, with one parent module invoking the rest as child modules (and children of child modules). The modules are configuration-driven per environment, per region, using a simple root module with a lot of locals that define the environment of a given region. This simple per-environment root module invokes the main child module, which in turn invokes all the other child modules.

I’m using separate repos: the locals and simple root modules live in /env/some-env-name/, and the primary child module (and all its children) are in their own repo to make module version dependencies possible.
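For concreteness, a per-environment root module along these lines might look roughly like the sketch below. The repo URL, tag, and every variable name are hypothetical, just to illustrate pinning the product module from its own repo:

```hcl
# env/prod-eastus/main.tf -- hypothetical per-environment root module

module "product" {
  # pin the product module to a tagged release of its own repo
  source = "git::https://example.com/org/module-repo.git?ref=v1.4.0"

  environment = "prod"
  region      = "eastus"

  # environment config would normally come from locals defined alongside this file
  vm_count = 350
  vm_size  = "Standard_D4s_v5"
}
```

Pinning `?ref=` to a tag is what makes the per-environment repos able to consume different module versions independently.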

So… it looks sorta like this:

EnvRepo (the simple root module, and all the config for a given env):

# each env has a simple root module that invokes the product module, 
# and several files defining the env's config
... etc

ModuleRepo (the TF that defines our architecture, and calls child resource modules):

# creates product-scope resources and calls product child modules

# product child modules create collections of related resources, 
# either directly creating resources or by calling resource modules

# resource modules define reusable individual or groups of 
# related resources (e.g. VM and its disks, nic, etc.)
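A resource module of the kind described (a VM with its NIC and disks) might be sketched like this. This is an illustrative outline only; all variable names are assumptions, and real modules would carry more inputs:

```hcl
# modules/vm/main.tf -- illustrative sketch; all variable names are assumptions

resource "azurerm_network_interface" "this" {
  name                = "${var.vm_name}-nic"
  location            = var.location
  resource_group_name = var.resource_group_name

  ip_configuration {
    name                          = "primary"
    subnet_id                     = var.subnet_id
    private_ip_address_allocation = "Dynamic"
  }
}

resource "azurerm_linux_virtual_machine" "this" {
  name                  = var.vm_name
  location              = var.location
  resource_group_name   = var.resource_group_name
  size                  = var.vm_size
  admin_username        = var.admin_username
  network_interface_ids = [azurerm_network_interface.this.id]

  admin_ssh_key {
    username   = var.admin_username
    public_key = var.ssh_public_key
  }

  os_disk {
    caching                = "ReadWrite"
    storage_account_type   = "Premium_LRS"
    disk_encryption_set_id = var.disk_encryption_set_id
  }

  source_image_reference {
    publisher = "Canonical"
    offer     = "0001-com-ubuntu-server-jammy"
    sku       = "22_04-lts-gen2"
    version   = "latest"
  }
}

# data disks would follow the same pattern via azurerm_managed_disk
# plus azurerm_virtual_machine_data_disk_attachment
```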

I’m wondering if anyone can share their perspective and experience from similarly large environments stamped out globally.

Does this “monolithic” Terraform make sense? There are interdependencies between the different resources, so while I might break out some large parts and run them as a collection of smaller stages individually, I’d still need a final stage that brings it all together and plans/applies changes to all resources defined in a given region, so I can’t easily break it up into smaller discrete configurations.
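For what it’s worth, the usual pattern when splitting into smaller configurations is to have downstream stacks consume upstream outputs via the `terraform_remote_state` data source, rather than one giant final stage. The backend details and names here are hypothetical:

```hcl
# in a downstream "compute" configuration: read outputs from a "network" stack
data "terraform_remote_state" "network" {
  backend = "azurerm"
  config = {
    resource_group_name  = "tfstate-rg"       # hypothetical
    storage_account_name = "tfstatestore"     # hypothetical
    container_name       = "tfstate"
    key                  = "prod-eastus-network.tfstate"
  }
}

module "product" {
  source    = "../modules/product"
  subnet_id = data.terraform_remote_state.network.outputs.subnet_id
}
```

This only works cleanly when the dependency graph between stacks is acyclic, which is essentially the interdependency question being asked here.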

Do you find breaking it up into smaller chunks is helpful, and what are the advantages?

Are there concerns/constraints with the approach I’m taking?

I’m looking into ways to make planning and provisioning fast, e.g. the -parallelism parameter, and have another thread on that specific topic. If there are other tricks, I’d be interested.
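The flags I’ve been experimenting with so far look like this (the module address in the -target example is hypothetical):

```shell
# raise concurrent operations from the default of 10
terraform plan -parallelism=50 -out=tfplan
terraform apply -parallelism=50 tfplan

# skip the state refresh for routine config-only changes
# (plans against possibly stale state, so use deliberately)
terraform plan -refresh=false

# scope a run to a subtree while iterating (use sparingly;
# not intended for routine applies)
terraform plan -target=module.product.module.vm_fleet
```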