Advice on development workflows

Hi,

I’m currently managing the infrastructure for a fairly complex application. The application runs elastic user workloads, so it currently runs on Kubernetes. We host on AWS, with EKS for Kubernetes and RDS for PostgreSQL for some application data.

We keep the source code for the application and its infrastructure in the same repo.

For general development, we support a few setups: either running the whole infrastructure locally, with Kind for the Kubernetes parts, or running the Kubernetes parts on AWS for specific purposes (low-capacity workstations, GitHub PR previews).

This requires some extra work, but it works well enough. We have a shared Kubernetes cluster for the AWS development environments, and we replace RDS with PostgreSQL running on that cluster. Developers can provision environments by following a few simple steps. Much of our Terraform code is used to build those development environments, so even a good deal of Terraform development can be done on them.

However, we currently don’t have a “good” solution for developing against full production-like environments.

Provisioning a production-like environment is very slow and inconvenient: RDS and EKS, among many other components, take ages to provision and deprovision. So we only set up a full production-like environment for development when we make major changes.

For minor changes, we rely on production and a staging environment: we develop the change and run Terraform manually against staging while developing and testing. We do this as part of a PR, and when the PR is merged, we apply it to production.
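
To make the manual staging loop described above concrete, it looks roughly like the sketch below. This is a minimal illustration in Python, not our actual tooling: the `envs/staging` directory and the `staging.tfplan` file name are hypothetical. The Terraform options used (`-chdir`, `-input=false`, `plan -out`, and `apply` with a saved plan file) are standard CLI usage.

```python
import shlex


def staging_commands(env_dir="envs/staging", plan_file="staging.tfplan"):
    """Return, in order, the Terraform commands run by hand against staging.

    env_dir and plan_file are hypothetical names used for illustration only.
    """
    base = ["terraform", f"-chdir={env_dir}"]
    return [
        # Initialise providers and the backend for the staging configuration.
        base + ["init", "-input=false"],
        # Save the plan so that what gets reviewed is exactly what gets applied.
        base + ["plan", "-input=false", f"-out={plan_file}"],
        # Apply the saved plan rather than re-planning at apply time.
        base + ["apply", "-input=false", plan_file],
    ]


if __name__ == "__main__":
    # Print the commands instead of running them, so the sketch is safe to try.
    for cmd in staging_commands():
        print(shlex.join(cmd))
```

The sketch only prints the commands. Each step is still run and checked by a person, which is where the tedium comes from.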

This actually works quite well, but I have two issues:

  • Concurrent work is not possible, since there is a single staging environment. So far I’m the only one working full-time on infra, and other team members only touch infrastructure sporadically, but we’re likely to add a second person focused on infrastructure work.
  • There’s still quite a bit of manual work. We don’t make many errors, but it’s tedious.

What do people do in general?

  • Do people feel confident making changes to Terraform without applying them? `terraform plan` is helpful, but I think it’s often not possible to verify that your changes are correct just by looking at its output.
  • Do people just bite the bullet and take the pain of spinning up full environments for any changes that need them, even if it takes an hour?
  • … or maybe you keep a full environment for every person working on infrastructure?

I sometimes create temporary partial environments for some changes, with only the bits I’m working on. This works fairly well because in many cases I only need to make changes to EKS and can avoid dealing with RDS and the rest of the components. But it feels a bit ad hoc and improvised.
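
As a rough illustration of what I mean by a partial environment, here is a minimal Python sketch. It assumes the Terraform configuration exposes one boolean variable per component, named `enable_eks` and `enable_rds`. Those variable names are hypothetical and only there to show the idea. `terraform plan -var name=value` itself is standard CLI usage.

```python
def partial_env_plan(components, all_components=("eks", "rds")):
    """Build a `terraform plan` command that enables only some components.

    Assumes hypothetical boolean variables named enable_<component> exist
    in the Terraform configuration.
    """
    unknown = set(components) - set(all_components)
    if unknown:
        raise ValueError(f"unknown components: {sorted(unknown)}")

    cmd = ["terraform", "plan", "-input=false"]
    for name in all_components:
        # Set every toggle explicitly so the plan does not depend on defaults.
        enabled = "true" if name in components else "false"
        cmd += ["-var", f"enable_{name}={enabled}"]
    return cmd


if __name__ == "__main__":
    # An EKS-only environment: RDS is switched off.
    print(" ".join(partial_env_plan(["eks"])))
```

For an EKS-only change, this builds `terraform plan -input=false -var enable_eks=true -var enable_rds=false`.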

Thoughts? Opinions? Experiences?

Cheers,

Álex