Hi @darkn3rd,
Let me start off by saying that I don’t think Terraform is the right tool for your use-case. In fact, everything you want to achieve is doable the “Kubernetes-way”, without writing a single line of HCL (see more about this at the end).
First of all, a Kubernetes Service should not fail just because the Pods never reached a healthy state. A Service can exist without any Endpoints, and Kubernetes should create the Load Balancer resource regardless. The Terraform provider, however, may expect a different response from the Kubernetes API, which could be part of the problem.
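As a concrete illustration, here is a minimal sketch with the hashicorp/kubernetes provider (the names are made up; as far as I remember, the kubernetes_service resource exposes a wait_for_load_balancer argument that controls whether Terraform waits for the load balancer before reporting success):

```hcl
# Minimal sketch, assuming the hashicorp/kubernetes provider's kubernetes_service
# resource supports the wait_for_load_balancer argument. Setting it to false means
# Terraform won't block on (or fail because of) the load balancer provisioning.
resource "kubernetes_service" "app" {
  metadata {
    name = "app"
  }

  spec {
    type = "LoadBalancer"

    selector = {
      app = "app"
    }

    port {
      port        = 80
      target_port = 8080
    }
  }

  wait_for_load_balancer = false
}
```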
The biggest problem I see though is depending on a resource that’s not being created through Terraform itself. Even though the k8s Service is, the Load Balancer isn’t (it’s created by a k8s controller). There are tons of things that can go wrong in that scenario. Kubernetes can wait forever to reconcile a resource that doesn’t match its desired state. Terraform can’t, so it must make a decision.
I get your point that the data source could return an empty object. This is actually true for data sources that return lists (e.g. aws_subnets). But if you’re adding a data source that references a single object, it’s because you want to use it somewhere else. If that object doesn’t exist, there is no point in moving forward.
In the specific case of your ELB, let’s assume it returned an empty object. When you try to reference its dns_name or id in the aws_route53_record resource, Terraform will raise an error, since either records or an alias block must be specified. What do you do then? Even if you added some sort of condition to work around this, you would still need to re-run terraform apply. Keep reading.
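To make that concrete, here is a hypothetical sketch of the pattern (the ELB name, zone ID, and record name are made up for illustration):

```hcl
# Hypothetical sketch: a Route53 alias pointing at an ELB that was created by a
# Kubernetes controller rather than by Terraform.
variable "zone_id" {
  type = string
}

data "aws_elb" "ingress" {
  # Classic ELB created by the Kubernetes service controller, outside Terraform.
  name = "a1b2c3d4e5example"
}

resource "aws_route53_record" "app" {
  zone_id = var.zone_id
  name    = "app.domain.com"
  type    = "A"

  alias {
    # If the data source had returned an empty object instead of failing,
    # these references would have nothing valid to resolve to.
    name                   = data.aws_elb.ingress.dns_name
    zone_id                = data.aws_elb.ingress.zone_id
    evaluate_target_health = true
  }
}
```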
Why should Terraform delete the resources it manages because of a failure it’s not capable of tracking? If the resources were already created, why would you need to destroy them? Your apply may be incomplete, but a new plan will figure out the differences and create the missing resources (once you’ve resolved the problem with your external resources, which are not managed by Terraform). I’m not excluding the possibility of a weird behavior in the Kubernetes provider, but in general you wouldn’t need to recreate failed resources, because they wouldn’t have been added to the state in the first place. If that is happening in your case, maybe you should raise a bug on the provider’s GitHub repo.
Depending on what you have to do to fix the problem that occurred outside of Terraform’s control, you may need to reimport the resources it manages, which brings me to my next topic.
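For reference, reimporting is done with the terraform import command, or, on Terraform 1.5 and later, with an import block in the configuration. A hypothetical sketch (the resource address and ID are made up):

```hcl
# Hypothetical example: re-adopting an existing Route53 record into the state.
# For aws_route53_record the import ID follows the pattern ZONEID_RECORDNAME_TYPE.
import {
  to = aws_route53_record.app
  id = "Z0123456789ABC_app.domain.com_A"
}
```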
It’s true that Terraform has some limitations in that regard, because it uses HCL, which is a declarative configuration language, not a procedural one, and it does assume a few things:
- If a resource already exists, it should either be read as a data source or be imported into the state as a managed resource
- If it doesn’t exist, Terraform will create it and manage it
- Once managed, a resource should not be modified outside of Terraform, because Terraform will attempt to return it to its desired state
Some of these limitations could be easily overcome by using the Terraform CDK. That does not mean that Terraform as a platform is a “FAIL” though.
=== Alternatives ===
Based on the use case you described, I’d probably use the following approach:
- ACM certificate with a wildcard covering all k8s Ingresses (e.g. “*.domain.com”)
- AWS Load Balancer Controller as the Ingress controller
  - it manages ALB listener rules, target groups, and ACM certificates for Ingresses automatically
- Kubernetes deployments via FluxCD or ArgoCD using Helm charts
  - the charts include an Ingress template for services that need to be exposed
- ExternalDNS makes sure the Ingress hostnames are created in Route53 as CNAME records pointing to the ALB’s DNS name