Proper way to deal with transient resources?

bcsgh · February 10, 2021, 9:58pm

To give a specific case (though I’m seeking a general answer), say I’m using DNS validation to validate an X.509 cert generated via AWS’s ACM service:

resource "aws_acm_certificate" "example" {
  domain_name = "sub.example.com"
  subject_alternative_names = ["sub.region.example.com"]
  validation_method = "DNS"
  lifecycle { create_before_destroy = true }
}

resource "aws_route53_record" "example" {
  for_each = {
    for dvo in aws_acm_certificate.example.domain_validation_options : dvo.domain_name => {
      name    = dvo.resource_record_name
      record  = dvo.resource_record_value
      type    = dvo.resource_record_type
      zone_id = data.aws_route53_zone.emsicloud.zone_id
    }
  }

  allow_overwrite = true
  name            = each.value.name
  records         = [each.value.record]
  ttl             = 60
  type            = each.value.type
  zone_id         = each.value.zone_id
}

resource "aws_acm_certificate_validation" "example" {
  certificate_arn         = aws_acm_certificate.example.arn
  validation_record_fqdns = [for record in aws_route53_record.example : record.fqdn]
}

Okay, this works, but those DNS records should not exist after the validation is finished and the aws_acm_certificate_validation should not exist after those DNS records are deleted.

Now I run into the problem: terraform seems to go to significant lengths to prevent me from automating the correct sequence of operations for handling those resources. If I make the resources be created when aws_acm_certificate.example.status=="PENDING_VALIDATION" and then destroyed when that becomes ISSUED, then all I should need to do is keep applying the config until it stabilizes. But I get an error message to the effect of:

The "count/for_each" value depends on resource attributes that cannot be determined
until apply, so Terraform cannot predict how many instances will be created.
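
A minimal sketch of the kind of status-gated config that runs into this error. The `locals` block and the `count` expression are assumptions about how the gating might be written, not the exact code that produced the message. Whenever `status` is unknown at plan time (for example, before the certificate exists), `count` cannot be evaluated and the plan is rejected.

```hcl
locals {
  # Assumption: gate the transient resources on the certificate's status.
  pending = aws_acm_certificate.example.status == "PENDING_VALIDATION"
}

resource "aws_acm_certificate_validation" "example" {
  # Fails at plan time whenever local.pending is not yet known.
  count = local.pending ? 1 : 0

  certificate_arn         = aws_acm_certificate.example.arn
  validation_record_fqdns = [for record in aws_route53_record.example : record.fqdn]
}
```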

What is the intended way for terraform to deal with this sort of resource, where it should only exist at all during particular stages of deployment and then be removed? Or more generally: what is the proper way to deal with a case where the number of resources to be created can only be determined by inspection of the external result of applying other parts of the configuration?

I’m well aware that the terraform philosophy is that configurations should describe only the static end state and an apply should jump right to it. However in some cases, including this one, reality doesn’t allow that; the only way to get to the proper end state is to walk through a number of intermediate steps.

Currently I’m working around this by manually manipulating a variable, but that’s error prone as people who don’t know about it will do the wrong thing regardless of what defaults I set.
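
Roughly, the manual-variable workaround can be sketched like this. How the toggle is wired into `for_each` is an assumption for illustration; the resource and data source names are taken from the config at the top of the post.

```hcl
# Toggle that someone has to flip by hand (this is the error-prone part).
variable "pending_validation" {
  type        = bool
  default     = false
  description = "Set to true only while the certificate is waiting on DNS validation."
}

resource "aws_route53_record" "example" {
  # The trailing `if` filters out every element when the toggle is false,
  # so the validation records are destroyed on the next apply.
  for_each = {
    for dvo in aws_acm_certificate.example.domain_validation_options : dvo.domain_name => {
      name   = dvo.resource_record_name
      record = dvo.resource_record_value
      type   = dvo.resource_record_type
    }
    if var.pending_validation
  }

  allow_overwrite = true
  name            = each.value.name
  records         = [each.value.record]
  ttl             = 60
  type            = each.value.type
  zone_id         = data.aws_route53_zone.emsicloud.zone_id
}
```

Because the variable is known at plan time, this avoids the count/for_each error, at the cost of relying on whoever runs the apply to set it correctly.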

Another option would be to do all that work in a provisioner, but then I lose all the advantages of terraform with regards to the associated resources.

Another solution I have reason to suspect would work would be something disgusting like:

variable "pending_validation" { default = false }

resource "local_file" "foo" {
    content = "pending_validation=${(aws_acm_certificate.example.status == "PENDING_VALIDATION")}"
    filename = "pending_validation.auto.tfvars"
}

and then switch everything based on var.pending_validation.

Does terraform really have no answer to this use case?

apparentlymart · February 11, 2021, 12:35am

Hi @bcsgh,

As you’ve noted, Terraform’s core model doesn’t include any sense of ephemeral or transient objects: at the core runtime’s level of abstraction, we’re describing a desired end state where a set of objects ought to exist; it’s typically the provider’s responsibility to deal with whatever oddities the underlying API might present that disagree with this model, which tends to involve adding resources that don’t directly map to underlying API operations but instead combine together a series of steps into a single operation which ends in the final desired state.

I’m not very familiar with this part of the AWS provider, but from looking briefly at the implementation of aws_acm_certificate_validation I see that it is the sort of “virtual resource type” I was describing above, where it doesn’t really map to a real object in the AWS API but instead serves to adapt an AWS API operation to behave as if it were a long-lived object Terraform Core can understand. As far as I can tell, it doesn’t actually represent a distinct object in the AWS API, but instead just models the event of a certificate becoming validated so that other parts of your module can depend on that operation having completed.

However, it does seem to be true – unless there’s another part of the provider API that I’m unaware of – that the provider design assumes that you’ll keep the DNS record in your DNS zone indefinitely, rather than delete it after the validation completes. Given that, I don’t see any obvious way to achieve the result you are looking for, though perhaps someone more familiar with this part of the provider can suggest something.

bcsgh · February 11, 2021, 3:06am

As I noted at the start of my question: this is but one case where there may be need for an ephemeral or transient instance of a kind of resource that a terraform provider (very justifiably) provides first class support for.

Another one is an aws_instance that exists only as the basis for an aws_ami_from_instance. Once the AMI is generated the EC2 instance can, should and (probably, for billing reasons) must be shut down. I have a case where I’m doing that via some “interesting” hacks.
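
As a minimal sketch of that shape (the resource names, AMI ID, and instance type are placeholders, not from a real config):

```hcl
# Hypothetical builder instance that is only needed until the AMI exists.
resource "aws_instance" "builder" {
  ami           = "ami-00000000000000000" # placeholder ID
  instance_type = "t3.micro"
}

# Long-lived artifact built from the transient instance.
resource "aws_ami_from_instance" "golden" {
  name               = "golden-image"
  source_instance_id = aws_instance.builder.id
}
```

As long as both blocks stay in the configuration, the builder instance is part of the declared end state and is kept in existence; removing its block also removes the value that `source_instance_id` refers to.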

Given I know of two examples off the top of my head and within a few small corners of a single provider, I strongly suspect that this kind of thing is not uncommon. Given that terraform almost supports these operations natively, I suspect its options are to provide proper support or accept that people who continue to use the tool are going to hack around these restrictions. Expecting people to alter their goals to conform to how terraform wants to be used is not a realistic expectation.

If terraform would require a significant change in its model to make this work then I’d like to throw something into the arena of consideration to get the conversation going.

Add three features:

1. Add an object class, alongside resource and data, for sequencing: something that would act like a resource as far as dependencies and lifetimes go, but where the intended use is to wrap wait-on-condition operations. This would serve much the same role as some classes of provisioners, but could be provided directly and cleanly as part of a provider. (Other AWS examples for this would be: waiting for an EC2 instance to join a load balancer target group and go healthy, waiting for an aws_ami_from_instance to become available, etc. I’m currently dealing with both of these exact cases using esoteric bash commands in a provisioner. Being able to outsource that complexity to the provider would be very nice.)
2. Add a new class of provisioner that temporarily creates other generic resources. Terraform would not consider the parent resource created until the child resources had finished being created, and would then destroy them at “some well defined point in time”.
3. Add the option for a provisioner to specify when = before_create and let that capture the result so that the enclosing resource can use it in its parameters. (This would act much like depending on a null_resource with a provisioner, but would be IMHO cleaner.)
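
For context, the kind of provisioner-based wait that the first proposal would replace might look like the sketch below. All resource names are hypothetical, and it assumes the AWS CLI is installed and configured on the machine running the apply; `aws elbv2 wait target-in-service` is the CLI's built-in waiter for target health.

```hcl
resource "null_resource" "wait_for_target" {
  # Re-run the wait whenever the instance is replaced.
  triggers = {
    instance_id = aws_instance.app.id
  }

  # Hypothetical attachment resource; the wait only makes sense once
  # the instance has been registered with the target group.
  depends_on = [aws_lb_target_group_attachment.app]

  provisioner "local-exec" {
    command = "aws elbv2 wait target-in-service --target-group-arn ${aws_lb_target_group.app.arn} --targets Id=${aws_instance.app.id}"
  }
}
```

Anything that must only run after the target is healthy can then depend on `null_resource.wait_for_target`, which is the sequencing role the proposed object class would take over.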

Between those three feature proposals, I think I could avoid the bulk of the hacks I’ve found myself considering/implementing.

apparentlymart · February 11, 2021, 5:40pm

For aws_ami_from_instance in particular, since I happen to be the person who originally implemented that I can at least say that at the time we did it (many years ago, now) the model for what is and is not a good application of Terraform was still immature, and so that resource type came more out of a sense of completionism while implementing the more useful aws_ami_copy than with any particular use-case in mind.

Indeed, HashiCorp Packer exists as a separate tool from Terraform largely because it has a different workflow and model that is centered around using transient objects to build long-lived artifacts. The discussion of introducing some similar sense of transient objects into Terraform has certainly arisen before, both in isolation like this and also in the grander sense of “what if Packer and Terraform were merged?”.

I think the honest truth is that there isn’t currently the bandwidth to delve into that sort of revolutionary-type design work right now, since the Terraform team is currently focused on getting the current features in good shape for a 1.0 release with longer-term support, and the Packer team is doing a great job of working independently on their separate slice of the build/provision problem space.

I suspect that in the long run there will be more opportunities to discuss convergence, including dealing with some more varied use-cases like your AWS ACM certificate which I think would be a pretty strange application for Packer (though I expect technically not impossible to represent as a Packer builder, just unusual). But I wouldn’t expect to be able to engage with this in a lot of detail for a while yet.

bcsgh · February 11, 2021, 6:39pm

To keep idea #1 from going missing: Provider supplied provisoner implementation. · Issue #27748 · hashicorp/terraform · GitHub

ohmer · March 1, 2021, 12:42pm

You might not have seen why you should keep the DNS entry:

If ACM cannot automatically validate a domain name, it notifies the domain owner that manual action is needed to validate the domain and complete certificate renewal. These notifications are sent at 45 days, 30 days, seven days, and one day prior to expiration. The most common reason for automatic validation to fail is that the required CNAME has been inadvertently changed or removed.

(source: Renewal for domains validated by DNS - AWS Certificate Manager)

bcsgh · March 1, 2021, 4:47pm

A valid point, however:

There is an issue with leaving them around: if you have multiple certificates with multiple domain names that overlap (say a different cert with example.com and ${REGION}.example.com for each of several region-specific configurations) then the CNAME for the overlapping domain ends up (IIRC) being the same, and then which configuration owns that resource? Or if you make multiple CNAME records, can you be sure that the correct value gets resolved when the ACM validator does its lookup? Does that even work in general, e.g. if the certs aren’t per AWS region?

These are not insurmountable issues, but they impose further constraints and add complexity. The simplest solution is to have the validation records be transiently created by each configuration only when needed. Then, all that is needed is to re-apply the configuration at a regular interval, which IMHO is something that should be happening already.

stuart-c · March 1, 2021, 5:12pm

I would expect that in the normal situation the CNAME records would persist. One big advantage of ACM with DNS validation is that renewals are seamless - you get an email confirming the certificate was updated, but there is no downtime risk or action required.

If you use ACM via the web UI there is a button to add the CNAME into Route53 but there isn’t anything to remove it (you’d have to go to the zone yourself and find the correct entry), as the expectation is again that you create the record and leave it there forever. There is also no guarantee that a certificate will instantly be issued - it can take a while, so you don’t know when it would be “safe” to remove the CNAME.

Certificates which require the same CNAME record can be an issue (I don’t know what details are used when generating the name and so what might cause overlaps), but this would still be a problem even if you had this idea of a transient resource - it would be quite possible for the creation of two certificates to overlap.

bcsgh · March 1, 2021, 7:04pm

I suspect you are correct with regards to AWS/ACM’s expectations. The issue is how to make that cleanly and reliably work with Terraform’s expectations?

TF already deals with the “is ACM done issuing the cert” issue via the aws_acm_certificate_validation pseudo-resource. And transient resources would also work correctly (hypothetically) because a TF config should never delete a record that it didn’t create and should fail if it attempts to create an already existing record. Worst case, you can only run a single apply at a time, but if each configuration apply is retried until it gets to a successful no-op state, then things will converge correctly.

In this specific case (which is not really the only type of interesting resource) things may generally work assuming: 1: the same CNAME value gets used by every cert-validation, 2: each overlapping domain can be tied to a region and 3: nothing complains about using latency routed CNAME records. I suspect all three of those are true, but is that by chance or by design? If by chance, I’d rather not build something that depends on that not changing.

ohmer · March 1, 2021, 9:21pm

I don’t see an issue with Terraform or ACM here. The DNS validation entry can only overlap in a specific situation which is a wildcard certificate.

Look at the table at the bottom of Option 1: DNS Validation - AWS Certificate Manager

bcsgh · March 1, 2021, 10:13pm

The only case where different domain names overlap is that case. But I’m thinking of the case where different certs include the same domain name. (For example say that each region has an ALB addressed by a per-region record and latency-based-routing CNAME records are used to direct traffic to the closest healthy instance. Then each ALB needs a certificate including every domain name that resolves to it, which will be different for each ALB, but with overlap between them.)

As best I can tell, if you create two certs (e.g. in two different configurations) each with two different domains:

Cert A: example.com, us-east-1.example.com
Cert B: example.com, us-west-1.example.com

The requested CNAME records for validating ownership of example.com will be identical for both certs. Which configuration should own that resource? Or if they are in the same config (say multiple instantiations of the same module for different regions), which instantiation should provide that record?
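
In config terms, the Cert A / Cert B setup above is roughly the following sketch. The resource labels are hypothetical, and the two blocks would normally live in separate configurations or module instances.

```hcl
# Cert A, e.g. owned by the us-east-1 configuration.
resource "aws_acm_certificate" "a" {
  domain_name               = "example.com"
  subject_alternative_names = ["us-east-1.example.com"]
  validation_method         = "DNS"
}

# Cert B, e.g. owned by the us-west-1 configuration.
resource "aws_acm_certificate" "b" {
  domain_name               = "example.com"
  subject_alternative_names = ["us-west-1.example.com"]
  validation_method         = "DNS"
}
```

Each certificate reports a validation entry for example.com in its domain_validation_options, so if both configurations feed those into an aws_route53_record for_each like the one in the first post, both try to manage a record with the same name.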

AWS/ACM doesn’t assume anything about how the records will be created, managed or owned, so it doesn’t have any issue with something being required/shared by multiple other things. But a good chunk of the point of TF is exactly that ownership which then becomes an issue.

I don’t think there is a good solution here, but allowing more options improves the likelihood that one of those options will fit the specific needs of any given end user.

stuart-c · March 1, 2021, 10:56pm

While multiple certificates with overlapping validation CNAMEs is an issue, to some extent this can be limited by the careful selection of the primary name and SANs.

This is however a different issue to the one you were initially discussing - the DNS records should continue to exist after the initial validation to allow for renewals, so the concept of a transient resource wouldn’t help.

ohmer · March 1, 2021, 11:37pm

Ah ok, I think I got it now.

I think there is a solution but not with ACM and managed renewals.

When using ACM and managed renewals, you get a cert signed by an AWS trusted root and it’s easy to deploy it to integrated services (Services Integrated with AWS Certificate Manager - AWS Certificate Manager). Since it is an extension of their trust domain, they can’t let customers access the private key, otherwise it would break the trust chain. ACM does not offer, by design, the capacity to copy a managed cert across accounts or regions. That’s probably why you went for a certificate per region with SANs.

You could create your own key pair and import it into ACM to expose it via integrated services. It’s not trivial because you need an external CA and have to deal with secrets management (Vault as an intermediate CA or Vault as a secret manager is a way of doing this). You will have to implement the renewal yourself or make yourself the promise that you will be better than Google and Microsoft at not forgetting to renew the certificate in time.

Maybe AWS Global Accelerator would avoid this ACM limitation.

All that said, this is not a Terraform or IaC problem but a design with AWS problem.

bcsgh · March 1, 2021, 11:44pm

@stuart-c: As far as I know, ACM validation doesn’t distinguish between the primary and alternate names; it requires validation of all of them independently. Swapping those around won’t make any difference.

For the sake of argument, however, let’s assume you are correct about leaving the resource up: which TF configuration should own the record for the common domain name? The available answers are:

1. One of them, arbitrarily selected.
2. All of them.
3. None of them.
4. Something else.

Now what happens when some of the certificates are to be deleted? Regardless of which above choice is selected, some cases go wrong:

1. If the arbitrarily selected cert is deleted the others will fail to validate.
2. Regardless of which cert is deleted, the others will fail to validate.
3. The validation will remain active even after the last cert is deleted.
4. Have fun keeping things in sync.

Using transient records avoids those issues, but trades them for a different set of problems when it comes time to renew. I’m personally more inclined towards that second set of problems as it can be made more contained: the problem only exists when you see it.

Another point: if you choose to leave the records up, then by default anyone with access to ACM (via your account) can generate a cert including that domain, even if they don’t have access to the DNS config. Depending on situational details, that may or may not be a problem.

Practically, the best solution here would be for AWS to define that the DNS name/value parts are always a one-to-one mapping: the only cases where names match are those where the value also matches. With that, duplicate CNAME records are no longer a risk, and each cert’s config can own its own copies of all the needed validations.

bcsgh · March 1, 2021, 11:58pm

@ohmer actually the reason for the different certs is the expectation that the cert should exactly match the list of domains that could resolve to that server/ALB/endpoint/etc. Even with externally generated certs that can be installed in multiple regions, I’d still go with the same setup.

Also, there is a TF issue here: for things to work correctly, multiple different configurations, at various points in time, require the same resource to exist. Which config should own that resource? The ACM/CNAME case is just a concrete example, and I suspect not the only one.

The options for dealing with this case are, unless I’m missing one:

1. Create the resource transiently by TF only while needed.
2. Allow “shared ownership” between TF configs of the same resource.
3. Figure out a way to create multiple “equivalent” resources that the external infra doesn’t bother differentiating between.

#1 and #2 are the only ones that TF can actually do anything about. (And there are cases where #2 is the only viable option, e.g. where external constraints demand the resource must exist persistently. But that is a way more complicated ask, so I didn’t.)

stuart-c · March 2, 2021, 1:31am

How would you know how long a resource is needed?

bcsgh · March 2, 2021, 1:59am

Which resource? The certs themselves? That’s presumed to be “otherwise already known”.

The CNAME records? That’s the $1k question. It should be “as long as any cert, in any config, needs it”. But that’s a really hard problem to resolve when you don’t have any way of telling what other configs exist. I don’t know how to solve that. Transient resources in this case are a workaround for that problem: rather than keeping them “as long as needed” it switches to deleting them “as soon as possible”.

Of course that presumes the cert/CNAME case, and different reasoning would apply for other uses.

Or are you asking in general at what point in the apply cycle transient resources would be destroyed? I suspect that would require some tuning parameters, but I’d guess the default should be as a phase after all the permanent resources finish being created; all creates come before all deletes.

ohmer · March 2, 2021, 3:07am

@bcsgh The problem here is not Terraform but an ACM limitation/design/feature. You can try harder to bend Terraform but I don’t think your proposals are valid.

ohmer · March 2, 2021, 3:09am

Disclaimer: I am not a HashiCorp employee, just a long time user (since 0.6) who has been building on AWS for many years.

bcsgh · March 2, 2021, 4:34am

I understand your opinion. I believe you are wrong.

I assert that, as far as Terraform should be concerned, every problem is either a Terraform problem, or something that Terraform chooses to not try to solve. Neither of those is “the infrastructure’s problem/limitation/etc”.

What I’m considering here is a configuration problem. One that can be solved if no constraints are placed on how I interact with AWS. That puts it squarely in the scope of Terraform’s claimed mandate. It may be off in a corner, but it’s clearly in the class of problem that Terraform claims it is trying to solve.

If Terraform considers “dealing with valid design choices by infrastructure that it purports to support managing” as something it can decline to do by saying the problem is in someone else’s infrastructure, then I suspect Terraform will not have a long term future. (And limiting to “valid choices” is arguably a mistake.) OTOH acknowledging that it’s a limitation of Terraform for which no acceptable solution has been found? That can be workable. (Something being your fault is good because you can fix that. If someone else is at fault, you are dependent on them caring enough to fix it.)

Tools like Terraform don’t have the option of deciding what people want to use them for, nor what people need to use them for. At best, they can choose which people will use them at all. Choose too narrowly and nobody uses it, too broadly and it becomes feature soup. The happy medium is to decide which users to turn away, because their use case adds too much complexity for the other users, or because there is another product that does a better job, or that’s not the class of problem being solved, etc.
