I suspect you are correct with regard to AWS/ACM's expectations. The question is how to make that work cleanly and reliably with Terraform's expectations.
TF already deals with the "is ACM done issuing the cert" issue via the aws_acm_certificate_validation pseudo-resource. And transient resources would also (hypothetically) work correctly, because a TF config should never delete a record that it didn't create and should fail if it attempts to create an already-existing record. Worst case, you can only run a single apply at a time, but if each configuration's apply is retried until it reaches a successful no-op state, then things will converge correctly.
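For reference, the usual DNS-validation idiom looks roughly like this. This is a sketch: the hosted zone ID and domain are placeholders, and the for_each shape follows the provider's documented pattern for `domain_validation_options`.

```hcl
resource "aws_acm_certificate" "cert" {
  domain_name       = "example.com"
  validation_method = "DNS"
}

# One validation record per domain on the certificate.
resource "aws_route53_record" "validation" {
  for_each = {
    for dvo in aws_acm_certificate.cert.domain_validation_options :
    dvo.domain_name => {
      name   = dvo.resource_record_name
      type   = dvo.resource_record_type
      record = dvo.resource_record_value
    }
  }

  zone_id = "Z0000000000000000000" # placeholder hosted zone ID
  name    = each.value.name
  type    = each.value.type
  ttl     = 60
  records = [each.value.record]
}

# Blocks the apply until ACM reports the certificate as issued.
resource "aws_acm_certificate_validation" "cert" {
  certificate_arn         = aws_acm_certificate.cert.arn
  validation_record_fqdns = [for r in aws_route53_record.validation : r.fqdn]
}
```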
In this specific case (which is not really the only type of interesting resource) things may generally work, assuming: (1) the same CNAME value gets used by every cert validation, (2) each overlapping domain can be tied to a region, and (3) nothing complains about using latency-routed CNAME records. I suspect all three of those are true, but is that by chance or by design? If by chance, I'd rather not build something that depends on that not changing.
That is the only case where different domain names overlap. But I'm thinking of the case where different certs include the same domain name. (For example, say that each region has an ALB addressed by a per-region record, and latency-based-routing CNAME records are used to direct traffic to the closest healthy instance. Then each ALB needs a certificate including every domain name that resolves to it, which will be different for each ALB, but with overlap between them.)
As best I can tell, if you create two certs (e.g. in two different configurations), each covering a different pair of domains but both including example.com:
The requested CNAME records for validating ownership of example.com will be identical for both certs. Which configuration should own that resource? Or, if they are in the same config (say, multiple instantiations of the same module for different regions), which instantiation should provide that record?
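Concretely, the collision looks like this: both configurations end up declaring the same record for example.com, and whichever applies second fails because the record already exists. All names and values below are illustrative; ACM generates the real ones.

```hcl
# Rendered by config A (us-east-1 cert) AND config B (eu-west-1 cert),
# because ACM hands both the same validation record for example.com:
resource "aws_route53_record" "validation_example_com" {
  zone_id = "Z0000000000000000000"           # the shared hosted zone
  name    = "_3c1a2b.example.com"            # identical in both configs
  type    = "CNAME"
  ttl     = 60
  records = ["_9f8e7d.acm-validations.aws."] # identical in both configs
}
```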
AWS/ACM doesn't assume anything about how the records will be created, managed, or owned, so it doesn't have any issue with something being required/shared by multiple other things. But a good chunk of the point of TF is exactly that ownership model, which then becomes an issue.
I don’t think there is a good solution here, but allowing more options improves the likelihood that one of those options will fit the specific needs of any given end user.
While multiple certificates with overlapping validation CNAMEs is an issue, to some extent this can be limited by the careful selection of the primary name and SANs.
This is, however, a different issue from the one you were initially discussing: the DNS records should continue to exist after the initial validation to allow for renewals, so the concept of a transient resource wouldn't help.
I think there is a solution but not with ACM and managed renewals.
When using ACM and managed renewals, you get a cert signed by an AWS-trusted root, and it's easy to deploy it to integrated services (Services Integrated with AWS Certificate Manager - AWS Certificate Manager). Since it is an extension of their trust domain, they can't let customers access the private key; otherwise it would break the trust chain. ACM does not offer, by design, the capacity to copy a managed cert across accounts or regions. That's probably why you went for a certificate per region with SANs.
You could create your own key pair and import it into ACM to expose it via integrated services. It's not trivial, because you need an external CA and have to deal with secrets management (Vault as an intermediate CA, or Vault as a secrets manager, is one way of doing this). You will have to implement the renewal yourself, or make yourself the promise that you will be better than Google and Microsoft at not forgetting to renew the certificate in time.
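The import side of that is straightforward in Terraform; the hard part is everything around it. A minimal sketch, assuming the PEM material already exists (in practice the key would come from a secrets manager rather than local files):

```hcl
# Importing an externally issued certificate into ACM. File paths are
# placeholders; renewal is entirely your responsibility with this setup.
resource "aws_acm_certificate" "imported" {
  private_key       = file("cert.key")   # placeholder path
  certificate_body  = file("cert.pem")   # placeholder path
  certificate_chain = file("chain.pem")  # placeholder path
}
```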
Maybe AWS Global Accelerator would avoid this ACM limitation.
All that said, this is not a Terraform or IaC problem but a design-with-AWS problem.
@stuart-c: As far as I know, ACM validation doesn't distinguish between the primary and alternate names; it requires validation of all of them independently. Swapping those around won't make any difference.
For the sake of argument, let's assume you are correct about leaving the resource up: which TF configuration should own the record for the common domain name? The available answers are:
One of them, arbitrarily selected.
All of them.
None of them.
Now what happens when some of the certificates are to be deleted? Regardless of which of the above choices is selected, some cases go wrong:
If the arbitrarily selected cert is deleted, the others will fail to validate.
Regardless of which cert is deleted, the others will fail to validate.
The validation will remain active even after the last cert is deleted.
Have fun keeping things in sync
Using transient records avoids those issues, but trades them for a different set of problems when it comes time to renew. I'm personally more inclined towards that second set of problems, as it can be made more contained: the problem only exists when you see it.
Another point: if you choose to leave the records up, then by default anyone with access to ACM (via your account) can generate a cert including that domain, even if they don't have access to the DNS config. Depending on situational details, that may or may not be a problem.
Practically, the best solution here would be for AWS to define that the DNS name/value parts are always a one-to-one mapping: the only case where names match is where the values also match. With that, duplicate CNAME records are no longer a risk, and each cert's config can own its own copies of all the needed validations.
@ohmer actually the reason for the different certs is the expectation that the cert should exactly match the list of domains that could resolve to that server/ALB/endpoint/etc. Even with externally generated certs that can be installed in multiple regions, I’d still go with the same setup.
Also, there is a TF issue here: for things to work correctly, multiple different configurations, at various points in time, require the same resource to exist. Which config should own that resource? The ACM/CNAME case is just a concrete example, and I suspect not the only one.
The options for dealing with this case are, unless I’m missing one:
1. Create the resource transiently via TF, only while needed.
2. Allow "shared ownership" of the same resource between TF configs.
3. Figure out a way to create multiple "equivalent" resources that the external infra doesn't bother differentiating between.
#1 and #2 are the only ones that TF can actually do anything about. (And there are cases where #2 is the only viable option, e.g. where external constraints demand the resource must exist persistently. But that is a way more complicated ask, so I didn’t.)
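For what it's worth, the AWS provider's aws_route53_record already has an `allow_overwrite` argument that approximates option #2 on the create side: multiple configs can write the "same" record without erroring. It does nothing for the delete side, though; whichever config destroys first still removes the record out from under the others. A sketch (record name/value illustrative):

```hcl
# allow_overwrite lets this config take over an existing record instead
# of failing, which papers over the create-time conflict between configs
# but leaves the delete-time ownership question unanswered.
resource "aws_route53_record" "validation" {
  allow_overwrite = true
  zone_id         = "Z0000000000000000000" # placeholder hosted zone ID
  name            = "_3c1a2b.example.com"  # illustrative validation name
  type            = "CNAME"
  ttl             = 60
  records         = ["_9f8e7d.acm-validations.aws."] # illustrative value
}
```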
Which resource? The certs themselves? That's presumed to be "otherwise already known".
The CNAME records? That's the $1k question. It should be "as long as any cert, in any config, needs it". But that's a really hard problem to solve when you don't have any way of telling what other configs exist. I don't know how to solve that. Transient resources in this case are a workaround for that problem: rather than keeping them "as long as needed", they switch to deleting them "as soon as possible".
Of course that presumes the cert/CNAME case, and different reasoning would apply for other uses.
Or are you asking in general at what point in the apply cycle transient resources would be destroyed? I suspect that would require some tuning parameters, but I’d guess the default should be as a phase after all the permanent resources finish being created; all creates come before all deletes.
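To make the proposal concrete, here is a purely hypothetical syntax sketch. Nothing below exists in Terraform today; the `transient` block type and `destroy_after` argument are invented for illustration only.

```hcl
# HYPOTHETICAL: "transient" is not a real Terraform construct.
# The record would be created like a normal resource, then destroyed in
# a phase after all permanent resources finish being created.
transient "aws_route53_record" "validation" {
  zone_id = "Z0000000000000000000"           # placeholder
  name    = "_3c1a2b.example.com"            # illustrative
  type    = "CNAME"
  ttl     = 60
  records = ["_9f8e7d.acm-validations.aws."] # illustrative

  # Invented argument: delete only once these dependents are satisfied.
  destroy_after = [aws_acm_certificate_validation.cert]
}
```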
I understand your opinion. I believe you are wrong.
I assert that, as far as Terraform should be concerned, every problem is either a Terraform problem, or something that Terraform chooses not to try to solve. Neither of those is "the infrastructure's problem/limitation/etc".
What I'm considering here is a configuration problem. One that can be solved if no constraints are placed on how I interact with AWS. That puts it squarely in the scope of Terraform's claimed mandate. It may be off in a corner, but it's clearly in the class of problems that Terraform claims it is trying to solve.
If Terraform considers "dealing with valid design choices by infrastructure that it purports to support managing" as something it can decline to do by saying the problem is in someone else's infrastructure, then I suspect Terraform will not have a long-term future. (And limiting this to "valid choices" is arguably a mistake.) OTOH, acknowledging that it's a limitation of Terraform for which no acceptable solution has been found? That can be workable. (Something being your fault is good, because you can fix it. If someone else is at fault, you are dependent on them caring enough to fix it.)
Tools like Terraform don't have the option of deciding what people want to use them for, nor what people need to use them for. At best, they can choose which people will use them at all. Choose too narrowly and nobody uses it; too broadly and it becomes feature soup. The happy medium is to decide which users to turn away: because their use case adds too much complexity for the other users, or because there is another product that does a better job, or because that's not the class of problem being solved, etc.
For this specific case “as soon as possible” is still complex - for the normal case of wanting to be able to use the standard ACM feature of auto-renewals that is “forever”.
Even if you didn't want that, the moment the CNAME isn't needed could very easily be a time when Terraform isn't running. For example, assume a DNS zone is not yet delegated, and therefore isn't yet queryable by a normal resolver. Terraform can happily create the certificate and CNAME record, but the validation will not yet succeed. That part might time out. At a later point the zone delegation is completed, so AWS can now validate the certificate and issue it. At that point the CNAME is "no longer needed" for this validation process, but Terraform isn't running.
I refer you back to the use of aws_acm_certificate_validation and the idiom of “keep looping on apply until it becomes a no-op”. Between those, “as soon as possible” becomes well defined and reasonably simple to accomplish.
The same pattern would also work with temporary resources and provider-implemented provisioners: if things timed out, they would be left in the last state, and a subsequent apply would continue from that state. The main difference is that with these features, ignoring timeouts and other transient failures, success happens in a single apply rather than after many. Though I'd still re-apply till I get a no-op, as validation.
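The apply-till-no-op idiom can be wrapped in a small shell helper. This is a sketch: `converge` is a made-up function name, and it relies on `terraform plan -detailed-exitcode` exiting 0 for "no changes" and 2 for "changes pending".

```shell
# converge: re-run an apply step until a plan step reports no pending
# changes. The plan command must exit 0 for "no changes" and 2 for
# "changes pending", as `terraform plan -detailed-exitcode` does.
converge() {
  apply_cmd=$1
  plan_cmd=$2
  max_attempts=$3
  i=0
  while [ "$i" -lt "$max_attempts" ]; do
    $apply_cmd || true           # tolerate transient failures; we retry
    $plan_cmd && return 0        # exit 0: no-op plan, we have converged
    rc=$?
    [ "$rc" -ne 2 ] && return 1  # anything but "changes pending" is fatal
    i=$((i + 1))
  done
  return 1                       # gave up before converging
}

# Hypothetical real-world use (not executed here):
#   converge "terraform apply -auto-approve" \
#            "terraform plan -detailed-exitcode" 5
```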
Side note: what is the accepted practice for dealing with configurations that need to react to changes outside the control of Terraform? Say, the result of data lookups changing? The best solution I'm aware of would be to schedule periodic applies, which would be mechanically very similar to the apply-till-no-op from above.
That sounds a little harsh, in addition to other very assertive statements, though I have no problem agreeing to disagree… I'm actually happy we have disagreed; it challenges my thinking. I will try to find some time to support my idea with a more practical HCL example and might learn something in the process. I try to stick to evaluating ideas and proposals rather than the person, but your use case is interesting to me, so let me throw out a couple more ideas, hoping these will sound valid or invalid to you.
Periodic applies via continuous deployment or a scheduled script are a common practice in my experience. Terraform is designed to run in automation, though you can break this depending on how your environment is set up. An example is forcing MFA on the credentials used by the pipeline/cron job: the underlying AWS Go SDK will request an MFA token on standard input and fail. Some folks also use local providers, which work fine on a CLI but may fail in a pipeline (you might have no write permissions on $CWD and try to write a template file).
Periodic plans are also common, to detect drift. In a non-enforced GitOps world, manual modification can happen. Depending on how you write your templates, some drift may exist on the provider side but not for Terraform. An example of that is describing an AWS security group and its rules as separate resources. If somebody adds a new rule in the console, there is no drift for Terraform: separate group/rule resources describe the presence of a security group and a rule, but do not assert that the rule is the only one in the group. driftctl does that and other things.
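The security-group example looks like this. With the standalone rule resource, Terraform only asserts "this rule exists", not "these are the only rules", so a rule added in the console is invisible to plan (a sketch; the VPC ID is a placeholder):

```hcl
resource "aws_security_group" "web" {
  name   = "web"
  vpc_id = "vpc-00000000" # placeholder
}

# Standalone rule: Terraform tracks only this rule's existence. A rule
# added manually in the console belongs to no resource below, so
# `terraform plan` reports no drift for it.
resource "aws_security_group_rule" "https_in" {
  security_group_id = aws_security_group.web.id
  type              = "ingress"
  protocol          = "tcp"
  from_port         = 443
  to_port           = 443
  cidr_blocks       = ["0.0.0.0/0"]
}
```

By contrast, inline `ingress` blocks on aws_security_group are authoritative for the whole group, so a manually added rule would show up as drift on the next plan.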
driftctl sounds like an interesting project. It sounds like an attempt to solve at least part of a problem I’ve worked with before. I’ll have to look into it.
BTW: I didn't intend my disagreement personally; if you understood it that way, I apologize. Also, the issue I was disagreeing with you about is not the technical issues here, it's the mindset of who gets to define what is to be done: the product owner or its users? Interestingly, I've seen the same mindset (exaggerated: "this is how things should work and we expect the world to conform to our opinion") in a number of projects associated with the Go programming language, including Go itself. Personally I think anyone who thinks they can foresee all valid uses of a product is fooling themselves. And anyone who unnecessarily limits their offering to the uses they foresee is hamstringing themselves.
Ultimately it is the product owner/developers who decide what are the valid uses of the product they are creating.
That doesn’t mean there might not be other possible uses, but the creators get to decide which of those to ignore, either explicitly (we don’t want Terraform to ever be able to do X) or not (we’d like to be able to tackle that but don’t currently have the time).
As a user you can potentially influence a project, but you generally have no method to control what others decide to do. At least with Open Source projects you have the right (assuming you also have the time & technical ability) to fork a project to go in a different direction.
Most projects are opinionated in one way or another (some strongly), as from a developer's perspective having a well-defined scope and way of approaching things makes things a lot easier, with the downside that those not agreeing or wanting other things may be out of luck.
(I wouldn’t say things are normally “we expect the world to conform to our opinion” as there is generally no requirement to use a particular tool, but instead “it works this way/handles these cases and if you need anything different we might be open to discussions/contribution, but ultimately we may decide such other ideas aren’t going to be implemented/maintained by us, but you are free to go your own way”)
That is without question true. However, some choices those owners could make are better than others. If TF chose to try to add making coffee and mixed drinks to its functionality (despite how critical those are to some teams getting things done), I don't think many would consider that a good idea. Similarly, TF choosing not to support resources whose ID can't be chosen in advance would clearly be a bad idea.
Again true, but I do have a very effective way of controlling what the tools I use do: by controlling which tools I use.
And yet again true… but in my experience the scope of what the clients of a product need to do is not really under anyone’s control; it sort of just happens to everyone. If a product makes choices that are too restrictive then most clients will end up needing to work around those restrictions sooner or later.
(Further; the impression I’ve gotten working with things is that choices around restriction have an interesting similarity to Turing completeness: the system can either do almost everything, or almost nothing. And when you have a system that can do almost everything, and is designed to do 99% of what you need to do, people tend to figure out how to hack that last 1% out, despite what the maintainers would wish, and that generally ends up frustrating for all involved.)
How about a real-world example of a case where a product tried to make a choice of the type I'm saying is a bad idea? When protocol buffers added oneof a while back, the maintainers of the Go implementation looked at its semantics and noticed that it was impossible to implement using POD structs the way all prior protobuf implementations had been. So rather than implement that feature, they posted a notice saying they had chosen not to. I never found out how, but a few weeks later, oneof was added to the Go implementation. The mistake made by the maintainers was assuming that they could choose what features their users needed; they can't. In many cases maintainers can choose what features are provided, but not what people need. (In this case I expect there were others in positions of authority who could and did dictate that what was needed would be offered.)
In summary, my view is that most users will have a small but non-zero set of uses that are outside what the maintainers initially intend to support. Limiting the scope of the product to eliminate those needs will cascade into removing most use cases. If you don't eliminate those uses, people will "solve" them anyway. The best solution IMHO is to strive for a balance that minimizes the added complexity seen when dealing with the common case while maximizing the ability of the product to deal with new and novel uses in a sane and contained way (e.g. golang's unsafe).