Say you use Vault to generate temporary (and therefore expiring) credentials.
Whatever cloud provider or platform you are using, and therefore whatever Terraform provider, if a single resource creation takes longer than the credentials' expiration time, the next instruction will fail and the state will not be updated. Retrying the apply phase will most likely fail with “resource already exists”.
To reproduce this case, use AWS with a TTL of 20 minutes for the Vault temporary credentials and try to provision an Elasticsearch domain.
The first time terraform apply is run, I get:
aws_elasticsearch_domain.logs[0]: Still creating... [18m50s elapsed]
Error: waiting for Elasticsearch Domain (arn:aws:es:eu-west-3:981467355511:domain/logs-baba1-964a3684-a593-46) create: AccessDeniedException: User: arn:aws:iam::981467355511:user/vault-root-own-creds-role-1694024302-PWBwWzosf8JBCnQnFOZZ is not authorized to perform this operation
on logging.tf line 35, in resource "aws_elasticsearch_domain" "logs":
35: resource "aws_elasticsearch_domain" "logs" {
The second time terraform apply is run, I get:
aws_elasticsearch_domain.logs[0]: Creating...
Error: Elasticsearch Domain (logs-baba1-964a3684-a593-46) already exists
on logging.tf line 35, in resource "aws_elasticsearch_domain" "logs":
35: resource "aws_elasticsearch_domain" "logs" {
What is also strange is that if, after the first or second run, you run terraform destroy, the resource is properly removed, which seems inconsistent with what was just described above.
Could someone please shed some light on this?
PS: I have searched and read a lot about similar subjects, some of which also mention using terraform refresh, but nothing seems to address this issue as a whole.
Many thanks
Hi @olivier.bourdon,
The specific situation you described here seems like a subtle bug in the hashicorp/aws provider’s implementation of aws_elasticsearch_domain.
If the subsequent apply complains that the object already exists then that suggests that the first apply at least partially succeeded. If that is true then the correct provider behavior would be to return an object describing the partial state, which Terraform Core would then retain until your second plan and apply so that Terraform can react to the incomplete operation, typically by proposing to destroy the incomplete object and create a new one in its place.
To recover from this incorrect situation I think the best approach would be to manually approximate the result the provider ought to have created here, by first importing the existing object into Terraform and then marking it as damaged (“tainted”) so Terraform knows to distrust its state on the next run:
terraform import "aws_elasticsearch_domain.logs[0]" "logs-baba1-964a3684-a593-46"
terraform taint "aws_elasticsearch_domain.logs[0]"
In the above I’ve assumed that the logs-baba1-964a3684-a593-46 string in the error message is your specified “domain name” for this object, which is what this resource type expects as an import id.
After running the above you should be able to run a normal plan and apply and see Terraform propose to replace the incomplete object with a fresh one. If your credentials live long enough on the second run then it should hopefully succeed.
I’d also suggest reporting the original bug in the AWS provider’s GitHub repository, and then hopefully the maintainers can improve it to handle this incomplete creation situation automatically rather than requiring this manual workaround.
Many thanks for this quick answer. I will definitely have a look into the Terraform provider, as I have already contributed some bug fixes to it. As this seemed more like an issue with state, and having seen other similar issues, I (wrongly) thought it was related to Terraform Core rather than the AWS provider. I will keep this thread posted on further findings.
Many thanks again
After digging into the AWS provider code, and also looking more closely at the state obtained after the first error (creating the ES domain with credentials that expire during the process), I found that the state already seems to contain the proper information:
{
  "mode": "managed",
  "type": "aws_elasticsearch_domain",
  "name": "logs",
  "provider": "provider[\"registry.terraform.io/hashicorp/aws\"]",
  "instances": [
    {
      "index_key": 0,
      "status": "tainted",
      "schema_version": 0,
      ...
      "domain_name": "logs-baba1-964a3684-a593-46",
This tends to explain why the terraform destroy step works and cleans up the corresponding resource.
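For anyone wanting to check the same thing without reading the raw state by hand, here is a minimal sketch that lists tainted instances. It assumes jq is installed and that the state uses the version 4 JSON layout shown in the excerpt above (a "resources" array whose "instances" carry a "status" field). The function name list_tainted is hypothetical, and module prefixes are not included in the printed addresses.

```shell
# list_tainted: read a Terraform state (JSON, format version 4) on stdin and
# print the address of every managed resource instance marked as tainted.
# Assumption: jq is available; resources inside modules are printed without
# their module prefix.
list_tainted() {
  jq -r '
    .resources[]?
    | select(.mode == "managed")
    | . as $r
    | .instances[]?
    | select(.status == "tainted")
    | "\($r.type).\($r.name)"
      + (if .index_key == null then "" else "[" + (.index_key | tojson) + "]" end)
  '
}

# Usage (not run here): terraform state pull | list_tainted
```

With the state from this thread, this would be expected to print the aws_elasticsearch_domain.logs[0] address, matching the "status": "tainted" entry above.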
Please also note that the Terraform version in use is 0.14.11 and the AWS provider version is 4.67.0.
I will keep digging on my side, but in the meantime any further clues would be welcome.
@apparentlymart
I went a bit further in my trials and added the following to the ES domain creation section of the Terraform code:
--- a/terraform/templates/aws/0.14/logging.tf.tpl
+++ b/terraform/templates/aws/0.14/logging.tf.tpl
@@ -68,6 +68,10 @@ resource "aws_elasticsearch_domain" "logs" {
     UserUid = var.sqsc_user_uid
     Domain = local.es_logs_domain
   }
+
+  timeouts {
+    create = "15m"
+  }
 }
which does indeed produce the expected timeout on the first terraform apply call:
aws_elasticsearch_domain.logs[0]: Still creating... [15m0s elapsed]
Error: waiting for Elasticsearch Domain (arn:aws:es:eu-west-3:981467355511:domain/logs-baba1-964a3684-a593-46) create: "logs-baba1-964a3684-a593-46": Timeout while waiting for the domain to be created
on logging.tf line 35, in resource "aws_elasticsearch_domain" "logs":
35: resource "aws_elasticsearch_domain" "logs" {
This time, the Terraform timeout expires before the AWS temporary credentials do.
Checking the generated Terraform state shows exactly the same contents as before, i.e. the resource is present in a tainted state.
Relaunching terraform apply produces the exact same error. It therefore does indeed seem to be a bug in the Terraform AWS provider (4.67.0) when creating ES domains.
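As a side note, the relationship between the create timeout and the credential TTL can be checked before running apply. Below is a minimal sketch, assuming both durations are known up front and written as a single integer plus one unit (s, m or h), such as the 15m timeout and 20m TTL used in this thread. The helper names to_seconds and timeout_fits_ttl, and the default 60s safety margin, are hypothetical.

```shell
# to_seconds: convert a simple duration such as "90s", "15m" or "2h" to seconds.
# Assumption: only one integer followed by one unit is supported.
to_seconds() {
  value="${1%[smh]}"     # strip a trailing unit, if any
  unit="${1#"$value"}"   # whatever was stripped (empty when no unit was given)
  case "$value" in
    ''|*[!0-9]*|0[0-9]*) echo "unsupported duration: $1" >&2; return 1 ;;
  esac
  case "$unit" in
    s) echo "$value" ;;
    m) echo "$((value * 60))" ;;
    h) echo "$((value * 3600))" ;;
    *) echo "unsupported duration: $1" >&2; return 1 ;;
  esac
}

# timeout_fits_ttl TIMEOUT TTL [MARGIN]
# Succeeds (exit 0) when TIMEOUT plus MARGIN is no longer than TTL, exits 1 when
# it is longer, and exits 2 when a duration cannot be parsed.
timeout_fits_ttl() {
  t=$(to_seconds "$1") || return 2
  ttl=$(to_seconds "$2") || return 2
  margin=$(to_seconds "${3:-60s}") || return 2
  [ "$((t + margin))" -le "$ttl" ]
}

# Usage: timeout_fits_ttl 15m 20m 2m && terraform apply
```

This only guards against the credentials expiring in the middle of a wait. It does not change how the provider handles the tainted domain on the next apply.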