Terraform plan indicating Cycle error, but no cycles in graphviz?

When I run terraform plan, I get:

╷
│ Error: Cycle: aws_instance.ept01, aws_instance.ept01 (expand)
│
│
╵
╷
│ Error: Cycle: aws_instance.est01, aws_instance.est01 (expand)
│
│
╵

When I run terraform graph -draw-cycles -type=plan and paste the output into graphviz online, there aren’t any cycles?

terraform --version
Terraform v1.8.2
on linux_amd64
+ provider registry.terraform.io/hashicorp/aws v5.47.0

What would be the next step to determine why terraform thinks there’s a cycle?

Thanks!

Hi @rkulagowski1,

Sorry this error message isn’t very clear. Unfortunately it’s talking about a problem in terms of Terraform’s implementation details, so interpreting it requires some knowledge about how Terraform works internally.

The important information here is that “aws_instance.est01 (expand)” is a graph node that represents the resource block itself, before Terraform knows how many instances of that resource exist. The graph node “aws_instance.est01”, on the other hand, represents the one instance of that resource that results from the resource block using neither count nor for_each.

Therefore it seems like what’s going on here is that for some reason Terraform has concluded that the “expansion” of the resource cannot be decided before evaluating the individual instance, but it also can’t evaluate the individual instance before knowing that the individual instance exists, so these two operations are mutually-dependent and there’s no valid order.

This is not a situation I’ve encountered before, and so I’m not sure exactly what to suggest. If you could share the source code of your resource "aws_instance" "ept01" block then hopefully I can use it to understand why Terraform is reaching this strange conclusion.

Here’s what the Terraform configuration looks like. Please note that this was initially deployed with Terraform, so these instances already exist. The change that we’re trying to make is in the security groups, so I’m not sure why we’re getting cycle errors all of a sudden.

resource "aws_instance" "ept01" {
  ami                  = data.aws_ami.ubuntu2204.id
  key_name             = trimsuffix(var.aws_key_name, ".pem")
  instance_type        = "t4g.micro"
  iam_instance_profile = var.iam_instance_profile
  hibernation          = false
  lifecycle {
    ignore_changes = [ami, ebs_optimized, user_data]
  }

  credit_specification {
    cpu_credits = var.cpu_credits
  }

  ebs_optimized = true

  vpc_security_group_ids = [
    var.remote_access_sg,
    var.infrastructure_management_sg,
    aws_security_group.MarketingEC2Access.id,
  ]

  subnet_id                   = "subnet-09fc50xxxxxxxxxxxxxxxx"
  associate_public_ip_address = false

  provisioner "remote-exec" {
    inline = [
      "sleep 3m", # When we do the initial connect, the instance is still bootstrapping its own apt update and you'll see errors if you don't wait.
      "sudo apt update",
      "sleep 1m",
      # "sudo DEBIAN_FRONTEND=noninteractive apt upgrade -y",
      # "sleep 5m", # Smaller instances take longer to do things.
      # "sudo apt autoremove -y",
      # "sleep 1m",
      "sudo hostnamectl set-hostname ept01",
      "sudo apt install joe zip unzip htop php php-sqlite3 php-soap php-curl -y",
      "sleep 3m",
      "sudo service apache2 stop",
      "mkdir /home/ubuntu/planning",
    ]

    connection {
      type        = "ssh"
      host        = aws_instance.ept01.private_ip
      user        = "ubuntu"
      private_key = file("../${var.aws_key_name}")
    }
  }

  provisioner "file" {
    source      = "planning/"
    destination = "/home/ubuntu/planning"

    connection {
      type        = "ssh"
      host        = aws_instance.ept01.private_ip
      user        = "ubuntu"
      private_key = file("../${var.aws_key_name}")
    }
  }

  # By removing the index.html the index.php will automatically be used because of apache2 dir.conf
  provisioner "remote-exec" {
    inline = [
      "sudo tar --strip-components=1 -xzvf /home/ubuntu/planning/ept.tgz -C /var/www/html",
      "sudo rm /var/www/html/index.html",
      "sudo chmod a+rwx -R /var/www/html",
      "sudo phpenmod curl",
      "sudo phpenmod soap",
      "sudo a2enmod remoteip",
      "sudo a2enmod headers",
      "sudo service apache2 start",
    ]

    connection {
      type        = "ssh"
      host        = aws_instance.ept01.private_ip
      user        = "ubuntu"
      private_key = file("../${var.aws_key_name}")
    }
  }

  root_block_device {
    volume_type           = "gp3"
    volume_size           = 32
    delete_on_termination = true

    tags = merge(local.tags, {
      Name            = "ept01"
      Device          = "/dev/sda1"
      Label           = "Root"
      MakeSnapshot    = "False"
      MountPoint      = "/"
      "cpm backup"    = "DailySsComNonProd#initial-ami#app-aware WeeklySsComNonProd#initial-ami#app-aware"
      Application     = "Email Planning Tool"
      ApplicationRole = "Email Planning Tool"
    })
  }

  tags = merge(local.tags, {
    "cpm backup"    = "DailySsComNonProd#initial-ami#app-aware WeeklySsComNonProd#initial-ami#app-aware"
    Name            = "ept01"
    Application     = "Email Planning Tool"
    ApplicationRole = "Email Planning Tool"
  })
}

output "ept01_IP" {
  value = aws_instance.ept01.private_ip
}

resource "aws_route53_record" "ept01" {
  zone_id = "Z03881892DRXXXXXXXXXXXX"
  name    = "ept01.mkt-nonprod.example.com"
  type    = "A"
  records = [aws_instance.ept01.private_ip]
  ttl     = 60
}

Hi @rkulagowski1! Thanks for sharing the configuration.

In each of your aws_instance resource blocks you have provisioner connection blocks that refer to the same resource that they are declared in, which I think might be confusing Terraform because the resource therefore appears to refer to itself.

In provisioner and connection blocks you’re supposed to use self to refer to the instance of the resource that’s currently being provisioned. For example:

    connection {
      type        = "ssh"
      host        = self.private_ip
      user        = "ubuntu"
      private_key = file("../${var.aws_key_name}")
    }

I think the trick here was that aws_instance.ept01 refers to “all instances of this resource”, which in this case is just the one singleton instance, but nonetheless Terraform thinks the connection configuration can’t be evaluated until Terraform has decided how many instances of this resource exist (which is what the “expand” node does), because that result decides the type and value of aws_instance.ept01.

I’ve shown just one connection block here as an example, but the same applies to every other connection block that refers to its own resource by name.
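As a possible way to avoid repeating the same connection settings in every provisioner, here is a minimal sketch that declares the connection block once at the resource level, so all provisioners in that resource share it. It is a trimmed-down version of the resource from the question, not a drop-in replacement: most arguments are omitted and only two of the provisioners are kept for illustration.

```hcl
# Sketch only: most arguments from the original resource are omitted.
resource "aws_instance" "ept01" {
  ami           = data.aws_ami.ubuntu2204.id
  instance_type = "t4g.micro"

  # Declared once at the resource level, so every provisioner below uses it.
  connection {
    type        = "ssh"
    host        = self.private_ip # "self" is the instance being provisioned
    user        = "ubuntu"
    private_key = file("../${var.aws_key_name}")
  }

  provisioner "remote-exec" {
    inline = [
      "sudo hostnamectl set-hostname ept01",
      "mkdir /home/ubuntu/planning",
    ]
  }

  provisioner "file" {
    source      = "planning/"
    destination = "/home/ubuntu/planning"
  }
}
```

A connection block nested inside an individual provisioner still overrides the resource-level one, so a single provisioner can use different settings if it needs to.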

Switching to “self” worked, but this seems like something new: the instances were launched with a previous version of the AWS provider and Terraform v1.7.3, and that didn’t cause an issue. Should I raise this as a documentation bug or a Terraform bug, or should I just tell our internal teams that, effective now, anyone using a provisioner should switch to “self” because the old way wasn’t a best practice?

Thanks for helping track this down.

Hi @rkulagowski1,

I’m afraid that if this was working in an earlier version of Terraform then I’m not sure why it was working or when that changed.

The trick here, I think, is that the provisioner and connection blocks are treated as belonging to individual instances of a resource because when a resource uses count or for_each they must be evaluated separately for each one. In that case a reference to the entire resource, rather than just to the current instance using self, would cause all of the instances to depend on one another because the value of the expression consists of all of the instances together as a single tuple or object.
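To make that concrete, here is a small sketch of a resource that uses count. The resource name "web" and the variables ami_id and private_key_path are hypothetical names chosen for illustration; they are not part of the configuration discussed in this thread.

```hcl
variable "ami_id" { type = string }           # hypothetical input
variable "private_key_path" { type = string } # hypothetical input

resource "aws_instance" "web" {
  count         = 2
  ami           = var.ami_id
  instance_type = "t4g.micro"

  # This provisioner is evaluated separately for each of the two instances.
  provisioner "remote-exec" {
    inline = ["sudo hostnamectl set-hostname web${count.index}"]

    connection {
      type        = "ssh"
      user        = "ubuntu"
      private_key = file(var.private_key_path)

      # Refers only to the instance currently being provisioned.
      host = self.private_ip

      # By contrast, aws_instance.web[count.index].private_ip would first
      # need the whole aws_instance.web tuple, that is, every instance of
      # the resource, which is the self-dependency described above.
    }
  }
}
```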

I can imagine this might have worked in Terraform v0.11 and earlier because they didn’t yet support referring to the full set of instances of a resource as a single value. The generalization made in Terraform v0.12 and later to allow (for example) passing the full set of instances of a resource at once to another module caused Terraform to treat an expression like aws_instance.ept01.private_ip as referring to the resource itself, rather than to an individual instance of it (even if there happens to be only one, as in this case). Prior to Terraform v0.12 there was no such distinction, and so this may have accidentally worked back then.
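As a rough illustration of what “referring to the full set of instances as a single value” looks like in v0.12 and later, consider the sketch below. The resource names "web" and "worker" and the variable ami_id are hypothetical and exist only for this example.

```hcl
variable "ami_id" { type = string } # hypothetical input

resource "aws_instance" "web" {
  count         = 2
  ami           = var.ami_id
  instance_type = "t4g.micro"
}

resource "aws_instance" "worker" {
  for_each      = toset(["a", "b"])
  ami           = var.ami_id
  instance_type = "t4g.micro"
}

# With count, aws_instance.web is a tuple of instance objects.
output "web_private_ips" {
  value = aws_instance.web[*].private_ip
}

# With for_each, aws_instance.worker is an object keyed by the for_each keys.
output "worker_private_ips" {
  value = { for key, inst in aws_instance.worker : key => inst.private_ip }
}
```

In both outputs the expression starts from the resource as a whole, so Terraform has to know how many instances exist before it can evaluate it. That is harmless in an output or in another resource, but not inside the resource's own provisioner or connection blocks.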

This having worked in Terraform v1.7 surprises me more because that came long after Terraform v0.12, but if you can confirm that this did work in v1.7.3 and no longer works in v1.8 then this may be an example of a late-reported regression (one of the considerations for our v1.x Compatibility Promises), in which case we’ll need to decide how best to resolve it. If you can show that to be true, please open a bug report issue in the Terraform CLI/Core repository to share it with the team that maintains Terraform Core.

OK - I have a state file which indicates terraform v1.5.0, so I used that as my test.

If I run terraform plan with terraform v1.5.7: no cycle error.
Anything beyond v1.5.7 generates the cycle error.

I take it that this means that I should report the error?

Thanks for testing that!

Indeed, if you would like the Terraform Core team to investigate what changed here, please open a GitHub issue with some information on how you tested this, so the team can retrace your steps.

Since this happened back in the v1.5 series, this is even more of a “late-reported regression” than I originally thought, so in this case the team will need to determine what caused this change in behavior and decide what best to do about it.

I assume this regressed because this wasn’t a known-working pattern (so it’s not covered by any tests) and because some new feature was added that caused Terraform to treat the situation differently. One possible outcome then is that continuing to support that new feature is more important than restoring the accidentally-broken behavior, because either way this would be a breaking change for someone and so the team must make a judgement call about which breakage is likely to have the most impact.

Hopefully though the team can find a compromise that allows fixing the accidental regression without simultaneously causing another one. It’s hard to say what the situation is without some further investigation.