Terraform hangs, but gives no indication why

I’ve got an issue where terraform apply never times out, and when it’s cancelled, no reason is given. The rest of the Terraform configuration is created fine; it’s just this one load balancer listener rule. Other listeners are created and the target group is created OK, but this one rule just hangs.

Output:

Terraform will perform the following actions:

  # module.login_lambda.aws_lb_listener_rule.this_https will be created
  + resource "aws_lb_listener_rule" "this_https" {
      + arn          = (known after apply)
      + id           = (known after apply)
      + listener_arn = "arn:aws:elasticloadbalancing:eu-west-2:XXXXXXX:loadbalancer/app/XXXXX/f5820bda91b3d2e2"
      + priority     = 10
      
      + action {
          + order            = (known after apply)
          + target_group_arn = "arn:aws:elasticloadbalancing:eu-west-2:XXXXX:targetgroup/XXXXXX/8da3d82954fedc6f"
          + type             = "forward"
        }

      + condition {
          + path_pattern {
              + values = [
                  + "/account/login",
                ]
            }
        }
    }

Plan: 1 to add, 0 to change, 0 to destroy.
module.login_lambda.aws_lb_listener_rule.this_https: Creating...
module.login_lambda.aws_lb_listener_rule.this_https: Still creating... [10s elapsed]
...
module.login_lambda.aws_lb_listener_rule.this_https: Still creating... [2m20s elapsed]
Stopping operation...

Interrupt received.
Please wait for Terraform to exit or data loss may occur.
Gracefully shutting down...

╷
│ Error: execution halted
│
│
╵
╷
│ Error: execution halted
│
│
╵
╷
│ Error: creating ELBv2 Listener Rule: operation error Elastic Load Balancing v2: CreateRule, request canceled, context canceled
│
│   with module.login_lambda.aws_lb_listener_rule.this_https,
│   on modules\lambda\main.tf line 68, in resource "aws_lb_listener_rule" "this_https":
│   68: resource "aws_lb_listener_rule" "this_https" {

Hi @neil.scales,

I’m not familiar with this specific resource type, so what I’m going to say here is a broader answer about how providers and AWS services tend to behave; I can’t know for certain exactly what applies to this situation.

AWS services are often implemented with an “eventually consistent” design, where a successful result from a “create” API call followed immediately by another operation on that same object can still fail, so the client needs to wait a little for the remote change to be fully committed and then try again.

The AWS provider compensates for that in a few different ways, depending on what a particular service API allows. In the worst case an API doesn’t provide any reliable signal for when consistency has been reached, so there is no way to distinguish “object does not exist” from “object isn’t fully committed yet”. The AWS provider tends to work around those situations by politely retrying, on the assumption that any error means “object isn’t fully committed yet”, with the downside that if the object really doesn’t exist then the operation appears to hang in the way you saw here, rather than returning a meaningful error.

I looked into the provider source code, and the create logic for this resource seems to follow something like that “worst case” pattern, though a little more selectively.

The way I understand that code, it will keep retrying for up to five minutes, but only if the error is specifically “priority in use”. It treats all other errors as non-retryable, so this at least narrows down what kind of error might be happening.

Therefore my first guess would be that you already have another rule that conflicts with this one, so the API is returning the “priority in use” error and the AWS provider is assuming it’s something that might resolve itself on retrying, perhaps because for this API there is some delay after changing an existing rule’s priority before the old priority becomes available for use again.
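
To make that concrete, here’s the kind of configuration I’m imagining; the resource names and references are invented for illustration, not taken from your module. Two rules on the same listener asking for the same priority would make the second CreateRule call return a “priority in use” error, which the provider then treats as retryable:

resource "aws_lb_listener_rule" "login" {
  listener_arn = aws_lb_listener.https.arn   # hypothetical listener reference
  priority     = 10

  action {
    type             = "forward"
    target_group_arn = aws_lb_target_group.login.arn
  }

  condition {
    path_pattern {
      values = ["/account/login"]
    }
  }
}

resource "aws_lb_listener_rule" "logout" {
  listener_arn = aws_lb_listener.https.arn   # same listener...
  priority     = 10                          # ...same priority, so CreateRule reports a conflict

  action {
    type             = "forward"
    target_group_arn = aws_lb_target_group.logout.arn
  }

  condition {
    path_pattern {
      values = ["/account/logout"]
    }
  }
}

If the provider treats that conflict as retryable, the symptom would look exactly like the silent “Still creating…” loop you saw, rather than an immediate error.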

The provider also attempts to read back the object it’s just created and will retry for up to 1 minute for that to succeed, but the error message mentions the CreateRule action in particular, so I don’t think this error would come from that second retry loop because that uses the DescribeRules action.

I hope that helps! I also moved this topic into the AWS provider category because the folks there are more likely to be familiar with specific AWS services, and so hopefully someone else can suggest something more specific than I can.

Thanks for the detailed response. I’ve modified the priority to something well away from anything else, and it still hangs with the same non-error. I’ve just tried waiting for 16 minutes and it still doesn’t return.

If there’s anything else I can do to help figure this out, please ask, as this is going to be a big problem for us if it’s not fixable.

I’ve also tried using an existing priority, and that doesn’t fail with any sort of duplicate error.
Manually creating the listener rule (via the AWS website) completes immediately, so it doesn’t appear to be something that takes time and requires retrying.

Hi @neil.scales,

Perhaps you could try running this operation with the environment variable TF_LOG_PROVIDER=trace set, which should hopefully emit more detailed logs about what exactly the provider is doing.

I think the AWS provider at least produces a log entry for each API request it makes, so hopefully you will see it repeatedly polling some API action and getting back a specific error, which might then give a clue as to where things are getting stuck.

FYI I found the problem. I was passing in the ARN of the load balancer, rather than the ARN of the load balancer listener, when trying to create a listener rule.
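
In Terraform terms the mix-up looked roughly like this. The names around it are made up here (in the real module the ARN is passed in as a variable), but resource "aws_lb_listener_rule" "this_https" is the one from the plan above, and the listener_arn line is the part that matters:

resource "aws_lb_listener_rule" "this_https" {
  # Wrong (what I had): the load balancer's own ARN, e.g.
  #   listener_arn = aws_lb.this.arn
  # Right: the ARN of the listener attached to that load balancer
  listener_arn = aws_lb_listener.https.arn

  priority = 10

  action {
    type             = "forward"
    target_group_arn = aws_lb_target_group.login.arn
  }

  condition {
    path_pattern {
      values = ["/account/login"]
    }
  }
}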

Maybe it’s a problem with the AWS API rather than with Terraform or the provider, since the API seemingly accepted the wrong kind of ARN instead of rejecting it.

Anyway, my problem is now sorted.