Resource Destroy Behaviour Issue

Hi, I have encountered a strange issue and am reaching out to the community to see if anyone has seen something similar and knows what the fix is.

Quick overview of my setup:

  1. Two providers targeting two separate Azure subscriptions
  2. Two child modules (one for each Azure subscription), instantiated from a parent module
  3. The parent module specifies a remote backend which maintains state for all resources in both subscriptions
  4. The provision step is triggered by a GitHub Actions workflow
  5. The destroy step is triggered by a GitHub Actions workflow

Issue summary:

  1. Provisioning succeeds correctly for all resources in both subscriptions.

  2. If I add a terraform destroy at the end of the provisioning workflow, the entire infrastructure builds and destroys correctly. There are no issues when both apply and destroy are run in the same GitHub Actions workflow.

  3. If I provision successfully using one workflow and then trigger the destroy using a dedicated destroy workflow in GitHub Actions, the following happens:

    • All infrastructure from subscription one is destroyed successfully.
    • The remote state file silently loses all resources in the second subscription, with the exception of the Windows virtual machine's network interface object.
    • The Azure subscription that doesn't destroy correctly contains:
      - A resource group
      - A VM network interface
      - Two virtual network resources
      - One subnet resource linked to a virtual network
      - One route table resource
      - One Windows VM resource
    • At the end of the successful provisioning step, the remote state file has all seven resources for the second subscription. Something then removes the Windows VM, route table, subnet, virtual network and resource group objects from the remote state file. The local state file has all the objects before the destroy is triggered by GitHub Actions.
    • The destroy errors with:
Error: deleting Network Interface (Subscription: "--redacted--"
│ Resource Group Name: "isvc-core-demo-rg"
│ Network Interface Name: "azselktclidemo01-nic"): performing Delete: unexpected status 400 with error: NicInUse: Network Interface /subscriptions/--redacted--/resourceGroups/isvc-core-demo-rg/providers/Microsoft.Network/networkInterfaces/azselktclidemo01-nic is used by existing resource /subscriptions/--redacted--/resourceGroups/isvc-core-demo-rg/providers/Microsoft.Compute/virtualMachines/azselktclidemo. In order to delete the network interface, it must be dissociated from the resource. To learn more, see aka.ms/deletenic.

This error makes sense on its own: the state file suggests that only a network interface resource exists in Azure, but all the other resources still exist, so the NIC is in use and therefore can't be destroyed. Expected.
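
For completeness, the quickest way I know to confirm the "missing" resources really are still sitting in Azure is a plain Azure CLI listing against the problem subscription, something like the following (the subscription id is just a placeholder here):

# Lists everything that still physically exists in the resource group
az resource list --resource-group isvc-core-demo-rg --subscription "<second subscription id>" --output table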

What I can't understand or answer is what is causing the state file to lose knowledge of the other resources in the second subscription while still keeping knowledge of the network interface resource.

The GitHub Actions workflow triggering the destroy is as follows:

name: Terraform Destroy All
on:
  workflow_dispatch:

jobs:
  deprovision:
    runs-on: demo-env-only
    environment: demo
    steps:
      - name: Execute terraform destroy on stage2-base infrastructure
        run: |
          cd Terraform/applications/elk/environments/${{ vars.ENVIRONMENT }}/stage0/demo/infrastructure
          export "ARM_CLIENT_ID=${{ secrets.AZURE_CLIENT_ID}}"
          export "ARM_CLIENT_SECRET=${{ secrets.AZURE_CLIENT_SECRET}}"
          export "ARM_TENANT_ID=${{ secrets.AZURE_TENANT_ID }}"
          export "ARM_SUBSCRIPTION_ID=${{ secrets.AZURE_SUBSCRIPTION_ID }}"
          export "AZURE_STORAGE_ACCOUNT=${{ secrets.AZURE_STORAGE_ACCOUNT }}"
          export "TF_VAR_environment=${{ vars.ENVIRONMENT }}"
          terraform init \
            -backend-config="storage_account_name=${AZURE_STORAGE_ACCOUNT}"
          #terraform state list
          terraform destroy --auto-approve
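
As a side note, this is the variant of the destroy step I've been thinking about using so that the exact set of addresses the destroy intends to remove is captured in the logs before anything runs. It's only a rough sketch and it assumes jq is available on the self-hosted runner:

# (same cd / export block as in the destroy step above, then:)
terraform init -backend-config="storage_account_name=${AZURE_STORAGE_ACCOUNT}"

# Record what the backend currently knows before anything is destroyed
terraform state list

# Build an explicit destroy plan and print every address it intends to remove
terraform plan -destroy -out destroy.plan
terraform show -json destroy.plan | jq -r '.resource_changes[].address'

# Apply the saved destroy plan instead of letting destroy re-plan internally
terraform apply destroy.plan

That way the planned deletions are pinned in the logs and the apply can't diverge from what was planned.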

The GitHub Actions workflow triggering the provision, which is successful, is:

name: Terraform Provision All
on:
  workflow_dispatch:

jobs:
  provision:
    runs-on: demo-env-only
    environment: demo
    steps:
      - uses: actions/checkout@v3
      - name: Provision Azure Environment
        run: |
          cd Terraform/applications/elk/environments/${{ vars.ENVIRONMENT }}/stage0/demo/infrastructure
          export "ARM_CLIENT_ID=${{ secrets.AZURE_CLIENT_ID}}"
          export "ARM_CLIENT_SECRET=${{ secrets.AZURE_CLIENT_SECRET}}"
          export "ARM_TENANT_ID=${{ secrets.AZURE_TENANT_ID }}"
          export "ARM_SUBSCRIPTION_ID=${{ secrets.AZURE_SUBSCRIPTION_ID }}"
          export "AZURE_STORAGE_ACCOUNT=${{ secrets.AZURE_STORAGE_ACCOUNT }}"
          # -- REDACTED ADDITIONAL VARIABLE EXPORTS FOR BREVITY --
          terraform init -backend-config="storage_account_name=${AZURE_STORAGE_ACCOUNT}"
          terraform plan -out provision.plan
          #terraform plan -target azurerm_windows_virtual_machine.windows_testclient_01 -out provision.plan
          terraform apply provision.plan
          #terraform apply -refresh-only -auto-approve
          #terraform destroy --auto-approve
          rm -rf .terraform

To clarify: if the destroy runs in the same workflow as the provision, everything gets tidied up immediately after provisioning. The issue only appears when the destroy is triggered independently of the GitHub Actions workflow used for provisioning.

It's also strange that the dependency chain isn't being honoured. The virtual machine depends on the network interface, which depends on the subnet, which depends on the virtual network, which depends on the resource group, so I would expect the destroy to remove the virtual machine first, then the interface, then the subnet, then the virtual network, then the resource group (after the unrelated vnet and route table that aren't linked to the VM are destroyed).

If I run a terraform state list just before terraform destroy (i.e. after the terraform init), it shows all the resources, so I gather the state file is getting messed up when terraform destroy itself runs. Why? :frowning:
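
To pin down exactly which addresses disappear and when, my next step is to wrap the destroy with a couple of state snapshots, roughly like this (a sketch only; the file names are just placeholders):

# Snapshot what the backend knows immediately before the destroy
terraform state list | sort > state-before.txt
terraform state pull > state-before.json   # full copy, including dependency info

# Let the destroy run (and keep the job going even if it errors part-way)
terraform destroy -auto-approve || true

# Snapshot again and diff the two address lists to see exactly what was dropped
terraform state list | sort > state-after.txt
diff state-before.txt state-after.txt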

Extra things I have tried/information:

  • Terraform version: required_version = ">=0.31"
  • AzureRM provider version = ~> 3.75.0 (I have tried downgrading this to 3.71.0, the version before the dependency updates; same behavior).
  • Removed the provider aliases for each subscription; same behavior, the manually invoked destroy still fails.
  • If I run terraform plan -out destroy.plan and look at the contents, all resources are included.
  • Provisioning shows “Plan: 61 to add, 0 to change, 0 to destroy.”
  • Deprovisioning shows "Plan: 0 to add, 0 to change, 54 to destroy."
  • In the logs, before the destroy starts, I can see the objects are there and being "refreshed"; I could see that for both the network interface and the virtual machine objects. I can only assume the refresh phase caused the removal from the remote state file. If that assumption is right, under what conditions would that occur? The VM definition is no different to the Windows VM that is provisioned by the other child module. (See the sketch after this list for how I plan to isolate the refresh phase.)
  • Tried splitting the GitHub Actions workflow into two separate jobs, and splitting each subscription into its own module with its own Terraform state file. Having different state files doesn't change the behaviour, even though I've split everything out.
  • The state file for the problem subscription has all resources and shows their dependencies; it looks clean before the destroy. This feels like a bug in the destroy-time Azure check / configuration diff logic forcing state to drop resources that are still present (but I don't know enough to definitively say it's a bug).
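
To test the refresh assumption from the list above, the plan is to isolate the refresh phase from the destroy itself, along these lines (sketch only):

terraform init -backend-config="storage_account_name=${AZURE_STORAGE_ACCOUNT}"
terraform state list | sort > before-refresh.txt

# Refresh state against Azure without planning any create or destroy actions
terraform apply -refresh-only -auto-approve

terraform state list | sort > after-refresh.txt
diff before-refresh.txt after-refresh.txt

If addresses vanish here, the problem is in the refresh; if the lists match, whatever drops the resources happens inside terraform destroy itself.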

I can't make sense of why I'm experiencing this behaviour, and I'm hoping someone has seen this before and can shed some light. I'm sure it's probably operator error. Happy to share any other information that may help diagnose this further.

Thanks
Andrew