Failed to restore access_policies when key_vault is recovered

I have provisioned a key_vault with the azurerm provider feature recover_soft_deleted_key_vaults enabled, together with a separate azurerm_key_vault_access_policy resource.
I would like to test disaster recovery of the keyvault by deleting it from the Azure portal or with the Terraform CLI.
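
For reference, a minimal sketch of the provider configuration this setup implies. As far as I know the azurerm flag is spelled recover_soft_deleted_key_vaults and sits in the key_vault block under features. The purge_soft_delete_on_destroy setting is an assumption added for illustration and is not taken from the post.

```hcl
provider "azurerm" {
  features {
    key_vault {
      # When true, creating a vault whose name matches a soft-deleted vault
      # recovers that vault instead of failing.
      recover_soft_deleted_key_vaults = true

      # Assumption (not from the post): leave destroyed vaults soft-deleted
      # so that a later apply can recover them.
      purge_soft_delete_on_destroy = false
    }
  }
}
```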

In the first case, I deleted the keyvault from the portal, which also deleted its access policies. When I reran terraform apply from the CLI to recover the key_vault resource, Azure also restored the access policy that I had provisioned as a separate (external) resource. Because the access_policy in my code has a depends_on block for the key_vault, it was removed from the tf state file, and Terraform tried to recreate it on apply. That failed with an error saying the access policy resource already exists.
Terraform code:

resource "azurerm_key_vault" "example" {
  name = "examplekeyvault"
}

resource "azurerm_key_vault_access_policy" "example" {
  key_vault_id = azurerm_key_vault.example.id
  tenant_id    = data.azurerm_client_config.current.tenant_id
  object_id    = data.azurerm_client_config.current.object_id

  depends_on = [azurerm_key_vault.example]
}

In the second case, I deleted the key_vault from the Terraform CLI using terraform destroy. This deletes the keyvault along with the access policy, and both are removed from the tf state file. However, when I ran terraform apply, both the keyvault and the access policy were recovered. Terraform then tried to recreate the access_policy resource because it had been deleted, but Azure had already recovered it along with the key_vault.

It is a complex scenario.

Fixes I have tried:

Adding a lifecycle block with ignore_changes = [access_policy] to the key_vault. This fixes my issue; however, if I want to change the access_policy in future, the key_vault will not allow it.
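
A minimal sketch of that first attempt, assuming the intended setting is ignore_changes on the vault's access_policy attribute. The location value is a placeholder and the resource group name is borrowed from the error message later in the thread; neither is taken from the original configuration.

```hcl
data "azurerm_client_config" "current" {}

resource "azurerm_key_vault" "example" {
  name                = "examplekeyvault"
  location            = "westeurope"        # placeholder, not from the post
  resource_group_name = "example-resources" # placeholder
  tenant_id           = data.azurerm_client_config.current.tenant_id
  sku_name            = "standard"

  lifecycle {
    # Terraform no longer reconciles the vault's access_policy attribute,
    # so policies restored by Azure do not show up as drift on the vault.
    # The trade-off is that later edits to inline access_policy blocks
    # on this resource are ignored as well.
    ignore_changes = [access_policy]
  }
}
```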

Moving the external key_vault_access_policy resource into the key_vault as a native (inline) access policy. However, this deletes the existing access policies and recreates them, which will break authentication to my resources.
Is there any way I can fix this issue?

Hi @Krishg,

I have been able to reproduce the issue in only one of your scenarios (unless I have misunderstood them).

Running a terraform apply after terraform destroy:

In this scenario I received no errors. Because the destroy carries out the operations in the reverse order of the apply (due to dependencies), the following occurs during destroy:

azurerm_key_vault_access_policy.example: Destroying... 
azurerm_key_vault_access_policy.example: Destruction complete after 6s
azurerm_key_vault.example: Destroying... 
azurerm_key_vault.example: Destruction complete after 2s

Therefore, when the keyvault is deleted, it no longer has any access policies. So when you next apply there are no errors: the keyvault is created and then the access policies are recreated:

azurerm_key_vault.example: Creating...
azurerm_key_vault.example: Creation complete after 2m8s 
azurerm_key_vault_access_policy.example: Creating...
azurerm_key_vault_access_policy.example: Creation complete after 6s

Running a terraform apply after deleting the keyvault from the portal:

In this case I did see the behaviour you describe:

azurerm_key_vault_access_policy.example: Creating...
╷
│ Error: A resource with the ID "/subscriptions/***/resourceGroups/example-resources/providers/Microsoft.KeyVault/vaults/example-keyvault-sp1999/objectId/*** 
already exists - to be managed via Terraform this resource needs to be imported 
into the State. Please see the resource documentation for "azurerm_key_vault_access_policy" for more information.

This is because the keyvault was deleted while access policies were still attached. As you say, when the azurerm provider restores the keyvault, it does so with the attached access policies. This causes the creation of the azurerm_key_vault_access_policy to fail because the access policies already exist (they appeared after the state refresh, during which the plan determined that they, along with the keyvault, needed to be recreated).
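
The error itself asks for the policy to be imported into state. This does not prevent the failure, but one way to reconcile state afterwards might be a config-driven import block (Terraform 1.5 or later). This is an untested sketch: the ID below is a placeholder that follows the shape of the ID in the error, and the real value should be copied from the error output.

```hcl
# Requires Terraform 1.5 or later (config-driven import).
import {
  # Resource address from the configuration in this thread.
  to = azurerm_key_vault_access_policy.example

  # Placeholder ID: substitute the exact ID printed in the
  # "already exists" error for your subscription, vault and object.
  id = "/subscriptions/<subscription-id>/resourceGroups/example-resources/providers/Microsoft.KeyVault/vaults/<vault-name>/objectId/<object-id>"
}
```

After the next successful apply the import block can be removed. If the resource uses for_each, each instance would need its own block with an indexed address.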

Conclusion

This could probably be reported to the azurerm provider maintainers as an issue (Issues · hashicorp/terraform-provider-azurerm on github.com), but it is somewhat of an edge case.

Unfortunately I don’t have a way to ‘work around’ this issue.

A few comments on these scenarios, however, which would mitigate the risk of them occurring:

  • If you are managing your infrastructure via Terraform, it is good practice to restrict access to the resources via the portal (perhaps only allow read) to ensure that all infrastructure changes must be applied via Terraform and go through the appropriate deployment pipelines, gates and approvals. No access = no ability to delete :)
  • In the case of deleting via the portal, a recovery via the portal would mean that this issue would not occur (providing the recovery was done before a terraform apply was accepted). Again, I would expect this issue to be picked up at a pipeline plan stage, where the pre-apply gate/reviewer would see the keyvault and its access policies shown as needing to be created when that was not expected.
  • The addition of a delete resource lock (applied outside of the module that creates the keyvault) would prevent the keyvault being deleted via Terraform or the portal without manual intervention (to remove the lock):
    • The lock could be applied via a separate terraform module and pipeline
    • The lock could be applied via Azure Policy at resource creation (This could be a policy specifically targeting keyvault resources)
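
As a rough sketch of the lock option, assuming it lives in a separate module that receives the vault's resource ID as an input. The variable name, lock name and notes text are hypothetical.

```hcl
# Hypothetical input: the ID of the key vault created by the other module.
variable "key_vault_id" {
  type        = string
  description = "Resource ID of the existing key vault to protect."
}

resource "azurerm_management_lock" "key_vault_no_delete" {
  name  = "keyvault-no-delete" # hypothetical lock name
  scope = var.key_vault_id

  # CanNotDelete still allows reads and updates, but blocks deletion
  # until the lock is removed.
  lock_level = "CanNotDelete"
  notes      = "Remove this lock manually before deleting the key vault."
}
```

Because the lock is managed in a different state from the vault, a terraform destroy of the vault module should fail until someone removes the lock deliberately.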

Hope that helps.

Lastly:

As you are referring (via an expression reference) to the attribute azurerm_key_vault.example.id in the resource below, this creates an implicit dependency on azurerm_key_vault.example. Therefore your explicit dependency is not required.

resource "azurerm_key_vault_access_policy" "example" {
  key_vault_id = azurerm_key_vault.example.id # <------- Implicit dependency
  tenant_id    = data.azurerm_client_config.current.tenant_id
  object_id    = data.azurerm_client_config.current.object_id
  depends_on   = [azurerm_key_vault.example] # <------ This is not needed
}

As per "The depends_on Meta-Argument" (Configuration Language, Terraform, HashiCorp Developer):

You should use depends_on as a last resort because it can cause Terraform to create more conservative plans that replace more resources than necessary.

Happy Terraforming

Thanks for your response @ExtelligenceIT. Yes, your case 1 is valid. Workarounds are already in place. As you said, these are edge cases, but they need to be handled in case of any uncertainty.

In my original code, the access policy was created by looping over external variables rather than directly on the keyvault resource. Here is the updated code:

resource "azurerm_key_vault_access_policy" "example" {
  for_each     = var.key.access
  key_vault_id = azurerm_key_vault.example.id
  tenant_id    = data.azurerm_client_config.current.tenant_id
  object_id    = data.azurerm_client_config.current.object_id

  depends_on = [azurerm_key_vault.example]
}

The interesting part is why Terraform doesn’t have any way to refresh/update the state when the other resource gets restored…

I think the issue here is that the state is refreshed, but only at the start of the planning phase. At this point the actual state and the desired state differ in both the keyvault and the access policies. The plan is then computed based on these differences (both resources).

At apply time, the dependency graph determines that the keyvault should be deployed / restored first. This happens, but it also restores the access policy. The restore is handled by Azure Resource Manager, so Terraform has no control over exactly what it does. As Terraform is working from its plan, it goes to apply these access policies and finds that they have ‘magically’ appeared without Terraform having deployed them. They are therefore treated as an existing resource that now needs to be imported.