How to handle VM backup restoration jobs with Azure Backup?

We are using Azure Backup to back up and restore virtual machines. The goal is to create snapshots of the virtual machines at a set interval and to be able to roll the virtual machines back to those snapshots. We want to use the native restore experience, as documented here: Restore VMs by using the Azure portal using Azure Backup - Azure Backup | Microsoft Learn. We also want to keep managing the restored disks with Terraform; for example, resizing a disk after a restoration job should remain possible.

We are deploying the following resources:

  • azurerm_linux_virtual_machine
  • azurerm_managed_disk
  • azurerm_virtual_machine_data_disk_attachment

Let's say a minimal deployment looks something like this:

resource "azurerm_linux_virtual_machine" "my_vm" {
  name                = "my-vm"
  resource_group_name = azurerm_resource_group.my_resource_group.name
  location            = azurerm_resource_group.my_resource_group.location
  size                = "Standard_B4s_v2"
  admin_username      = "adminuser"
  admin_password      = "mypassword"
  # Required when using password authentication; the provider rejects
  # admin_password while disable_password_authentication is left at its default (true).
  disable_password_authentication = false

  network_interface_ids = [
    azurerm_network_interface.my_nic.id,
  ]

  os_disk {
    name                 = "my-os-disk"
    caching              = "ReadWrite"
    storage_account_type = "Standard_LRS"
    disk_size_gb         = 64
  }

  source_image_reference {
    publisher = "Canonical"
    offer     = "0001-com-ubuntu-server-jammy"
    sku       = "22_04-lts"
    version   = "latest"
  }
}

resource "azurerm_managed_disk" "my_disk" {
  name                 = "my-data-disk"
  resource_group_name  = azurerm_resource_group.my_resource_group.name
  location             = azurerm_resource_group.my_resource_group.location
  storage_account_type = "Standard_LRS"
  create_option        = "Empty"
  disk_size_gb         = 128
}

resource "azurerm_virtual_machine_data_disk_attachment" "my_disk_attachment" {
  virtual_machine_id = azurerm_linux_virtual_machine.my_vm.id
  managed_disk_id    = azurerm_managed_disk.my_disk.id
  lun                = "0"
  caching            = "ReadWrite"
}
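
For context, the backup side referenced in the plan output further down (the Recovery Services vault and backup policy) looks roughly like this. The schedule, retention, and the azurerm_backup_protected_vm block are illustrative values, not our exact configuration:

resource "azurerm_recovery_services_vault" "recovery_services_vault" {
  name                = "my-recovery-services-vault"
  resource_group_name = azurerm_resource_group.my_resource_group.name
  location            = azurerm_resource_group.my_resource_group.location
  sku                 = "Standard"
}

resource "azurerm_backup_policy_vm" "recovery_services_vault_policy" {
  name                = "my-recovery-services-vault-policy"
  resource_group_name = azurerm_resource_group.my_resource_group.name
  recovery_vault_name = azurerm_recovery_services_vault.recovery_services_vault.name

  # Snapshot interval and retention are illustrative.
  backup {
    frequency = "Daily"
    time      = "23:00"
  }

  retention_daily {
    count = 7
  }
}

# Ties the VM to the backup policy so restore points are created for it.
resource "azurerm_backup_protected_vm" "my_vm_protection" {
  resource_group_name = azurerm_resource_group.my_resource_group.name
  recovery_vault_name = azurerm_recovery_services_vault.recovery_services_vault.name
  source_vm_id        = azurerm_linux_virtual_machine.my_vm.id
  backup_policy_id    = azurerm_backup_policy_vm.recovery_services_vault_policy.id
}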

After performing a VM restoration job with the "Replace existing" (replace disks) option, the original OS disk and data disk are detached. A new OS disk and a new data disk are created from the recovery point, and the restored data disk is attached at the same LUN as the old one.

These actions break future deployments. On the next plan, Terraform notices that the OS disk name has changed (which forces replacement of the VM) and that the data disk attachment it knows about no longer exists. It therefore proposes to rebuild the VM and to attach the Terraform-managed data disk at LUN 0, which is already occupied by the restored data disk, so the apply fails with an error.

azurerm_resource_group.my_resource_group: Refreshing state... [id=/subscriptions/5ab24a52-44e0-4bdf-a879-cc38371a4403/resourceGroups/rsv-test]
azurerm_recovery_services_vault.recovery_services_vault: Refreshing state... [id=/subscriptions/5ab24a52-44e0-4bdf-a879-cc38371a4403/resourceGroups/rsv-test/providers/Microsoft.RecoveryServices/vaults/my-recovery-services-vault]
azurerm_virtual_network.my_virtual_network: Refreshing state... [id=/subscriptions/5ab24a52-44e0-4bdf-a879-cc38371a4403/resourceGroups/rsv-test/providers/Microsoft.Network/virtualNetworks/my-virtual-network]
azurerm_managed_disk.my_disk: Refreshing state... [id=/subscriptions/5ab24a52-44e0-4bdf-a879-cc38371a4403/resourceGroups/rsv-test/providers/Microsoft.Compute/disks/my-data-disk]
azurerm_backup_policy_vm.recovery_services_vault_policy: Refreshing state... [id=/subscriptions/5ab24a52-44e0-4bdf-a879-cc38371a4403/resourceGroups/rsv-test/providers/Microsoft.RecoveryServices/vaults/my-recovery-services-vault/backupPolicies/my-recovery-services-vault-policy]
azurerm_storage_account.recovery_services_staging_storage_account: Refreshing state... [id=/subscriptions/5ab24a52-44e0-4bdf-a879-cc38371a4403/resourceGroups/rsv-test/providers/Microsoft.Storage/storageAccounts/mystagingstorageaccount5]
azurerm_subnet.my_subnet: Refreshing state... [id=/subscriptions/5ab24a52-44e0-4bdf-a879-cc38371a4403/resourceGroups/rsv-test/providers/Microsoft.Network/virtualNetworks/my-virtual-network/subnets/my-subnet]
azurerm_network_interface.my_nic: Refreshing state... [id=/subscriptions/5ab24a52-44e0-4bdf-a879-cc38371a4403/resourceGroups/rsv-test/providers/Microsoft.Network/networkInterfaces/my-nic]
azurerm_linux_virtual_machine.my_vm: Refreshing state... [id=/subscriptions/5ab24a52-44e0-4bdf-a879-cc38371a4403/resourceGroups/rsv-test/providers/Microsoft.Compute/virtualMachines/my-vm]
azurerm_virtual_machine_data_disk_attachment.my_disk_attachment: Refreshing state... [id=/subscriptions/5ab24a52-44e0-4bdf-a879-cc38371a4403/resourceGroups/rsv-test/providers/Microsoft.Compute/virtualMachines/my-vm/dataDisks/my-data-disk]

Terraform used the selected providers to generate the following execution plan. Resource actions are indicated with the following symbols:
  + create
-/+ destroy and then create replacement

Terraform will perform the following actions:

  # azurerm_linux_virtual_machine.my_vm must be replaced
-/+ resource "azurerm_linux_virtual_machine" "my_vm" {
      ~ computer_name                                          = "my-vm" -> (known after apply)
      + disk_controller_type                                   = (known after apply)
      - encryption_at_host_enabled                             = false -> null
      ~ id                                                     = "/subscriptions/5ab24a52-44e0-4bdf-a879-cc38371a4403/resourceGroups/rsv-test/providers/Microsoft.Compute/virtualMachines/my-vm" -> (known after apply)
        name                                                   = "my-vm"
      ~ private_ip_address                                     = "10.0.1.4" -> (known after apply)
      ~ private_ip_addresses                                   = [
          - "10.0.1.4",
        ] -> (known after apply)
      + public_ip_address                                      = (known after apply)
      ~ public_ip_addresses                                    = [] -> (known after apply)
      - secure_boot_enabled                                    = false -> null
      - tags                                                   = {} -> null
      ~ virtual_machine_id                                     = "94c49729-36be-4c1a-9f0d-9ab19b616cfc" -> (known after apply)
      - vtpm_enabled                                           = false -> null
        # (17 unchanged attributes hidden)

      ~ os_disk {
          ~ name                      = "myvm-osdisk-20240808-114805" -> "my-os-disk" # forces replacement
            # (4 unchanged attributes hidden)
        }

        # (1 unchanged block hidden)
    }

  # azurerm_virtual_machine_data_disk_attachment.my_disk_attachment will be created
  + resource "azurerm_virtual_machine_data_disk_attachment" "my_disk_attachment" {
      + caching                   = "ReadWrite"
      + create_option             = "Attach"
      + id                        = (known after apply)
      + lun                       = 0
      + managed_disk_id           = "/subscriptions/5ab24a52-44e0-4bdf-a879-cc38371a4403/resourceGroups/rsv-test/providers/Microsoft.Compute/disks/my-data-disk"
      + virtual_machine_id        = (known after apply)
      + write_accelerator_enabled = false
    }

Plan: 2 to add, 0 to change, 1 to destroy.

────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────

Note: You didn't use the -out option to save this plan, so Terraform can't guarantee to take exactly these actions if you run "terraform apply" now.

How should we approach this situation so that we can manage the restored disks with Terraform and not break future deployments?


Any luck with this? I have the same use case as well. It sounds like a common topic, since there's an additional thread asking the same thing.

Hi Andrew, thanks for following up.

In the end, I resorted to adding a pre-deployment script to my CI. This script checks whether the disks attached to the VM were restored by a backup restoration job, by querying the VM with the Azure CLI and checking whether the createOption equals Restore. If so, it removes the old disk and disk attachment resources from the Terraform state and adds the restored resources to the state. It is not my ideal solution, but it is working well.

Unfortunately I cannot share the exact example, as I no longer have access to that source code, but I hope it helps either way.
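
From memory, the flow was roughly the following (a rough sketch using the example resource names from the first post; illustrative only, not the original script):

#!/usr/bin/env bash
# Rough pre-deployment check (illustrative). Assumes the example names from
# the first post: resource group "rsv-test", VM "my-vm", data disk at LUN 0,
# and the Terraform addresses azurerm_managed_disk.my_disk and
# azurerm_virtual_machine_data_disk_attachment.my_disk_attachment.
set -euo pipefail

RG="rsv-test"
VM="my-vm"

# Find the disk currently attached at LUN 0 and ask Azure how it was created.
# A disk produced by a backup restoration job reports createOption "Restore".
VM_ID=$(az vm show --resource-group "$RG" --name "$VM" --query 'id' --output tsv)
DISK_ID=$(az vm show --resource-group "$RG" --name "$VM" \
  --query 'storageProfile.dataDisks[?lun==`0`].managedDisk.id | [0]' --output tsv)
CREATE_OPTION=$(az disk show --ids "$DISK_ID" --query 'creationData.createOption' --output tsv)

if [ "$CREATE_OPTION" = "Restore" ]; then
  DISK_NAME=$(az disk show --ids "$DISK_ID" --query 'name' --output tsv)

  # Drop the stale pre-restore objects from the Terraform state...
  terraform state rm azurerm_managed_disk.my_disk
  terraform state rm azurerm_virtual_machine_data_disk_attachment.my_disk_attachment

  # ...and adopt the restored disk and its attachment in their place.
  terraform import azurerm_managed_disk.my_disk "$DISK_ID"
  terraform import azurerm_virtual_machine_data_disk_attachment.my_disk_attachment \
    "$VM_ID/dataDisks/$DISK_NAME"
fi

Note that after the state surgery the Terraform configuration still has to match the restored disk (name, create_option, size), otherwise the next plan will again want to replace it.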

This does not address the question of how to manage the Terraform state before and after the backup.

Hi iTiamo, thank you for sharing! I'm leaning towards a similar approach. I basically see two approaches:

  • Approach 1 - Manual Restore / Manual TF State Management: This is similar to the approach you've outlined: give operations the ability to perform restores directly via the console or CLI, and then clean up the TF state post-restore. I hadn't thought of incorporating it into the CI/CD pipeline itself, but either way the concepts are similar.
  • Approach 2 - TF Restore / Automated TF State Management: This approach ring-fences the restore itself inside the TF code. For small deployments I don't see a lot of issues with this, but for larger deployments where you want "immutable" infra repos to progress through staged environments, you would need a way to parameterize the restore via feature flags / import blocks: the TF code is pre-baked with the restored VM but doesn't provision it unless the feature flag is turned on (rough sketch below). Said another way, if you're managing one VM you'd define the VM and the disks in TF code, and then you'd have a separate VM resource with disks set to create mode 'restore' which would only be provisioned if a TF variable was set.

I'm unsure about the technical feasibility of Option 2 for VMs specifically. I know it's possible with PSQL Flexible Server, which has the create-mode optionality and the ability to input a specific restore point.
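
To make the feature-flag side of Approach 2 concrete, something like the below is what I have in mind (an untested sketch restructuring the example from the first post; the variable names and the Copy-based wiring are my assumptions, not anything verified for Azure Backup restores):

# Hypothetical flag and restore source; not part of the original deployment above.
variable "restore_mode" {
  type    = bool
  default = false
}

variable "restore_source_disk_id" {
  description = "ID of the restored disk (or snapshot) to copy from when restore_mode is true."
  type        = string
  default     = ""
}

# The normally provisioned data disk is only managed while not in restore mode.
resource "azurerm_managed_disk" "my_disk" {
  count                = var.restore_mode ? 0 : 1
  name                 = "my-data-disk"
  resource_group_name  = azurerm_resource_group.my_resource_group.name
  location             = azurerm_resource_group.my_resource_group.location
  storage_account_type = "Standard_LRS"
  create_option        = "Empty"
  disk_size_gb         = 128
}

# Pre-baked restored disk, dormant until the flag is flipped. "Copy" from a
# restored disk/snapshot is used here for illustration; whether a direct
# restore-style create option fits the scenario needs checking against the provider.
resource "azurerm_managed_disk" "my_restored_disk" {
  count                = var.restore_mode ? 1 : 0
  name                 = "my-data-disk-restored"
  resource_group_name  = azurerm_resource_group.my_resource_group.name
  location             = azurerm_resource_group.my_resource_group.location
  storage_account_type = "Standard_LRS"
  create_option        = "Copy"
  source_resource_id   = var.restore_source_disk_id
  disk_size_gb         = 128
}

resource "azurerm_virtual_machine_data_disk_attachment" "my_disk_attachment" {
  virtual_machine_id = azurerm_linux_virtual_machine.my_vm.id
  managed_disk_id    = var.restore_mode ? azurerm_managed_disk.my_restored_disk[0].id : azurerm_managed_disk.my_disk[0].id
  lun                = "0"
  caching            = "ReadWrite"
}

The open question for VMs is the OS disk, since it lives inside the azurerm_linux_virtual_machine block rather than as a standalone resource, so the same flag trick doesn't apply cleanly there.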

I'm not sure about the feasibility of option 2 either. For me, the use case I wanted to support was in-place restoration of virtual machines only, and the approach outlined above has proven to work for 100+ VMs. The script was fairly simple to run pre-deployment in CI.


I've been digging into this and I'm now considering lifecycle changes as a potential approach. Was this considered as an option when you were designing this?
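
Concretely, I'm thinking of something along these lines (a minimal sketch on top of the VM from the first post, not tested against a real restore):

resource "azurerm_linux_virtual_machine" "my_vm" {
  # ... same arguments as in the original example ...

  lifecycle {
    # Ignore the OS disk name drift introduced by the restore job
    # (e.g. "myvm-osdisk-20240808-114805"), so it no longer forces
    # the VM to be replaced on the next plan.
    ignore_changes = [
      os_disk[0].name,
    ]
  }
}

That would only silence the os_disk name drift that forces the VM replacement in the plan above; the data disk attachment pointing at the restored disk would still need to be re-imported or fixed up in state.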

Would be great to hear from Azure or Hashicorp on this topic as well.