Ignoring changes for a specific provider during disaster recovery

I’m building an application that, for disaster recovery purposes, is active/active in two regions. To do this, my Terraform entry point uses two different provider blocks (one for region A and another for region B) and calls a reusable module twice, passing the appropriate provider. This is almost exactly the guidance here: https://www.terraform.io/docs/configuration/modules.html#passing-providers-explicitly

I prefer this setup because it makes it easy to reference things created in the other region; enabling data replication, for example, is very easy. However, I just realized that if one region goes down, the provider connecting to the failed region will hang completely. That means during an outage no terraform apply runs would be possible, which is a problem!

Is there a way for a specific run to ignore a provider completely, so that in a DR situation I could still make changes?

TL;DR:
Use the -target argument of terraform plan / terraform apply so Terraform only includes the resources (and therefore the providers) of the available region.


To demonstrate, use the following Terraform code to test against a working and a “broken” Docker provider:

main.tf

# ~~~~~~~~~~~~~~~~
# Available Region
# ~~~~~~~~~~~~~~~~
provider "docker" {
  alias = "up"
  host  = "unix:///var/run/docker.sock"
}

module "working_region" {
  source = "./vault_module"
  providers = {
    docker = docker.up
  }

  container_name = "vault_region1"
}

# ~~~~~~~~~~~~~~~~~~
# Unavailable Region
# ~~~~~~~~~~~~~~~~~~
provider "docker" {
  alias = "down"

  # Configure provider with some non-existent IP or Port
  host = "tcp://127.0.0.1:5555/"
}

module "broken_region" {
  source = "./vault_module"
  providers = {
    docker = docker.down
  }

  container_name = "vault_region2"
}

vault_module/main.tf

variable "container_name" {
  default = "vault"
}

# Create a container
resource "docker_container" "vault" {
  image = "vault"
  name  = var.container_name

  capabilities {
    add = ["IPC_LOCK"]
  }

  env = [
    "VAULT_DEV_ROOT_TOKEN_ID=myroot",
  ]

  ports {
    internal = 8200
    external = 8200
  }
}

If you run a plain terraform apply, Terraform will complain that it can’t contact one of the Docker daemons:

$ terraform apply

Error: Error pinging Docker server: Cannot connect to the Docker daemon at tcp://127.0.0.1:5555/. Is the docker daemon running?

  on main.tf line 21, in provider "docker":
  21: provider "docker" {

However, if you run with -target=module.working_region, Terraform will ignore the broken region and proceed:

$ terraform apply -target=module.working_region

...

Plan: 1 to add, 0 to change, 0 to destroy.
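If the surviving region spans more than one module, -target can be repeated; each flag adds another address to the targeted set. A hedged sketch using the module names from the example above (module.working_region_dns is hypothetical, just to show repetition):

```shell
# Preview the targeted run without ever contacting the broken region's provider
terraform plan -target=module.working_region

# -target may be given multiple times to include several addresses
# (module.working_region_dns is a hypothetical second module)
terraform apply \
  -target=module.working_region \
  -target=module.working_region_dns
```

Recent Terraform versions also print a warning whenever targeting is in effect, which is a useful reminder to run a full, untargeted apply once the failed region recovers.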

Thanks Matt, this helps a lot!

-target is definitely an option, and although we do normally caution against using it in non-exceptional circumstances, disaster recovery is hopefully an exceptional circumstance and therefore a reasonable time to use it!

I did want to share the other alternative though: we usually recommend using multiple configurations to allow more granular updates, separating things that don’t need to be updated together into separate configurations with their own separate states. That way your everyday use keeps changes relatively isolated, not just during disaster recovery.

To achieve this at a previous job (before I was working at HashiCorp), we split the lowest-level infrastructure for an environment into separate configurations: one configuration per region, plus one more “global” configuration that gathered the results of the per-region configurations and dealt with the remaining objects that are not region-specific, such as a DNS zone.
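As a rough sketch of that layout (all names, the AWS provider, and the backend details are illustrative assumptions, not from the original setup): each region gets its own root configuration and state, and the global configuration reads their outputs via the terraform_remote_state data source:

```hcl
# region-a/main.tf -- one root configuration (and one state) per region
provider "aws" {
  region = "us-east-1"
}

module "app" {
  source = "../modules/app"
}

# Expose what the global configuration needs
output "lb_dns_name" {
  value = module.app.lb_dns_name
}
```

```hcl
# global/main.tf -- gathers per-region outputs for region-agnostic objects
data "terraform_remote_state" "region_a" {
  backend = "s3"
  config = {
    bucket = "example-tf-state" # illustrative bucket name
    key    = "region-a/terraform.tfstate"
    region = "us-east-1"
  }
}

resource "aws_route53_record" "app" {
  zone_id = var.zone_id
  name    = "app.example.com"
  type    = "CNAME"
  ttl     = 60
  records = [data.terraform_remote_state.region_a.outputs.lb_dns_name]
}
```

With this split, an outage in one region only blocks applies of that region’s own configuration; you simply skip it during the incident, with no need for -target at all.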

Then in everyday use we were updating only one region at a time. For certain changes that of course made us take multiple steps, which was sometimes inconvenient, but on the whole it was nice to keep the scope of changes relatively small in case something went wrong: it’s easier to recover from a mishap affecting only a small portion of the total workload than a mishap affecting everything.

Both of these approaches are fine and have some different tradeoffs. If your routine work is helped by having everything grouped together and you’re only using -target in exceptional circumstances for disaster recovery then it could be okay. I would caution that if you find yourself routinely using -target for some reason then that’s a good signal that it’s time to refactor into multiple configurations.