"null_resource" not running again when the Terraform configuration is re-applied

Hello there,

I have a problem that I couldn't find a solution to. I have experimented a lot and gathered a lot of information, but I feel like I have gone blind to it, so I wanted to consult you.

I use Google Cloud and manage my resources entirely with Terraform. I have included my Terraform configuration below.

What I want to do:
1- Create my disks and server (this works).
2- Run my commands with "remote-exec" after the resources are created (this works too).
3- Automatically repeat the same operations in a disaster scenario (this is where the problem is).

My problem is:
When I apply the configuration below for the first time, everything works and I get a clean, complete installation. But when I delete the instance and boot disk and then create them again, the "remote-exec" commands inside the "null_resource" do not run again.

My goal is to handle a disaster scenario. I attach the boot disk and the data disk to my server as separate resources: Linux and the application services live on the boot disk, and the applications' data lives on the data disk. I configure the disks so that they are not deleted when the instance is deleted, so whatever happens, my disks survive.

I don’t have any problems so far.

But when something goes wrong with Linux or the application, I want to be able to run the Terraform code again. When I re-apply and the boot disk is recreated, I want the "remote-exec" commands in the "null_resource" to run again automatically.

I am testing this by manually deleting the instance and boot disk in the Google Cloud console. After deleting them, "terraform plan" shows that the boot disk and instance will be created again without any problem. But the "null_resource", and therefore the "remote-exec" commands inside it, are not run again, so I cannot automatically configure my Linux server and install my applications.

I have found and tried many articles about null_resource, but without success. I guess either I haven't understood how to use this resource properly or I have confused myself.

To summarize what I want: if the boot disk has been deleted and I run the Terraform code again, the null_resource remote-exec commands should run again, together with the recreated boot disk and instance. If the boot disk has not been recreated, the null_resource should stay exactly as it was after the first run; even if I add another resource to the configuration or change the instance, the null_resource must not be replaced.

Normally I create 3 servers with "count" (for example, an Elasticsearch cluster), but for testing I have reduced the count to 1. Once I see this working on 1 server, I will raise the count back to 3 and apply the same approach to my other servers…

Sorry for the long explanation. I'm open to suggestions on how to do this.

variable "regions" {
  default = "europe-north1"
}

variable "zones" {
  default = ["europe-north1-a", "europe-north1-b", "europe-north1-c"]
}

variable "instance_name" {
  default = {
    "xxxapplication" = "example-apps-server"
  }
}

variable "instance_count" {
  default = {
    "xxxapplication" = "1"
  }
}

variable "internal_ip_pools" {
  default = ["192.168.1.25", "192.168.1.26", "192.168.1.27"]
}


### Boot Disk
resource "google_compute_disk" "my_test_servers_instance_boot_disk" {
  name = "${var.instance_name["xxxapplication"]}${format("%02d", count.index+1)}-boot-disk"
  zone = var.zones[count.index % length(var.zones)]
  image = "debian-latest"
  count = var.instance_count["xxxapplication"]
  type  = "pd-ssd"
  size = "50"
}

### Data Disk
resource "google_compute_disk" "my_test_servers_instance_data_disk" {
  depends_on = [google_compute_instance.my_test_servers_instance]
  name = "${var.instance_name["xxxapplication"]}${format("%02d", count.index+1)}-data-disk"
  zone = var.zones[count.index % length(var.zones)]
  count = var.instance_count["xxxapplication"]
  type  = "pd-ssd"
  size = "100"
}

### Attach Disk
resource "google_compute_attached_disk" "my_test_servers_instance_attach_data_disk" {
  depends_on = [google_compute_disk.my_test_servers_instance_data_disk]
  mode = "READ_WRITE"
  zone = var.zones[count.index % length(var.zones)]
  count = var.instance_count["xxxapplication"]
  disk = element(google_compute_disk.my_test_servers_instance_data_disk.*.name, count.index)
  instance = element(google_compute_instance.my_test_servers_instance.*.self_link, count.index)
}


resource "google_compute_instance" "my_test_servers_instance" {
  depends_on = [google_compute_disk.my_test_servers_instance_boot_disk]
  name = "${var.instance_name["xxxapplication"]}${format("%02d", count.index+1)}"
  hostname = "${var.instance_name["xxxapplication"]}${format("%02d", count.index+1)}.odeeontechnology.com"
  machine_type = var.gcp_instance_size["4cpu-22mem"]

  zone = var.zones[count.index % length(var.zones)]
  count = var.instance_count["xxxapplication"]
  tags = ["ssh-access"]

  boot_disk {
    auto_delete = false
    source = element(google_compute_disk.my_test_servers_instance_boot_disk.*.name, count.index)
  }

  lifecycle {
    ignore_changes = [attached_disk]
  }

  metadata = {
    serial-port-enable = "true"
  }


  network_interface {
    network_ip = var.internal_ip_pools[count.index % length(var.internal_ip_pools)]
    network = google_compute_network.master.self_link
    subnetwork = google_compute_subnetwork.secondary.name


    access_config {
      nat_ip = element(google_compute_address.my_test_servers_public_ip.*.address, count.index)
    }
  }

}

resource "null_resource" "run_commands" {
  count = var.instance_count["xxxapplication"]

  depends_on = [google_compute_attached_disk.my_test_servers_instance_attach_data_disk]

  triggers = {
    disklist = element(google_compute_disk.my_test_servers_instance_boot_disk.*.name, count.index)
  }

  provisioner "remote-exec" {
    on_failure = continue
    connection {
      type = "ssh"
      user = "testuser"
      host = element(google_compute_address.my_test_servers_public_ip.*.address, count.index)
      agent = false
      password = "1234555"
    }

    inline = [
      "sudo ls -lah /etc >> /tmp/list.txt",
      # ...
    ]
  }
}

By the way, I forgot to mention something: previously I had the "remote-exec" provisioner block inside the "google_compute_instance" resource, but I had to move it into the "null_resource" because of the automatic disk setup and the custom scripts I run.

With the provisioner inside "google_compute_instance" I don't have this re-run problem, but there I can't do what I need.

The reason is that the data disk is attached only after the instance has been created. A provisioner inside the instance resource runs before the disk is attached, so my automatic disk configuration script has nothing to work on. I could work around it with things like "nohup" on Linux, but that makes everything much harder.
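
To illustrate, this is roughly what the earlier setup looked like (a rough sketch, not my exact code; the script path is just an example name):

resource "google_compute_instance" "my_test_servers_instance" {
  # ... same arguments as in the configuration above ...

  # This provisioner runs right after the instance is created, which is
  # before google_compute_attached_disk has attached the data disk.
  provisioner "remote-exec" {
    connection {
      type     = "ssh"
      user     = "testuser"
      host     = self.network_interface[0].access_config[0].nat_ip
      agent    = false
      password = "1234555"
    }

    inline = [
      # Example disk setup script (hypothetical path); at this point the
      # data disk is not attached yet, so it has nothing to configure.
      "sudo /opt/scripts/prepare-data-disk.sh",
    ]
  }
}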

I chose to use “null_resource” as it is easy and convenient for me.

I think I have solved my problem. I had simply gone blind from dealing with too many details.

I changed the trigger in the "null_resource" resource as follows and it works now; my tests were successful.

If my method is not correct, I am open to your suggestions. If I'm doing it right, this topic can be closed as solved.

  triggers = {
    ids = element(google_compute_disk.my_test_servers_instance_boot_disk.*.id, count.index)
  }
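
As far as I understand, the difference is that the disk "name" is set in the configuration, so after the disk is deleted and recreated the trigger value is exactly the same and Terraform sees no change. The disk "id", on the other hand, is a computed attribute: when the boot disk has to be recreated, its id is not known until apply, so Terraform treats the trigger as changed, replaces the null_resource, and the remote-exec commands run again. When the boot disk is untouched, the id does not change and the null_resource is left alone, which is exactly the behavior I wanted.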
