null_resource with remote-exec provisioner errors even though the script exited successfully, after a failure in the previous run

Hello, I just upgraded from Terraform 1.3.3 to 1.3.9 after I got tired of seeing it crash on several occasions. The upgrade fixed the crash, but now I’m seeing another problem that I don’t believe occurred earlier.

A null_resource is used to run a bunch of local and remote provisioners (we need this for the complex cluster/storage stack config), and recently I added collection of the output/exitstatus files.

The null_resource reports a failure even when all of the provisioners have completed successfully. Debug output revealed this error:


provider.stdio: received EOF, stopping recv loop: err="rpc error: code = Unavailable desc = error reading from server: EOF"

The problem is interesting: it only seems to happen when remote-exec has failed once before and terraform apply is repeated after the error in the script has been fixed. A failure in the null_resource means the subsequent resources are not run. Until recently the null_resource was the last item, but script output handling has since been added, so the problem might have existed in earlier versions and simply gone unnoticed. A third rerun will not even rerun the null_resource and will proceed to the end.

Relevant code:

resource "null_resource" "cluster_config" {
  depends_on = [
    local_file.scale_config_params
  ]

  # Changes to any instance of the cluster requires re-provisioning
  triggers = {
...
}
...
 provisioner "remote-exec" {
    inline    = [
        "set -o errexit",
        "chmod +x  /home/${var.admin.sshUserName}/scale-config-launchpad.sh",
        "/bin/sudo /home/${var.admin.sshUserName}/scale-config-launchpad.sh ${local.masterNodeFEIP} ${var.clusterDeployMode} ${local.clusterExpCounter} /home/${var.admin.sshUserName}/${local.scaleParmsFile} >/home/${var.admin.sshUserName}/stdout",


    ]
    on_failure = fail    # OR continue
  }

  # Return code file 'exitstatus' on master is created by scale-config-launchpad.sh; 'stdout' was generated in 'remote_exec'
  # Gather both those file locally for further processing of the output.
  provisioner "local-exec" {
    command =  <<EOC
      scp -i ${var.admin.sshPrivatekeyPath} -o LogLevel=error -o StrictHostKeyChecking=no -o "UserKnownHostsFile=/dev/null" ${self.triggers.user}@${self.triggers.host}:/home/${self.triggers.user}/stdout ${local.localTempdir}/;
      scp -i ${var.admin.sshPrivatekeyPath} -o LogLevel=error -o StrictHostKeyChecking=no -o "UserKnownHostsFile=/dev/null" ${self.triggers.user}@${self.triggers.host}:/home/${self.triggers.user}/exitstatus ${local.localTempdir}/;
      ls -l ${local.localTempdir}/ ;
    EOC
    interpreter = ["/bin/bash", "-c"]
  }
}

resource "random_uuid" "complete" {
  depends_on = [
    null_resource.cluster_config
  ]
}

#--- from data-sources.tf---
data "local_file" "scaleConfigStatus" {
  filename = "${local.localTempdir}/exitstatus"

  depends_on = [
   resource.random_uuid.complete
  ]

  lifecycle {
    precondition {
      condition     = random_uuid.complete.result != null
      error_message = "Scale config did not finish successfully"
    }
  }
}

data "local_file" "scaleConfigOutput" {
  filename = "${local.localTempdir}/stdout"

  depends_on = [
   resource.random_uuid.complete
  ]

  lifecycle {
    precondition {
      condition     = random_uuid.complete.result != null
      error_message = "Scale config did not finish successfully"
    }
  }
}

Upon first run - remote-exec fails:

module.vmscale-gcp[0].null_resource.cluster_config (local-exec): Preparing source code to upload...

module.vmscale-gcp[0].null_resource.cluster_config (local-exec): Packing up scale backend config dirs from '/Users/neetagarimella/syscale' using the following tar command.

module.vmscale-gcp[0].null_resource.cluster_config (local-exec): tar czf /Users/neetagarimella/syscale/TF/tmp/scale-config-load-VERSION_STRING.tar.gz tools lib

module.vmscale-gcp[0].null_resource.cluster_config: Provisioning with 'file'...

module.vmscale-gcp[0].null_resource.cluster_config: Provisioning with 'file'...

module.vmscale-gcp[0].null_resource.cluster_config: Provisioning with 'file'...

module.vmscale-gcp[0].null_resource.cluster_config: Provisioning with 'remote-exec'...

module.vmscale-gcp[0].null_resource.cluster_config (remote-exec): Connecting to remote host via SSH...

module.vmscale-gcp[0].null_resource.cluster_config (remote-exec): Host: 34.145.97.171

module.vmscale-gcp[0].null_resource.cluster_config (remote-exec): User: sycompacct

module.vmscale-gcp[0].null_resource.cluster_config (remote-exec): Password: false

module.vmscale-gcp[0].null_resource.cluster_config (remote-exec): Private key: true

module.vmscale-gcp[0].null_resource.cluster_config (remote-exec): Certificate: false

module.vmscale-gcp[0].null_resource.cluster_config (remote-exec): SSH Agent: true

module.vmscale-gcp[0].null_resource.cluster_config (remote-exec): Checking Host Key: false

module.vmscale-gcp[0].null_resource.cluster_config (remote-exec): Target Platform: unix

module.vmscale-gcp[0].null_resource.cluster_config (remote-exec): Connected!

module.vmscale-gcp[0].null_resource.cluster_config (remote-exec): Error: The actual scale nodecount of '4' in the cluster is different from the passed current scale node count '3'. Please review parameters.

module.vmscale-gcp[0].null_resource.cluster_config (remote-exec): #LINE#Failed to retrieve or validate parameters from '/home/sycompacct/scaleParams_E1.json'. Execution time: '2'seconds, exit code: '2'. #LINE#

module.vmscale-gcp[0].null_resource.cluster_config (remote-exec): For installation log, please refer to log file 'sycomp-scale-UPDATE_20230307.232913.log'.

╷

│ Error: remote-exec provisioner error

│

│ with module.vmscale-gcp[0].null_resource.cluster_config,

│ on ../modules/sycomp-vmscale-gcp/main.tf line 888, in resource "null_resource" "cluster_config":

│ 888: provisioner "remote-exec" {

│

│ error executing "/tmp/terraform_375047149.sh": Process exited with status 2

Second run (after fixing the error in the remote script) - remote-exec completes, yet the data sources fail. The ‘missing’ files are very much there, as the debug output further below shows.


module.vmscale-gcp[0].null_resource.cluster_config (local-exec): Packing up scale backend config dirs from '/Users/neetagarimella/syscale' using the following tar command.

module.vmscale-gcp[0].null_resource.cluster_config (local-exec): tar czf /Users/neetagarimella/syscale/TF/tmp/scale-config-load-VERSION_STRING.tar.gz tools lib

module.vmscale-gcp[0].null_resource.cluster_config: Provisioning with 'file'...

module.vmscale-gcp[0].null_resource.cluster_config: Provisioning with 'file'...

module.vmscale-gcp[0].null_resource.cluster_config: Provisioning with 'file'...

module.vmscale-gcp[0].null_resource.cluster_config: Provisioning with 'remote-exec'...

module.vmscale-gcp[0].null_resource.cluster_config (remote-exec): Connecting to remote host via SSH...

module.vmscale-gcp[0].null_resource.cluster_config (remote-exec): Host: 34.145.97.171

module.vmscale-gcp[0].null_resource.cluster_config (remote-exec): User: sycompacct

module.vmscale-gcp[0].null_resource.cluster_config (remote-exec): Password: false

module.vmscale-gcp[0].null_resource.cluster_config (remote-exec): Private key: true

module.vmscale-gcp[0].null_resource.cluster_config (remote-exec): Certificate: false

module.vmscale-gcp[0].null_resource.cluster_config (remote-exec): SSH Agent: true

module.vmscale-gcp[0].null_resource.cluster_config (remote-exec): Checking Host Key: false

module.vmscale-gcp[0].null_resource.cluster_config (remote-exec): Target Platform: unix

module.vmscale-gcp[0].null_resource.cluster_config (remote-exec): Connected!

module.vmscale-gcp[0].data.local_file.scaleConfigStatus: Still reading... [10s elapsed]

module.vmscale-gcp[0].data.local_file.scaleConfigOutput: Still reading... [10s elapsed]

module.vmscale-gcp[0].null_resource.cluster_config: Still creating... [10s elapsed]

module.vmscale-gcp[0].data.local_file.scaleConfigStatus: Still reading... [20s elapsed]

module.vmscale-gcp[0].data.local_file.scaleConfigOutput: Still reading... [20s elapsed]

module.vmscale-gcp[0].null_resource.cluster_config: Still creating... [20s elapsed]

module.vmscale-gcp[0].data.local_file.scaleConfigOutput: Still reading... [30s elapsed]

module.vmscale-gcp[0].data.local_file.scaleConfigStatus: Still reading... [30s elapsed]

module.vmscale-gcp[0].null_resource.cluster_config: Still creating... [30s elapsed]

module.vmscale-gcp[0].null_resource.cluster_config: Provisioning with 'local-exec'...

module.vmscale-gcp[0].null_resource.cluster_config (local-exec): Executing: ["/bin/bash" "-c" " scp -i ~/sycompacct-key -o LogLevel=error -o StrictHostKeyChecking=no -o \"UserKnownHostsFile=/dev/null\" sycompacct@34.145.97.171:/home/sycompacct/stdout /Users/neetagarimella/syscale/TF/tmp/;\n scp -i ~/sycompacct-key -o LogLevel=error -o StrictHostKeyChecking=no -o \"UserKnownHostsFile=/dev/null\" sycompacct@34.145.97.171:/home/sycompacct/exitstatus /Users/neetagarimella/syscale/TF/tmp/;\n sync; sleep 1;\n"]

module.vmscale-gcp[0].null_resource.cluster_config: Creation complete after 37s [id=2508484827329135134]

╷

│ Error: Read local file data source error

│

│ with module.vmscale-gcp[0].data.local_file.scaleConfigStatus,

│ on ../modules/sycomp-vmscale-gcp/data-sources.tf line 51, in data "local_file" "scaleConfigStatus":

│ 51: data "local_file" "scaleConfigStatus" {

│

│ The file at given path cannot be read.

│

│ +Original Error: open /Users/neetagarimella/syscale/TF/tmp/exitstatus: no such file or directory

╷

│ Error: Read local file data source error

│

│ with module.vmscale-gcp[0].data.local_file.scaleConfigOutput,

│ on ../modules/sycomp-vmscale-gcp/data-sources.tf line 66, in data "local_file" "scaleConfigOutput":

│ 66: data "local_file" "scaleConfigOutput" {

│

│ The file at given path cannot be read.

│

│ +Original Error: open /Users/neetagarimella/syscale/TF/tmp/stdout: no such file or directory

╵

Third run

module.vmscale-gcp[0].data.google_compute_instance.masterNodeData: Read complete after 0s [id=projects/sycomp-storage/zones/us-west1-b/instances/ng2mgmt01]

module.vmscale-gcp[0].google_compute_instance.scale[0]: Refreshing state... [id=projects/sycomp-storage/zones/us-west1-b/instances/ng2scale04]

module.vmscale-gcp[0].null_resource.cluster_config: Refreshing state... [id=2508484827329135134]

module.vmscale-gcp[0].random_uuid.complete: Refreshing state... [id=9584f45b-08d7-61b4-dae2-1c1ea2b8d131]

module.vmscale-gcp[0].data.local_file.scaleConfigOutput: Reading...

module.vmscale-gcp[0].data.local_file.scaleConfigStatus: Reading...

module.vmscale-gcp[0].data.local_file.scaleConfigOutput: Read complete after 0s [id=0dda603dc11e661e305e7d96e8744b35bc5b4d65]

module.vmscale-gcp[0].data.local_file.scaleConfigStatus: Read complete after 0s [id=09d2af8dd22201dd8d48e5dcfcaed281ff9422c7]

Changes to Outputs:

~ scaleConfigLogOutput = {

~ configDuration = "31 Seconds" -> "30 Seconds"

~ configLogFile = "sycomp-scale-UPDATE_20230307.231203.log" -> "sycomp-scale-UPDATE_20230307.232556.log"

~ meterUnits = "scaleStorageTBUnits:3 scaleRhelUnits:5 scaleSDMUnits:0 Install-Time:2023-03-07 23:12:33.459" -> "scaleStorageTBUnits:3 scaleRhelUnits:5 scaleSDMUnits:0 Install-Time:2023-03-07 23:26:25.681"

# (2 unchanged elements hidden)

}

You can apply this plan to save these new output values to the Terraform state, without changing any real infrastructure.

Do you want to perform these actions in workspace "ng2-E1"?

Terraform will perform the actions described above.

Only 'yes' will be accepted to approve.

Enter a value:

Apply complete! Resources: 0 added, 0 changed, 0 destroyed.

Outputs:

The second run (the one where the data sources fail) was also captured with debug output enabled.
Note that an 'ls -l' of the supposedly missing files was added to the local-exec.

module.vmscale-gcp[0].null_resource.cluster_config: Provisioning with 'remote-exec'...

2023-03-07T15:31:53.293-0800 [INFO] using private key for authentication

module.vmscale-gcp[0].null_resource.cluster_config (remote-exec): Connecting to remote host via SSH...

module.vmscale-gcp[0].null_resource.cluster_config (remote-exec): Host: 34.145.97.171

module.vmscale-gcp[0].null_resource.cluster_config (remote-exec): User: sycompacct

module.vmscale-gcp[0].null_resource.cluster_config (remote-exec): Password: false

module.vmscale-gcp[0].null_resource.cluster_config (remote-exec): Private key: true

module.vmscale-gcp[0].null_resource.cluster_config (remote-exec): Certificate: false

module.vmscale-gcp[0].null_resource.cluster_config (remote-exec): SSH Agent: true

module.vmscale-gcp[0].null_resource.cluster_config (remote-exec): Checking Host Key: false

module.vmscale-gcp[0].null_resource.cluster_config (remote-exec): Target Platform: unix

2023-03-07T15:31:53.294-0800 [DEBUG] Connecting to 34.145.97.171:22 for SSH

2023-03-07T15:31:53.326-0800 [DEBUG] Connection established. Handshaking for user sycompacct

2023-03-07T15:31:53.592-0800 [DEBUG] Telling SSH config to forward to agent

2023-03-07T15:31:53.592-0800 [DEBUG] Setting up a session to request agent forwarding

2023-03-07T15:31:53.683-0800 [INFO] agent forwarding enabled

module.vmscale-gcp[0].null_resource.cluster_config (remote-exec): Connected!

2023-03-07T15:31:53.683-0800 [DEBUG] starting ssh KeepAlives

2023-03-07T15:31:53.683-0800 [DEBUG] opening new ssh session

2023-03-07T15:31:53.718-0800 [DEBUG] Starting remote scp process: 'scp' -vt /tmp

2023-03-07T15:31:53.752-0800 [DEBUG] Started SCP session, beginning transfers...

2023-03-07T15:31:53.752-0800 [DEBUG] Beginning file upload...

2023-03-07T15:31:53.799-0800 [DEBUG] SCP session complete, closing stdin pipe.

2023-03-07T15:31:53.800-0800 [DEBUG] Waiting for SSH session to complete.

2023-03-07T15:31:53.834-0800 [ERROR] scp stderr: "Sink: C0644 219 terraform_1170819954.sh\n"

2023-03-07T15:31:53.834-0800 [DEBUG] opening new ssh session

2023-03-07T15:31:53.900-0800 [DEBUG] starting remote command: chmod 0777 /tmp/terraform_1170819954.sh

2023-03-07T15:31:53.987-0800 [DEBUG] remote command exited with '0': chmod 0777 /tmp/terraform_1170819954.sh

2023-03-07T15:31:53.987-0800 [DEBUG] opening new ssh session

2023-03-07T15:31:54.050-0800 [DEBUG] starting remote command: /tmp/terraform_1170819954.sh

module.vmscale-gcp[0].data.local_file.scaleConfigOutput: Still reading... [10s elapsed]

module.vmscale-gcp[0].data.local_file.scaleConfigStatus: Still reading... [10s elapsed]

module.vmscale-gcp[0].null_resource.cluster_config: Still creating... [10s elapsed]

module.vmscale-gcp[0].data.local_file.scaleConfigStatus: Still reading... [20s elapsed]

module.vmscale-gcp[0].data.local_file.scaleConfigOutput: Still reading... [20s elapsed]

module.vmscale-gcp[0].null_resource.cluster_config: Still creating... [20s elapsed]

module.vmscale-gcp[0].data.local_file.scaleConfigOutput: Still reading... [30s elapsed]

module.vmscale-gcp[0].data.local_file.scaleConfigStatus: Still reading... [30s elapsed]

module.vmscale-gcp[0].null_resource.cluster_config: Still creating... [30s elapsed]

2023-03-07T15:32:25.159-0800 [DEBUG] remote command exited with '0': /tmp/terraform_1170819954.sh

2023-03-07T15:32:25.159-0800 [DEBUG] opening new ssh session

2023-03-07T15:32:25.194-0800 [DEBUG] Starting remote scp process: 'scp' -vt /tmp

2023-03-07T15:32:25.230-0800 [DEBUG] Started SCP session, beginning transfers...

2023-03-07T15:32:25.230-0800 [DEBUG] Copying input data into temporary file so we can read the length

2023-03-07T15:32:25.251-0800 [DEBUG] Beginning file upload...

2023-03-07T15:32:25.293-0800 [DEBUG] SCP session complete, closing stdin pipe.

2023-03-07T15:32:25.293-0800 [DEBUG] Waiting for SSH session to complete.

2023-03-07T15:32:25.329-0800 [ERROR] scp stderr: "Sink: C0644 0 terraform_1170819954.sh\n"

module.vmscale-gcp[0].null_resource.cluster_config: Provisioning with 'local-exec'...

module.vmscale-gcp[0].null_resource.cluster_config (local-exec): Executing: ["/bin/bash" "-c" " scp -i ~/sycompacct-key -o LogLevel=error -o StrictHostKeyChecking=no -o \"UserKnownHostsFile=/dev/null\" sycompacct@34.145.97.171:/home/sycompacct/stdout /Users/neetagarimella/syscale/TF/tmp/;\n scp -i ~/sycompacct-key -o LogLevel=error -o StrictHostKeyChecking=no -o \"UserKnownHostsFile=/dev/null\" sycompacct@34.145.97.171:/home/sycompacct/exitstatus /Users/neetagarimella/syscale/TF/tmp/;\n ls -l /Users/neetagarimella/syscale/TF/tmp/ ;\n"]

module.vmscale-gcp[0].null_resource.cluster_config (local-exec): total 192

module.vmscale-gcp[0].null_resource.cluster_config (local-exec): -rw-r--r-- 1 neetagarimella staff 2 Mar 7 15:32 exitstatus

module.vmscale-gcp[0].null_resource.cluster_config (local-exec): -rw-r--r-- 1 neetagarimella staff 81222 Mar 7 15:31 scale-config-load-VERSION_STRING.tar.gz

module.vmscale-gcp[0].null_resource.cluster_config (local-exec): -rwxr-xr-x 1 neetagarimella staff 859 Mar 7 10:14 scaleParams.json

module.vmscale-gcp[0].null_resource.cluster_config (local-exec): -rwxr-xr-x 1 neetagarimella staff 1404 Mar 7 14:15 scaleParams_E1.json

module.vmscale-gcp[0].null_resource.cluster_config (local-exec): -rw-r--r-- 1 neetagarimella staff 717 Mar 7 15:32 stdout

module.vmscale-gcp[0].null_resource.cluster_config: Creation complete after 34s [id=4225292809342519822]

2023-03-07T15:32:26.472-0800 [DEBUG] provider.stdio: received EOF, stopping recv loop: err="rpc error: code = Unavailable desc = error reading from server: EOF"

2023-03-07T15:32:26.473-0800 [DEBUG] provider: plugin process exited: path=.terraform/providers/registry.terraform.io/hashicorp/null/3.2.1/darwin_amd64/terraform-provider-null_v3.2.1_x5 pid=64028

2023-03-07T15:32:26.473-0800 [DEBUG] provider: plugin exited

2023-03-07T15:32:26.473-0800 [DEBUG] states/remote: state read serial is: 23; serial is: 23

2023-03-07T15:32:26.473-0800 [DEBUG] states/remote: state read lineage is: 69fdf268-7ba3-9617-3a34-c14920b16402; lineage is: 69fdf268-7ba3-9617-3a34-c14920b16402

╷

│ Error: Read local file data source error

│

│ with module.vmscale-gcp[0].data.local_file.scaleConfigStatus,

│ on ../modules/sycomp-vmscale-gcp/data-sources.tf line 51, in data "local_file" "scaleConfigStatus":

│ 51: data "local_file" "scaleConfigStatus" {

│

│ The file at given path cannot be read.

│

│ +Original Error: open /Users/neetagarimella/syscale/TF/tmp/exitstatus: no such file or directory

╵

╷

│ Error: Read local file data source error

│

│ with module.vmscale-gcp[0].data.local_file.scaleConfigOutput,

│ on ../modules/sycomp-vmscale-gcp/data-sources.tf line 66, in data "local_file" "scaleConfigOutput":

│ 66: data "local_file" "scaleConfigOutput" {

│

│ The file at given path cannot be read.

│

│ +Original Error: open /Users/neetagarimella/syscale/TF/tmp/stdout: no such file or directory

╵

2023-03-07T15:32:27.241-0800 [DEBUG] provider.stdio: received EOF, stopping recv loop: err="rpc error: code = Unavailable desc = error reading from server: EOF"

2023-03-07T15:32:27.242-0800 [DEBUG] provider: plugin process exited: path=.terraform/providers/registry.terraform.io/hashicorp/local/2.3.0/darwin_amd64/terraform-provider-local_v2.3.0_x5 pid=64025

2023-03-07T15:32:27.242-0800 [DEBUG] provider: plugin exited

Please review the deployment results above and save them for future reference.

The file terraform_1170819954.sh exists on the remote node, though it is 0 bytes.

[root@ng2mgmt01 tools]# ls -l /tmp/terraform_1170819954.sh
-rwxrwxrwx 1 sycompacct sycompacct 0 Mar  7 23:32 /tmp/terraform_1170819954.sh
[root@ng2mgmt01 tools]#

Hi @ngarimella,

The original “received EOF” log line you are trying to correlate with the problem is not an error; it is just a debug line from the plugin infrastructure when the plugin process exits.

I can’t really tell what’s going on here without a complete reproducible example, but because this takes multiple runs to converge and you are getting errors from the local_file data source, it sounds like you may have problems from using a data source where it’s not appropriate.

A data source is optimally read during the plan, and only if an explicit dependency has changes is it deferred until apply. For local_file, if the file is not present at the moment it is read (during plan or apply), it is going to fail.
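For illustration, here is a minimal sketch of that deferral rule (hypothetical names and paths, not your module):

resource "null_resource" "creator" {
  # Writes the file that the data source below wants to read.
  provisioner "local-exec" {
    command = "mkdir -p ${path.module}/tmp && echo 0 > ${path.module}/tmp/exitstatus"
  }
}

data "local_file" "status" {
  filename = "${path.module}/tmp/exitstatus"

  # The read is deferred to apply only while "creator" has pending changes.
  # Once creator already exists in state with no changes, this data source is
  # read at plan time, so the file must already be on disk or the read fails.
  depends_on = [null_resource.creator]
}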

Thanks for the prompt response. The problem is very consistently reproducible. The steps are (as shown in the original post):

  1. The remote-exec script fails; no local files are created and the data blocks are not read, as shown in the log. This is working as expected on the error path.
  2. Fix the error in the script; everything completes successfully, including the scp of the files that are read in the data blocks. However, TF fails to read the files.
  3. Run again, and TF completes successfully.

The expected behavior is that step 2 completes without any error, requiring no third run.

If step 1 is skipped, i.e. there is no error on the first (clean-state) TF run, everything completes successfully. It seems like TF has some stale data/bug from the run in step 1 that interferes with the next run.

Hope that helps to clarify.

I’m also wondering what the ‘scp stderr’ error in the debug output means?

One thing that stands out to me as odd here is that the second run doesn’t seem to mention random_uuid.complete at all, but I would’ve expected to see that being mentioned as “Creating…” after null_resource.cluster_config was complete and before Terraform started working on data.local_file.scaleConfigStatus.

The “third run” output mentions refreshing it, so apparently it was created at some point but I don’t see anything mentioning that it was created.

Is it possible that random_uuid.complete already existed before you started this process?

That is also not an error (it’s just verbose scp output), but perhaps it is being printed because of another error and it is in the stderr stream.

Good catch… maybe that is the problem.
I just did a successful run (i.e. no error) and the uuid is created as expected; more details below. So maybe it’s getting confused when the same sequence happens after an error has occurred. That’s really why it seems like a bug.

module.vmscale-gcp[0].null_resource.cluster_config: Provisioning with 'local-exec'...
module.vmscale-gcp[0].null_resource.cluster_config (local-exec): Executing: ["/bin/bash" "-c" "      scp -i ~/sycompacct-key -o LogLevel=error -o StrictHostKeyChecking=no -o \"UserKnownHostsFile=/dev/null\" sycompacct@35.227.155.71:/home/sycompacct/stdout /Users/neetagarimella/syscale/TF/tmp/;\n      scp -i ~/sycompacct-key -o LogLevel=error -o StrictHostKeyChecking=no -o \"UserKnownHostsFile=/dev/null\" sycompacct@35.227.155.71:/home/sycompacct/exitstatus /Users/neetagarimella/syscale/TF/tmp/;\n      ls -l /Users/neetagarimella/syscale/TF/tmp/ ;\n"]
module.vmscale-gcp[0].null_resource.cluster_config (local-exec): total 184
module.vmscale-gcp[0].null_resource.cluster_config (local-exec): -rw-r--r--  1 neetagarimella  staff      2 Mar  8 09:53 exitstatus
module.vmscale-gcp[0].null_resource.cluster_config (local-exec): -rw-r--r--  1 neetagarimella  staff  81222 Mar  8 09:48 scale-config-load-VERSION_STRING.tar.gz
module.vmscale-gcp[0].null_resource.cluster_config (local-exec): -rwxr-xr-x  1 neetagarimella  staff    861 Mar  8 09:47 scaleParams.json
module.vmscale-gcp[0].null_resource.cluster_config (local-exec): -rw-r--r--  1 neetagarimella  staff    361 Mar  8 09:53 stdout
module.vmscale-gcp[0].null_resource.cluster_config: Creation complete after 5m39s [id=441795697900981302]
module.vmscale-gcp[0].random_uuid.complete: Creating...
module.vmscale-gcp[0].random_uuid.complete: Creation complete after 0s [id=093e01a0-b501-8def-2aff-cbac523fc8c8]
module.vmscale-gcp[0].data.local_file.scaleConfigStatus: Reading...
module.vmscale-gcp[0].data.local_file.scaleConfigOutput: Reading...
module.vmscale-gcp[0].data.local_file.scaleConfigOutput: Read complete after 0s [id=d372e4dd580a1f5962dd95836b3080768e3b6796]
module.vmscale-gcp[0].data.local_file.scaleConfigStatus: Read complete after 0s [id=09d2af8dd22201dd8d48e5dcfcaed281ff9422c7]

Note: the ‘ls -l’ output was added to see why TF thinks the files are missing in step 2 of the recreate. Maybe some race condition causes it to read the data before random_uuid.complete, or it is as you have alluded.

If I understand correctly, the only unexpected action here is that the data sources failed to read the files created by the null_resource.cluster_config provisioners. That would be explained by there being no relationship between either of the data sources and null_resource.cluster_config. Both local_file.scaleConfigStatus and local_file.scaleConfigOutput are going to be read immediately after random_uuid.complete regardless of the status of null_resource.cluster_config. If random_uuid.complete is already created, then the data sources will be read again during the next plan.

The reason it works sometimes is because you have the local files left over from a prior execution. Having resources depend on local files which are not checked in as part of the overall configuration can often lead to this type of confusing behavior.

@jbardin

That would be explained by there being no relationship between either of the data sources and null_resource.cluster_config.

The above is not true. There is a relationship between the null_resource and the data sources, via random_uuid.complete. Without that it would make no sense and everything would just fall flat.

Additionally, the files referenced in the data sources are removed before a new run is started. This, along with the ‘random_uuid.complete’ dependency, ensures stale files are not interfering.

As @apparentlymart has pointed out, there is some issue with random_uuid.complete handling in that second run, which causes this unexpected behavior. Maybe the uuid is getting created and the data-source dependency is being resolved before the files are available… some race/logic condition triggered by the earlier failure.

There is no user error, else I would not have bothered to post it here. There is also no workaround, since this is an awkward bug and we can’t ship the product like this due to the user impact.
Is there a way to collect more data, beyond what is provided, to help debug this further?

Appreciate your attention and help. Thanks!

I have attached more code from the null_resource showing all of the provisioners.

resource "null_resource" "cluster_config" {
  depends_on = [
    google_compute_attached_disk.scale_attach_newdisks,
    local_file.scale_config_params
  ]
triggers = {
...
 }

 connection {
...
}
  # This provisioner is needed to clean up output files from prior run.

  provisioner "local-exec" {
    
    #Dont use '*' to delete files from localTempDir as it will remove the parameter file that was just generated.
      <<EOC2
      echo Using scale backend config code '${local.tfRootDir}/${local.sycompTarball}'.;
      #delete earlier tarball and other files;
      rm -f ${local.localTempdir}/stdout;
      rm -f ${local.localTempdir}/exitstatus;
    EOC2
    interpreter = ["/bin/bash", "-c"]
  }

  provisioner "file" {
    source      = "${local.tfScriptsDir}/scale-config-launchpad.sh"
    destination = "/home/${var.admin.sshUserName}/scale-config-launchpad.sh"
  }
 
  #
  provisioner "file" {
    source      = (var.usageType == "dev")? "${local.localTempdir}/${local.sycompTarball}" : "${local.tfRootDir}/${local.sycompTarball}"
    destination = "/home/${var.admin.sshUserName}/scale-config-load.tar.gz"
  }

  provisioner "file" {
    source      = "${local.localTempdir}/${local.scaleParmsFile}"
    destination = "/home/${var.admin.sshUserName}/${local.scaleParmsFile}"
  }

  # Using set -o errexit, to force failure if anything below fails. 
  # For 'remote-exec': on_failure applies to the last command in 'inline'
  provisioner "remote-exec" {
    inline    = [
        "set -o errexit",
        "chmod +x  /home/${var.admin.sshUserName}/scale-config-launchpad.sh",
        "/bin/sudo /home/${var.admin.sshUserName}/scale-config-launchpad.sh ${local.masterNodeFEIP} ${var.clusterDeployMode} ${local.clusterExpCounter} /home/${var.admin.sshUserName}/${local.scaleParmsFile} >/home/${var.admin.sshUserName}/stdout",

        # ---- Not needed after all, since 'tee' is no longer used to redirect stdout to a file as well as the terminal. -----
        #      'tee spook': use of 'tee' was causing 'remote-exec' to lose the real RC, which set off a wild chase
        #      for a solution and led to dropping 'tee'. Now stdout is parsed to display the data to the end user via TF outputs,
        #      so there is no need to capture it here.
        #
        #      This comment block captures the essence of a half-day of deep TF learning, so it is left here to avoid
        #      the trap in the future.
        # ------------ Handy note on TF -------
        # Can't simply do "echo $? > <tmpdir>/exitstatus" because:
        #  - it masks the RC from scale-config-launchpad.sh and does not force an error for the provisioner.
        # Instead, scale-config-launchpad.sh creates /home/${var.admin.sshUserName}/exitstatus, which is used to pull the RC for
        # this provisioner block.
        #
        # A 'cool' side effect of this is that any failure (nonzero RC) will force a failure for this provisioner and null_resource.cluster_config.
        # Something that was so much sought after :-)
        # "exit $(cat /home/${var.admin.sshUserName}/exitstatus)"
    ]
    on_failure = fail    # OR continue
  }

  # Return code file 'exitstatus' on master is created by scale-config-launchpad.sh; 'stdout' was generated in 'remote_exec'
  # Gather both those file locally for further processing of the output.
  provisioner "local-exec" {
    command =  <<EOC
      scp -i ${var.admin.sshPrivatekeyPath} -o LogLevel=error -o StrictHostKeyChecking=no -o "UserKnownHostsFile=/dev/null" ${self.triggers.user}@${self.triggers.host}:/home/${self.triggers.user}/stdout ${local.localTempdir}/;
      scp -i ${var.admin.sshPrivatekeyPath} -o LogLevel=error -o StrictHostKeyChecking=no -o "UserKnownHostsFile=/dev/null" ${self.triggers.user}@${self.triggers.host}:/home/${self.triggers.user}/exitstatus ${local.localTempdir}/;
      ls -l ${local.localTempdir}/ ;
    EOC
    interpreter = ["/bin/bash", "-c"]
  } 
}

Ah you’re right, I misread the original partial config with the odd closing braces.

While there is a transitive dependency for overall ordering, there is still the issue that the data source will only be deferred in the case of random_uuid.complete having changes. If random_uuid.complete exists in the state, it is not going to change again regardless of what null_resource.cluster_config is doing, so the data sources are not going to be deferred again.

(Side note: the precondition is also questionable; the condition given can never be triggered, because that provider is never going to return a null value for result.)
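For example, a check that can actually fail would be a postcondition on the data source itself, validating the content the remote script wrote; a minimal sketch, assuming the script writes its exit code into the exitstatus file:

data "local_file" "scaleConfigStatus" {
  filename   = "${local.localTempdir}/exitstatus"
  depends_on = [random_uuid.complete]

  lifecycle {
    # Unlike "result != null", this condition can actually be false:
    # it fails when the remote script recorded a non-zero exit code.
    postcondition {
      condition     = trimspace(self.content) == "0"
      error_message = "Scale config did not finish successfully"
    }
  }
}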

Creating a small standalone example would definitely help here if it’s possible. The data sources relying on artifacts created during apply and being read at the incorrect time is still the most likely culprit for what you are seeing.

Well… I will try to find time for a recreate next week, or at least do the first run (error case) with debug to see if it provides more clues.

@apparentlymart @jbardin
Finally, the mystery is resolved.

It is the case where the script in the null_resource had once completed successfully (that is, it had failed to detect an error due to a bug). After the fix was applied and the null_resource was forced to rerun (via a parameter that updates null_resource.triggers), it failed correctly, but Terraform went on to read the data source files since random_uuid.complete had already been created from the previously completed run, as alluded to by @apparentlymart.

The problem is resolved by forcing the random_uuid to be recreated whenever the null_resource is recreated, simply by setting keepers to null_resource.cluster_config.triggers.

resource "random_uuid" "complete" {
  depends_on = [
    null_resource.cluster_config
  ]
  keepers = null_resource.cluster_config.triggers
}

On my merry way again :slight_smile:

Thanks for everyone’s attention.