Handling provider acceptance test failures when "go test -timeout" is reached

Hi everyone, I maintain the terra-farm/xenorchestra Terraform provider and rely heavily on provider acceptance tests to verify the provider’s functionality. This year the test suite has grown to the point where every run reaches go test’s -timeout threshold, causing the test command to panic. This results in a significant amount of manual work to ensure all the tests have run so that I can merge future PRs.

I’ve documented more background on the issue on the project’s GitHub repo (terraform-provider-xenorchestra#188). I’ve copied the initial issue’s description below, but please read the entire thread to understand the remaining context on the problem.

The current acceptance test suite has grown since the project started. It now comprises 30 xenorchestra_vm resource tests, which often cause the go test command to time out (running TIMEOUT=XXXm make testacc never succeeds on its own). This is becoming a significant drag on development velocity. In addition, the process listed below is time-consuming and requires that I correctly identify which tests failed or were skipped:

  1. Run test suite via make testacc
  2. Identify which tests failed and which were skipped (as in go test timed out before that test was run). These are indicated by tests that were labeled as PAUSE’ed in go test’s output but never resumed.
  3. Rerun only the failed or skipped tests with TEST='Test1|Test2|Test3|....' make testacc
  4. Repeat 2-3 until the test suite passes
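
As a rough sketch of how step 2 could be automated, the Go program below scans captured go test -v output and lists every top-level test that was started but never reported PASS or SKIP. It assumes the tests call t.Parallel(), so each one prints a "=== RUN" line before the timeout hits. The test names in the sample are hypothetical, and unfinishedTests and rerunPattern are helper names I made up for illustration.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// unfinishedTests returns, in first-seen order, the top-level tests that
// appear in `go test -v` output without a matching PASS or SKIP line.
// That covers failed tests and tests still paused or running at the timeout.
func unfinishedTests(output string) []string {
	var order []string
	seen := map[string]bool{}
	done := map[string]bool{}

	sc := bufio.NewScanner(strings.NewReader(output))
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		if len(fields) < 3 {
			continue
		}
		name := fields[2]
		if strings.Contains(name, "/") {
			continue // subtests are rerun through their parent test
		}
		switch {
		case fields[0] == "===" && fields[1] == "RUN":
			if !seen[name] {
				seen[name] = true
				order = append(order, name)
			}
		case fields[0] == "---" && (fields[1] == "PASS:" || fields[1] == "SKIP:"):
			done[name] = true
		}
	}

	var pending []string
	for _, name := range order {
		if !done[name] {
			pending = append(pending, name)
		}
	}
	return pending
}

// rerunPattern builds an anchored regular expression for go test's -run flag.
func rerunPattern(names []string) string {
	if len(names) == 0 {
		return ""
	}
	return "^(" + strings.Join(names, "|") + ")$"
}

func main() {
	// Hypothetical output from a run that hit the timeout.
	sample := `=== RUN   TestAccVm_create
=== PAUSE TestAccVm_create
=== RUN   TestAccVm_update
=== PAUSE TestAccVm_update
=== RUN   TestAccVm_import
=== PAUSE TestAccVm_import
=== CONT  TestAccVm_create
=== CONT  TestAccVm_update
--- PASS: TestAccVm_create (12.30s)
--- FAIL: TestAccVm_update (40.00s)
panic: test timed out after 1m0s`

	fmt.Println(rerunPattern(unfinishedTests(sample)))
}
```

For the sample above this prints ^(TestAccVm_update|TestAccVm_import)$, which could be passed as the TEST value in step 3. Tests that never reached a "=== RUN" line before the timeout would not be detected by this approach.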

There have been past efforts to improve the test suite quality:

While these have improved the test suite, they do not help with managing it over time as its performance changes (the suite becomes slower, certain tests become problematic, bad tests need to be identified).

The goal of this issue is to allow the test suite to pass by running a single bash command. This will prevent the frequent loop identified above where commands are issued until all acceptance tests have passed.

One possible idea is to explore alternative ways of running the test suite (a shell script or a more fully featured test runner). The gotestsum test runner seems worth exploring. It has built-in support for re-running failed tests and can also identify which tests are slow.

I see two ways of attacking this problem:

  • Address the issue that is causing tests to never finish – I assume something is causing paused tests to deadlock or stall
  • Create a test runner that is capable of interpreting the “panic” output of go test and rerunning the remaining tests to reduce the manual work to run the test suite fully

I’ve started on the second option since my attempts to profile the test suite haven’t been successful. However, I’m seeking guidance on how other provider developers are handling this issue, tips for profiling these tests to find the bottleneck, and any other help you can provide.