Terraform resources not tracked after failure, even when successfully deployed

Hi, hopefully someone can shed some light on an issue that I see every now and then, and which has become more pressing after an Azure network deployment (using Terraform Cloud) failed towards the end. It may be that there’s no ‘fix’ as such, as my guess is that this isn’t a bug. But for my own curiosity, I’d like to understand what is happening.

Every now and then, following an error during deployment, resources that Terraform deployed successfully don’t appear in state. So when we rerun the ‘apply’, TF reports that it needs to create even the resources that are actually in Azure and operational. This doesn’t happen on every failure. Most of the time, if a TF apply fails (maybe because of some weird Azure race condition or some other transient problem), any resources that were successfully deployed are in state and all is good when we re-run.

We have seen this recently on an Azure network deployment. 90% of it completed before it errored. Each successfully deployed resource returned the usual message of “..creation complete..” etc. And when we look in the Azure portal, they look fine.

But they aren’t in state. When we run the plan again, all the recent changes from the root module come back as needing to be created.

The specific reason for the error on this network deployment could well be related to how big it is. It’s big (too big), and we’re going through a process to split it out into more manageable chunks. But I have seen this behaviour on other, much smaller deployments. It’s almost as if TF either removes the entry in state for certain errors or doesn’t actually write it to the state file in the first place (I always assumed the write happened at the point you see the log entry xxx: Creation complete after xx seconds...).

So I guess the question is why this happens, so that I can understand it better, rather than how to fix it, as I don’t believe it is a bug. For the network deployment that triggered my curiosity, I’ll have to do some manual importing into state, but I’m really keen to understand what happens. Maybe we can do something to help prevent it happening in future.

Thanks in advance.

How is the apply failing? Timeout on creation?
I’d probably look at increasing the timeouts for certain types of operation on a given resource. In terms of fixing this after it happens, the first thing I’d try would be to import the resource, as you’ve mentioned.
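
As a rough sketch of both suggestions, something like the following might work. Everything specific here is an assumption for illustration: the resource type (azurerm_virtual_network), the names, the address space, the timeout values and the placeholder subscription ID are not taken from your deployment. The import block also assumes Terraform 1.5 or later.

```hcl
# Hypothetical resource: names, location and address space are placeholders.
resource "azurerm_virtual_network" "example" {
  name                = "example-vnet"
  location            = "westeurope"
  resource_group_name = "example-rg"
  address_space       = ["10.0.0.0/16"]

  # Allow slow Azure operations more time before the provider gives up.
  # The values are illustrative, not recommendations.
  timeouts {
    create = "60m"
    update = "60m"
    delete = "60m"
  }
}

# Adopt a resource that already exists in Azure but is missing from state.
# The id is the full Azure resource ID; the subscription ID below is a placeholder.
import {
  to = azurerm_virtual_network.example
  id = "/subscriptions/00000000-0000-0000-0000-000000000000/resourceGroups/example-rg/providers/Microsoft.Network/virtualNetworks/example-vnet"
}
```

With an import block in place, the next plan should list the resource as an import rather than a create, and the block can be removed once the apply has gone through. On versions before 1.5, the terraform import CLI command takes the same resource address and ID.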