Failed to instantiate provider to obtain schema. Looking for tips to diagnose the problem

Upon running terraform plan I’m receiving these errors:

Error: Failed to instantiate provider “archive” to obtain schema: Unrecognized remote plugin message:

This usually means that the plugin is either invalid or simply
needs to be recompiled to support the latest protocol.

Error: Failed to instantiate provider “google” to obtain schema: Unrecognized remote plugin message:

This usually means that the plugin is either invalid or simply
needs to be recompiled to support the latest protocol.

Error: Failed to instantiate provider “random” to obtain schema: Unrecognized remote plugin message:

This usually means that the plugin is either invalid or simply
needs to be recompiled to support the latest protocol.

The error message is a little vague, which makes me think it’s masking the real problem. I turned on detailed logging (export TF_LOG=TRACE) and now I see:

2020/10/01 14:18:52 [TRACE] LoadSchemas: retrieving schema for provider type "random"
2020-10-01T14:18:52.973Z [DEBUG] plugin: plugin process exited: path=/root/.terraform.d/plugins/linux_amd64/terraform-provider-google_v3.37.0_x5 pid=1068
2020-10-01T14:18:52.973Z [INFO]  plugin: configuring client automatic mTLS
2020-10-01T14:18:53.007Z [DEBUG] plugin: starting plugin: path=/root/.terraform.d/plugins/linux_amd64/terraform-provider-random_v2.3.0_x4 args=[/root/.terraform.d/plugins/linux_amd64/terraform-provider-random_v2.3.0_x4]
2020-10-01T14:18:53.008Z [DEBUG] plugin: plugin started: path=/root/.terraform.d/plugins/linux_amd64/terraform-provider-random_v2.3.0_x4 pid=1075
2020-10-01T14:18:53.009Z [DEBUG] plugin: waiting for RPC address: path=/root/.terraform.d/plugins/linux_amd64/terraform-provider-random_v2.3.0_x4
2020-10-01T14:18:53.032Z [ERROR] plugin.terraform-provider-random_v2.3.0_x4: plugin init error: error="listen unix /tmp/plugin155583172: bind: input/output error" timestamp=2020-10-01T14:18:53.032Z
2020/10/01 14:18:53 [TRACE] LoadSchemas: retrieving schema for provider type "terraform"
2020-10-01T14:18:53.035Z [DEBUG] plugin: plugin process exited: path=/root/.terraform.d/plugins/linux_amd64/terraform-provider-random_v2.3.0_x4 pid=1075

So my underlying error seems to be

listen unix /tmp/plugin155583172: bind: input/output error

Haven’t got a clue where to go from here. Pasting that error into Google hasn’t gotten me very far, so I’m wondering if anyone here might be able to help.

I don’t know the root cause here, but I have figured out how to solve it. My Terraform code is in a git repo hosted on our GitLab instance. It was cloned to /Users/username/git/gitlabinstancename/group/repo, and the directory from which terraform plan was being run was a few directories deeper than that.

When I cloned the repo to /tmp everything started working. As I said, I don’t understand the root cause of the problem however I know that this fixed it so thought it was worth sharing.

Basically it’s an issue with the location to which I cloned the repo, and I suspect that the depth of the directory hierarchy from which I was running the command was contributing to the problem.


You saved me a ton of time. I’m so glad you decided to put this out for me to find. From what I can tell the problem is related to provider logging (I don’t think it is isolated to a single provider but probably the logging library used by the provider). I haven’t continued to dig, but it looks like a long enough path for the tempfile used will cause the provider to implode. You can repro by setting the TMPDIR env var to some deeply nested path.


Did a bit more digging and it may not be logging related but IPC related. I think tf and the plugin communicate using a file opened using ioutil.TempFile("", "plugin") somewhere and this is hitting syscall: support unix-domain socket pathname longer than system defined value · Issue #6895 · golang/go · GitHub. It would be nice if they bubbled this error up a bit better, but this is my best guess atm.

Hello! I didn’t see this topic the first time around so I guess I’ll reply to it now :grimacing:

During startup of a provider plugin there is a handshake/negotiation where the plugin starts a server at some endpoint that Terraform should be able to reach and then reports that address back to Terraform as part of a specially-formatted line written to its stdout.

When running on a Unix system, providers use Unix domain sockets rather than TCP sockets (which is what they’d use on e.g. Windows) and so it seems like what’s happening here is that for some reason the creation of the socket is failing and thus the plugin is failing with an error before it emits the handshake. Since there is not yet any real communication channel open between the plugin and Terraform, that error message is just written out to stderr and thus ends up in the logs, because stderr serves as the logging channel.

With that said then, I think the issue you linked could indeed be related. Another cause I can imagine is the current temporary directory being on a filesystem that isn’t capable of storing Unix domain sockets, such as a network filesystem or a non-Unix filesystem like FAT.

It isn’t clear to me why the location of the source Terraform configuration would be important here; I’d expect it to be the location of the temporary directory that would have the most impact.

With Terraform as it currently stands then, in order for this to work correctly you’ll need to make sure that temporary files will get created in a location where it’s possible to create Unix domain sockets, within all of the constraints imposed by your host OS. If the OS imposes length limits on a socket path (which I think is typically true, since struct sockaddr_un used at the system call level usually specifies it as a fixed-length char array) then the temporary directory would need to be set to one that leads to the plugin generating a sufficiently-short path after concatenating the generated socket filename.


Thanks for taking the time to dig into this @apparentlymart!

Do you think there’s enough information available to Terraform to be able to differentiate between the schema loading error and this socket creation error? I’m not familiar enough with the plugin startup process myself, but it would be useful information to have if one were to hit this error. Granted, it may be something that one wouldn’t run into often, but when you do it’s a bit obnoxious to figure out the root problem.

From Terraform’s perspective here, all it knows is:

  • The plugin didn’t produce a valid handshake message.
  • The plugin wrote something to its logs.

Although we as humans can see that the message in the logs is related to the startup error, that’s only because we understand English and can interpret the content of the message. From Terraform’s perspective, it’s an opaque string that could just as easily be some useless startup noise – several providers end up emitting logging during their startup before the handshake message, and so it would not be correct to assume that just because there was logging that it is related to the error.

This isn’t something that has come up very much – the two cases in this topic are the only examples I’ve ever heard of – so honestly I don’t really see us justifying the time it would take to introduce some way for a plugin to return an explicit machine-interpretable error as part of the handshake; although I would like to have it in an ideal world, the opportunity cost of implementing that vs. something else seems hard to justify for how infrequently this seems to arise.

One alternative path to consider here is to make the plugins try to fall back on creating a TCP socket instead of a Unix socket if the Unix socket setup fails, which might avoid the error entirely and make the process actually work. That would be a change on the plugin side rather than the Terraform Core side, so it’d be driven by changes in the SDK and then rolled out gradually to different providers over time as they make new releases. The cost/benefit of doing that also seems questionable to be honest, but at least this is something that can be implemented only on the plugin side of the protocol, without any changes to the protocol itself and thus without the need to carefully coordinate a protocol change.

Thanks for taking the time to talk through it. I figured it might be a complicated situation. Do you think this would benefit from a longer or more open discussion in a GH issue?

Aside from that, I’m not sure how best to warn or make this problem more discoverable for future me or anyone else that runs into this. This discussion might be enough especially now that it’s clear what the problem was (at least in my mind I think it’s clear) and there’s a solution.

My instinct would be to wait and see if this comes up again. Although I do understand that it’s frustrating to encounter an error and not get any feedback about what caused it, this is a problem that so far has arisen only twice in the life of Terraform (as far as we can see), is caused by something that seems to be a pretty unusual system configuration, and in both cases it would’ve been possible to resolve it quickly with help from a Terraform developer. It was only that I didn’t see this topic when it was originally posted that I didn’t answer sooner with the same response.

Hopefully now that this topic is here it’ll attract searches in case someone discovers this again in the future, and if we do find it’s more common than it first appears (e.g. because others have encountered it but didn’t speak up about it, but will now be prompted to do so by this topic) then we can re-evaluate the priority of it.

Thanks again for drawing attention to this! We’ll keep a look out for other similar problems that might change the tradeoff here.