Vault operator migrate keeps failing with `context canceled`

Hi,

I’m trying to migrate GCS storage to Raft. For this, I’m using a copy of the real bucket and a standalone (up-to-date) Vault instance, running on a compute instance inside the same GCP project. The next step would be to create a Raft export and reimport it into the real cluster.

No matter which options I set (I especially tested max-parallel from 10 to 3000), the migration always fails after ~6.30 minutes, and always on different versions of the same secret.

I’ve checked the versions and they do not look big (~ 400 bytes).

The error I get is: `Error migrating: failed to scan for children: failed to read object: context canceled`

Also, I tried to restart with the -start option, but I always get an error saying the cluster already has configuration.

It is telling you that a request to GCS took longer than Vault was willing to wait.

The only thing I can think of is to look for timeout settings that make Vault willing to wait longer … however, I cannot see any obvious ones in the documentation, and I have not worked with GCS personally.

It might be interesting to see if you can replicate the slow operation outside of Vault, using a GCS command line client, to see if it is extremely slow in isolation.

I already tried to play with VAULT_CLIENT_TIMEOUT, without success. The strange thing is that it happens at almost the same place: if I log (log level info) the synced files, it always crashes after ~59600 files (10 tests; smallest was 59547, highest 59664), no matter how long it took or how many threads are in use.

I can’t reproduce it (yet?) with the gcloud CLI. I will try to sync the bucket locally and use the file backend.

So the file backend does not use the exact same structure as the gcs backend; back to square one :frowning:

Some more details: I tried switching to the file backend for the destination, as -start seems to work for it.

If I run without -start, it works for some time but fails after ~37.5k files synced. If I then try to restart with -start set to the last synced item, it spends ~4 minutes 20 seconds on the "creating client" step, then FAILS with `Error migrating: failed to scan for children: failed to read object: context canceled` without syncing anything (or syncing only 2 or 3 files).

Raised an issue here: `migrate` from gcs backend is broken (context canceled / timeout) (Issue #22493, hashicorp/vault on GitHub)