Best way to handle MSK upgrades with zero downtime?

We are using the AWS MSK resources and are wondering what the best way to do upgrades is without tearing down the cluster.
As it stands, we have to do three separate state operations:
1. Set up the new config
2. Migrate to the new config (upgrading the cluster if necessary)
3. Destroy the old config

Is there a more straightforward way to go about this?
We can't just upgrade the config to a new version, because Terraform wants to destroy the old config and create a new one while the old one is still in use by the MSK cluster, and the MSK cluster blocks the delete in that case.

We have run into this pattern elsewhere and are not sure how to handle it more efficiently than this. It takes at least three PRs and plan/apply cycles to do these sorts of upgrades.

Hi @eedgar,

I’m not directly familiar with the Managed Kafka service, so I can’t reply from direct experience, but I did take a look at the documentation for aws_msk_cluster and its associated implementation to see what options it seems to offer.

In the implementation itself I see logic for in-place-upgrading a cluster using UpdateClusterKafkaVersion, which seems to be the documented way to upgrade a cluster without destroying it.

I think that then raises the question of why the provider proposed to totally replace the cluster rather than just upgrade it in-place. I see a special rule to propose replacement only if the new version is older than the currently-selected version, which I assume reflects a restriction in the underlying API that you can only upgrade a cluster in-place, not switch back to an earlier version.

One thing I do notice though is that it’s making that decision about which version is older just by lexical string comparison. That means that this check wouldn’t necessarily get a correct result if the two version numbers aren’t formatted in the same way; I wonder if the particular upgrade you are trying has a pair of version numbers that this rule isn’t matching correctly and is therefore treating as a downgrade rather than an upgrade. If that seems true then I’d suggest opening a bug report about it, although I don’t see anything in the docs about only upgrades being allowed so I can’t confirm whether the rule in the provider is correct or not; I’m just guessing here as to why the upgrade might not have happened as intended.
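To illustrate the kind of mismatch I mean, here is a minimal Go sketch. It is not the provider's code: the version pair "2.8.1" and "2.10.0" is hypothetical (I don't know which versions you are moving between), and compareVersions is a helper I made up for the comparison. It just contrasts a plain string comparison with a component-by-component numeric one.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// compareVersions is a hypothetical helper (not from the AWS provider).
// It compares dotted numeric versions component by component and returns
// -1, 0, or 1. Missing or non-numeric components are treated as 0, which
// is a simplification for this sketch.
func compareVersions(a, b string) int {
	as := strings.Split(a, ".")
	bs := strings.Split(b, ".")
	n := len(as)
	if len(bs) > n {
		n = len(bs)
	}
	for i := 0; i < n; i++ {
		var x, y int
		if i < len(as) {
			x, _ = strconv.Atoi(as[i])
		}
		if i < len(bs) {
			y, _ = strconv.Atoi(bs[i])
		}
		if x < y {
			return -1
		}
		if x > y {
			return 1
		}
	}
	return 0
}

func main() {
	// Hypothetical version pair, chosen only to show the effect.
	oldVersion, newVersion := "2.8.1", "2.10.0"

	// Lexical comparison: "2.1..." sorts before "2.8...", so the newer
	// version looks older.
	fmt.Println("lexical comparison sees a downgrade:", newVersion < oldVersion)

	// Numeric comparison: 10 > 8, so this is correctly seen as an upgrade.
	fmt.Println("numeric comparison sees a downgrade:", compareVersions(newVersion, oldVersion) < 0)
}
```

If your old and new version strings behave like this pair under a plain string comparison, that would be consistent with the guess above, though it wouldn't prove that this is what is happening in your case.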

Thanks for the feedback. I’ll check the previous and next version numbers.
We can open a bug report if that seems to be the issue.
Thanks again.
Eric