Removing "-once": worried about the entire site restarting at once?

We currently run envconsul with the -once flag, so it starts the app only once and does not restart it when the Consul data changes.

There were reasons for this when it was introduced a few years ago, but most of them have since been addressed.

The consequence is that engineers have to remember to restart the relevant services whenever Consul values change. In practice, engineers don’t always remember to restart every service in every cluster, so problems can surface much later and be hard to correlate with a specific change.

I’m trying to push to remove the -once flag and let envconsul do its magic.

The pushback is that this would cause the entire SaaS service (over 150 services, over 400 pods in each cluster) to restart all at once whenever a top-level shared key changes, meaning something almost all services use, such as a Kafka cluster, logs, or monitoring.

We are aware of splay, but the expectation is that it can only spread the restarts out so much. It can never guarantee that enough capacity stays up across the whole “chain of processing” to serve requests, even if we extend it to, for instance, 5 minutes or even one hour.
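To make the splay concern concrete, here is a rough simulation sketch. It is not envconsul's actual implementation. It assumes each pod independently picks a uniformly random delay within the splay window and is then unavailable for a fixed restart time. The function names, the 3-pod service, and the 60-second restart time are hypothetical values chosen only for illustration.

```python
import random


def max_concurrent_down(num_pods, splay_s, restart_s, rng):
    """Peak number of pods of one service that are down at the same moment.

    Assumption: every pod waits a uniformly random delay in [0, splay_s]
    and is then unavailable for restart_s seconds (restart_s must be > 0).
    """
    events = []
    for _ in range(num_pods):
        start = rng.uniform(0, splay_s)
        events.append((start, 1))               # pod goes down
        events.append((start + restart_s, -1))  # pod is back up
    # Tuples sort by time first; at equal times -1 sorts before +1,
    # so a pod coming back is counted before another one goes down.
    events.sort()
    peak = current = 0
    for _, delta in events:
        current += delta
        peak = max(peak, current)
    return peak


def full_outage_probability(num_pods, splay_s, restart_s, trials=10_000, seed=0):
    """Estimate how often ALL pods of a service are down simultaneously."""
    rng = random.Random(seed)
    hits = sum(
        max_concurrent_down(num_pods, splay_s, restart_s, rng) == num_pods
        for _ in range(trials)
    )
    return hits / trials


if __name__ == "__main__":
    # Hypothetical service: 3 pods, 60 s to restart, various splay windows.
    for splay in (0, 300, 3600):
        p = full_outage_probability(num_pods=3, splay_s=splay, restart_s=60)
        print(f"splay={splay:>4}s  P(all 3 pods down at once) ~ {p:.3f}")
```

Under these assumptions, a longer splay lowers the chance that a small service loses all its pods at the same time, but because the delays are random the chance never reaches zero, and with 150+ services in the chain the odds that at least one of them is fully down add up.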

How do others use envconsul in such circumstances?