Dealing with large amounts of node churn

Our deploy process in k8s currently rolls all of our deployments and statefulsets at once. Each of these pods has a consul sidecar container that registers the pod as a node in consul. So when we deploy, there’s huge churn in members leaving/joining (about 6000 in 30-45 minutes), and we start to see weird behavior in consul.
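For reference, each sidecar runs a Consul client agent with roughly this configuration (the node name, datacenter, and join address below are placeholders, not our exact values):

```hcl
# Sketch of the per-pod Consul client agent config (illustrative values).
node_name          = "pod-name-placeholder"   # set from the pod name
datacenter         = "dc1"                    # placeholder
data_dir           = "/consul/data"
retry_join         = ["consul-server.consul.svc.cluster.local"]  # placeholder address
leave_on_terminate = true                     # agent gracefully leaves on SIGTERM
```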

We see:

  • Weird Consul join/leave patterns. We have `leave_on_terminate` set to true, but the logs on the consul-server leader show this sequence: `EventMemberFailed` -> `memberlist: Conflicting address` -> `EventMemberJoin` (for the pod’s new IP, as expected) -> `deregistering member: member=xxxx reason=left` (this deregistered the new pod, even though it was up and running fine and never left) -> `EventMemberJoin` (no new container; the pod just joins again because it had been deregistered earlier for some reason?)

  • Phantom joins in Consul, where consul-server logs an `EventMemberJoin` for a pod+IP combo long after that pod+IP had been destroyed.

There’s plenty of CPU/RAM headroom on the server, so resource contention doesn’t appear to be an issue. GC time isn’t significantly increasing, CPU never goes higher than 30%, and there’s nothing notable in the logs.

Is there anything I can tweak in the consul server config to make it deal with this large churn better?
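In case it helps frame answers, the knobs I’ve been eyeing in the agent-configuration docs are the failed-node reap settings. The values below are illustrative assumptions from the docs, not what we currently run:

```hcl
# Hypothetical server config fragment -- illustrative values only.
# reconnect_timeout controls how long a failed member is kept in the
# member list before being reaped (default 72h, documented minimum 8h).
reconnect_timeout = "8h"
```

I’m not sure whether shortening the reap window would help or just make the churn worse, hence the question.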