When will a pod propagate to Consul with Catalog Sync?

Hi, I’m trying to fully understand when a pod will appear in Consul. We are trying to estimate the time it takes from a deployment being created to it showing up in Consul.

I have a pod that is being added with 1 replica, and takes about 30 seconds to pass readiness checks. It also has an endpoint that gets created immediately. I initially assumed it would show up in consul immediately due to the endpoint being created so quickly, but it actually takes around 40 seconds to show up. At first I thought this was due to the sync interval so I set that to 8s (rather than the default 30s) but that didn’t change anything… Now I’m thinking it waits for the health checks to pass, but that delta still doesn’t line up.

Here’s some analysis I’ve been doing…

I wrote a script to scale to 0, wait until the service is no longer in consul catalog, then scale up and see how long it takes to show up in catalog. Here is the output, side by side with the logs (of which I’ve omitted specific details).

Script output

2021/02/23 18:27:14 Scaled down to 0
2021/02/23 18:28:08 Service removed from consul
2021/02/23 18:28:08 Scaled up to 1
2021/02/23 18:28:10 Found endpoint in K8s Endpoints
2021/02/23 18:28:34 Pod is ready in K8s
2021/02/23 18:28:51 Found service in Consul Catalog
2021/02/23 18:28:51
                  NAME                 |  TIME
---------------------------------------+----------
  Time from replica change to catalog  | 42398ms
  Time from endpoint to catalog        | 40343ms
  Time from container ready to catalog | 16153ms
  Average Request Time (error)         | 185ms

Catalog Sync Logs

2021-02-24T02:27:14.228Z [INFO]  to-consul/source: upsert endpoint: key=my-namespace/my-service
2021-02-24T02:27:16.184Z [INFO]  to-consul/sink: registering services
2021-02-24T02:27:16.185Z [INFO]  to-consul/sink: stopping service watcher: service-name=my-service-my-namespace service-consul-namespace=
2021-02-24T02:27:33.284Z [INFO]  to-consul/sink: registering services
2021-02-24T02:27:49.189Z [INFO]  to-consul/sink: registering services
2021-02-24T02:27:59.089Z [INFO]  to-consul/sink: invalid service found, scheduling for delete: service-name=my-service-my-namespace service-consul-namespace=
2021-02-24T02:28:07.089Z [INFO]  to-consul/sink: registering services
2021-02-24T02:28:07.089Z [INFO]  to-consul/sink: deregistering service: node-name=k8s-sync service-id=my-service-f4dc43e6accd service-consul-namespace=
2021-02-24T02:28:16.587Z [INFO]  to-consul/source: upsert endpoint: key=my-namespace/my-service
2021-02-24T02:28:24.586Z [INFO]  to-consul/sink: registering services
2021-02-24T02:28:34.765Z [INFO]  to-consul/source: upsert endpoint: key=my-namespace/my-service
2021-02-24T02:28:41.879Z [INFO]  to-consul/sink: registering services
2021-02-24T02:28:41.879Z [INFO]  to-consul/sink: starting service watcher: service-name=my-service-my-namespace service-consul-namespace=
2021-02-24T02:28:59.286Z [INFO]  to-consul/sink: registering services
2021-02-24T02:29:16.694Z [INFO]  to-consul/sink: registering services

As you can see, it takes about 40 seconds for it to appear, and 16 seconds to appear even after passing readiness checks. Can anyone help me realize where are these delays coming from?

I’m also getting very inconsistent values. I’ve removed the flag that sets the Sync Period to 8s for reasons mentioned below. Here are two runs of the test where you can see time from ready is more than double the one before. We are really trying to find some consistent metric that we can rely on.

  NAME                 |  TIME
---------------------------------------+----------
  Time from replica change to catalog  | 45551ms
  Time from endpoint to catalog        | 44304ms
  Time from container ready to catalog | 22709ms
  Average Request Time (error)         | 183ms

  NAME                 |  TIME
---------------------------------------+----------
  Time from replica change to catalog  | 35994ms
  Time from endpoint to catalog        | 33752ms
  Time from container ready to catalog | 7752ms
  Average Request Time (error)         | 186ms

I also just noticed this line in the source code, alluding to some way to trigger syncs. Does anyone know where I can find more info on that? It sounds like a sync would be triggered by a deployment or scaling change, but as you can see in the first test added in this post, it takes 22 seconds for the new replica to show up! It doesn’t seem like anything was triggered at all.

This is making me wonder if dropping the SyncPeriod so drastically is incorrect…

I’ve also just noticed that while testing, I set the interval to 3s, but the loop still only happens every ~12 seconds…

  consul-k8s sync-catalog \
    ...
    -consul-write-interval=3s \
    ...

Logs

2021-02-24T17:59:56.219Z [INFO]  to-consul/sink: registering services
2021-02-24T18:00:07.415Z [INFO]  to-consul/sink: registering services
2021-02-24T18:00:19.409Z [INFO]  to-consul/sink: registering services
2021-02-24T18:00:31.904Z [INFO]  to-consul/sink: registering services
2021-02-24T18:00:43.523Z [INFO]  to-consul/sink: registering services
2021-02-24T18:00:55.614Z [INFO]  to-consul/sink: registering services
2021-02-24T18:01:07.004Z [INFO]  to-consul/sink: registering services

Hi, so unfortunately when endpoints are updated (via new deployments), the syncer doesn’t trigger immediately. It only triggers (right now) on new/deleted services. So then it relies on the “full sync” to pick up new endpoints.

Regarding the write interval, I’m not seeing that behaviour:

global:
  name: consul
server:
  replicas: 1
syncCatalog:
  enabled: true
  toK8S: false
  consulWriteInterval: "3s"
2021-03-04T23:30:41.178Z [INFO]  to-consul/sink: registering services
2021-03-04T23:30:44.369Z [INFO]  to-consul/sink: registering services
2021-03-04T23:30:47.774Z [INFO]  to-consul/sink: registering services
2021-03-04T23:30:50.893Z [INFO]  to-consul/sink: registering services
2021-03-04T23:30:53.983Z [INFO]  to-consul/sink: registering services
2021-03-04T23:30:57.077Z [INFO]  to-consul/sink: registering services
2021-03-04T23:31:00.191Z [INFO]  to-consul/sink: registering services
2021-03-04T23:31:03.378Z [INFO]  to-consul/sink: registering services

The only thing I can think of is maybe it’s still in the lock?

func (s *ConsulSyncer) syncFull(ctx context.Context) {
	s.lock.Lock()
	defer s.lock.Unlock()

	s.Log.Info("registering services")

How many services is it processing in your case?