ACL replication out-of-sync issue

Hi all,
We run a federated multi-cluster Consul deployment (v1.10.2) spanning 5 DCs, with 3 Consul servers in each.

From time to time, replication stops working for an unknown reason, and some tokens/policies that have been deleted from one DC are still present in the others (except the primary one). The only way to fix this is to restart the Consul service on any of the server nodes in the affected DCs. Replication then starts working again… until the same thing happens later on :unamused:

I checked the /v1/acl/replication response on all servers and found that on the affected ones the ReplicatedIndex and ReplicatedTokenIndex values differ from those on the working ones. However, no replication error is reported, either in the response or in the logs.

Affected DC

{
    "Enabled": true,
    "Running": true,
    "SourceDatacenter": "dc1",
    "ReplicationType": "tokens",
    "ReplicatedIndex": 8606595,
    "ReplicatedRoleIndex": 1,
    "ReplicatedTokenIndex": 8606610,
    "LastSuccess": "2021-11-30T12:36:43Z",
    "LastError": "0001-01-01T00:00:00Z"
}

Working DC

{
    "Enabled": true,
    "Running": true,
    "SourceDatacenter": "dc1",
    "ReplicationType": "tokens",
    "ReplicatedIndex": 8735625,
    "ReplicatedRoleIndex": 1,
    "ReplicatedTokenIndex": 8735398,
    "LastSuccess": "2021-12-06T14:20:02Z",
    "LastError": "0001-01-01T00:00:00Z"
}
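
For anyone who wants to catch this before noticing stale tokens: in the two outputs above, the affected DC's LastSuccess is about six days older than the working DC's, so a stale LastSuccess may be an easier signal to alert on than the raw indices. Below is a minimal Python sketch of that check. The function names, the 10-minute threshold and the server URLs are hypothetical, and the timestamp format is assumed to match the samples above (whole seconds, trailing Z).

```python
import json
import urllib.request
from datetime import datetime, timedelta, timezone


def fetch_replication_status(base_url, token=None):
    """GET /v1/acl/replication from one Consul server (base_url is an assumption,
    e.g. "http://consul-dc2-1:8500")."""
    req = urllib.request.Request(f"{base_url}/v1/acl/replication")
    if token:
        req.add_header("X-Consul-Token", token)
    with urllib.request.urlopen(req, timeout=5) as resp:
        return json.load(resp)


def parse_ts(value):
    """Parse timestamps shaped like the samples above, e.g. 2021-11-30T12:36:43Z."""
    parsed = datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ")
    return parsed.replace(tzinfo=timezone.utc)


def find_stale(statuses, now, max_age=timedelta(minutes=10)):
    """Return the names of DCs whose LastSuccess is older than max_age.

    statuses maps a DC name to its /v1/acl/replication response (a dict).
    The 10-minute default is an arbitrary threshold, not a Consul setting.
    """
    return sorted(
        name
        for name, status in statuses.items()
        if now - parse_ts(status["LastSuccess"]) > max_age
    )
```

To use it, call fetch_replication_status once per server, collect the results in a dict keyed by DC name, and pass that to find_stale together with datetime.now(timezone.utc).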

Any idea about why this happens and how to solve it?

Thanks,
Michele

Hi @miklinux,

Consul 1.11.0 now includes a LastErrorMessage field in the /v1/acl/replication response payload (see PR #10612). If you upgrade your cluster, you may gain better visibility into what is causing replication to fail periodically.

P.S. - I just realized this field has not yet been added to the docs for that API endpoint. I’ll make sure that gets addressed so that the docs correctly reflect the addition of that new field.
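
If you script against this endpoint, here is a small Python sketch of how the LastErrorMessage field could be read defensively, so the same code works on servers both before and after the upgrade. The helper name is hypothetical, and treating the zero timestamp as "no error recorded" is an assumption based on the outputs posted above.

```python
# Zero timestamp shown in LastError in the outputs above when no error was recorded.
ZERO_TIME = "0001-01-01T00:00:00Z"


def last_error(status):
    """Return (timestamp, message) if the response records an error, else None.

    status is a parsed /v1/acl/replication response. LastErrorMessage is read
    with .get() because servers older than 1.11.0 do not return it.
    """
    if status.get("LastError", ZERO_TIME) == ZERO_TIME:
        return None
    return status["LastError"], status.get("LastErrorMessage", "")
```

For both responses posted above this returns None, which matches the observation that no error is being reported.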