ACL replication out-of-sync issue

miklinux · December 6, 2021, 2:26pm

Hi all,
we have a Consul multi-cluster federation (v1.10.2) spanning 5 DCs, with 3 Consul servers in each.

From time to time, replication stops working for an unknown reason, and some tokens/policies that have been deleted from one DC are still present in the others, except the primary one. The only way to fix this is to restart the consul service on any of the server nodes in the affected DCs, after which replication starts working again… until the same thing happens later on.

I checked the /v1/acl/replication response on all servers and found that on the affected ones the ReplicatedIndex and ReplicatedTokenIndex values differ from those on the working ones, but no replication error is reported, either in the response or in the logs.

Affected DC

{
    "Enabled": true,
    "Running": true,
    "SourceDatacenter": "dc1",
    "ReplicationType": "tokens",
    "ReplicatedIndex": 8606595,
    "ReplicatedRoleIndex": 1,
    "ReplicatedTokenIndex": 8606610,
    "LastSuccess": "2021-11-30T12:36:43Z",
    "LastError": "0001-01-01T00:00:00Z"
}

Working DC

{
    "Enabled": true,
    "Running": true,
    "SourceDatacenter": "dc1",
    "ReplicationType": "tokens",
    "ReplicatedIndex": 8735625,
    "ReplicatedRoleIndex": 1,
    "ReplicatedTokenIndex": 8735398,
    "LastSuccess": "2021-12-06T14:20:02Z",
    "LastError": "0001-01-01T00:00:00Z"
}
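
Since LastError stays empty in both payloads, the stale LastSuccess timestamp is the clearest signal of a stalled secondary. Below is a minimal sketch of a staleness check on such a payload. The helper names and the 10-minute threshold are assumptions for illustration, not anything Consul defines, and the parser assumes second-precision UTC timestamps like the ones shown above.

```python
from datetime import datetime, timedelta, timezone


def parse_consul_time(value: str) -> datetime:
    """Parse a UTC timestamp such as '2021-11-30T12:36:43Z'.

    Assumes second precision, as in the payloads above.
    """
    parsed = datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ")
    return parsed.replace(tzinfo=timezone.utc)


def replication_lag(status: dict, now: datetime) -> timedelta:
    """Time elapsed since the last successful replication round."""
    return now - parse_consul_time(status["LastSuccess"])


def is_stalled(status: dict, now: datetime,
               max_lag: timedelta = timedelta(minutes=10)) -> bool:
    """True if replication is enabled but has not succeeded recently.

    max_lag is a hypothetical threshold; tune it to your environment.
    """
    if not status.get("Enabled", False):
        return False
    return replication_lag(status, now) > max_lag
```

Run against each server's /v1/acl/replication response with `now = datetime.now(timezone.utc)`, a check like this would flag the affected DC even though no error is reported.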

Any idea about why this happens and how to solve it?

Thanks,
Michele

blake · January 4, 2022, 9:38am

Hi @miklinux,

Consul 1.11.0 now includes a LastErrorMessage field in the /v1/acl/replication response payload (see PR #10612). If you were to upgrade your cluster, you may be able to gain a little better visibility into what is causing replication to periodically fail.

P.S. - I just realized this field has not yet been added to the docs for that API endpoint. I’ll make sure that gets addressed so that the docs correctly reflect the addition of that new field.
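
For anyone who wants to check the new field after upgrading, here is a rough sketch of querying the endpoint and reading LastErrorMessage when it is present. The agent address, the function names, and the output wording are assumptions for illustration. The field is read with `.get()` so that the same code also works against servers older than 1.11.0, which do not return it.

```python
import json
import urllib.request
from typing import Optional


def fetch_replication_status(base_url: str = "http://127.0.0.1:8500",
                             token: Optional[str] = None) -> dict:
    """Fetch /v1/acl/replication from a Consul agent.

    base_url is an assumed local agent address; pass an ACL token if required.
    """
    request = urllib.request.Request(f"{base_url}/v1/acl/replication")
    if token:
        request.add_header("X-Consul-Token", token)
    with urllib.request.urlopen(request, timeout=5) as response:
        return json.load(response)


def describe_last_error(status: dict) -> str:
    """Summarize the last replication error, if the server reported one."""
    message = status.get("LastErrorMessage")
    if message:
        return f"last error at {status.get('LastError')}: {message}"
    return "no replication error message reported"
```

Calling `describe_last_error(fetch_replication_status())` on each server in an affected DC should show whether the secondary recorded a reason for the failure.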
