Hi all,
We have a federated Consul setup (v1.10.2) spanning 5 DCs, with 3 Consul servers in each.
From time to time ACL replication stops working for an unknown reason, and some tokens/policies that have been deleted from one DC remain present in the other DCs (all except the primary). The only way to fix this is to restart the consul service on any of the server nodes in the affected DCs, after which replication starts working again… until the same thing happens later on.
I checked the /v1/acl/replication response on all servers and found that on the affected ones the ReplicatedIndex and ReplicatedTokenIndex values differ from those on the working ones, but no replication error is reported, either in the response or in the logs.
Affected DC
{
"Enabled": true,
"Running": true,
"SourceDatacenter": "dc1",
"ReplicationType": "tokens",
"ReplicatedIndex": 8606595,
"ReplicatedRoleIndex": 1,
"ReplicatedTokenIndex": 8606610,
"LastSuccess": "2021-11-30T12:36:43Z",
"LastError": "0001-01-01T00:00:00Z"
}
Working DC
{
"Enabled": true,
"Running": true,
"SourceDatacenter": "dc1",
"ReplicationType": "tokens",
"ReplicatedIndex": 8735625,
"ReplicatedRoleIndex": 1,
"ReplicatedTokenIndex": 8735398,
"LastSuccess": "2021-12-06T14:20:02Z",
"LastError": "0001-01-01T00:00:00Z"
}
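Since neither the response nor the logs report an error, one way to spot the problem early might be to poll /v1/acl/replication on every server and flag any whose LastSuccess is old while Enabled and Running are still true. The following is only a minimal sketch of that idea: the function names, the server URL passed to fetch_replication_status, and the 10-minute threshold are assumptions for illustration, and the timestamp parsing assumes the whole-second format shown in the responses above.

```python
import json
import urllib.request
from datetime import datetime, timedelta, timezone


def fetch_replication_status(base_url, token=None):
    """GET /v1/acl/replication from one Consul server and decode the JSON body."""
    req = urllib.request.Request(base_url.rstrip("/") + "/v1/acl/replication")
    if token:
        # Consul accepts the ACL token in the X-Consul-Token header.
        req.add_header("X-Consul-Token", token)
    with urllib.request.urlopen(req, timeout=5) as resp:
        return json.load(resp)


def parse_ts(value):
    """Parse a timestamp such as "2021-11-30T12:36:43Z" as an aware UTC datetime."""
    parsed = datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ")
    return parsed.replace(tzinfo=timezone.utc)


def is_stalled(status, now, max_age=timedelta(minutes=10)):
    """Return True if replication claims to run but has not succeeded recently.

    max_age is an assumed threshold; tune it to your environment.
    """
    if not (status.get("Enabled") and status.get("Running")):
        return False
    return now - parse_ts(status["LastSuccess"]) > max_age
```

With the two responses above and a check time shortly after 2021-12-06T14:20:02Z, the affected server would be flagged (its LastSuccess is about six days old) and the working one would not. In practice, fetch_replication_status would be called with each server's HTTP address (for example a hypothetical http://consul-dc2-1:8500) and the result passed to is_stalled together with datetime.now(timezone.utc).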
Any idea why this happens and how to solve it?
Thanks,
Michele