We have two Consul clusters (a testing cluster running 1.8 and a production cluster running 1.7.4), and I've observed the same unexpected behaviour on both of them. Both clusters have 3 servers and several other agents, and we use ESM to monitor external services.
We noticed that only the leader of the cluster had a correct view of the service health after it changed. We caused the external service to fail its health check and saw ESM update the service health in the catalog. However, looking at the consul_catalog_service_node_healthy metric, we could see that only the leader considered the service unhealthy.
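For anyone who wants to reproduce this, here is roughly how we compared each server's view using the Go API client. The addresses and service name are placeholders; the key part is the stale read, which makes each server answer from its own state instead of forwarding to the leader:

```go
package main

import (
	"fmt"
	"log"

	"github.com/hashicorp/consul/api"
)

func main() {
	// Hypothetical server addresses; replace with your own.
	servers := []string{"server1:8500", "server2:8500", "server3:8500"}

	for _, addr := range servers {
		cfg := api.DefaultConfig()
		cfg.Address = addr
		client, err := api.NewClient(cfg)
		if err != nil {
			log.Fatal(err)
		}

		// AllowStale makes each server answer from its own state
		// instead of forwarding the read to the leader.
		checks, _, err := client.Health().Checks("my-external-service",
			&api.QueryOptions{AllowStale: true})
		if err != nil {
			log.Fatal(err)
		}
		for _, c := range checks {
			fmt.Printf("%s: check %q status=%s\n", addr, c.CheckID, c.Status)
		}
	}
}
```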
We looked at the raft properties of the cluster and all the servers seemed to be in sync, with the correct leader/follower configuration.
By querying the service status over the HTTP API, we noticed that followers would not answer correctly when we passed "?cached", even hours after the change. We also noticed a very high LastContact value in the response to that query:
"LastContact": 43768357,
but the consul_raft_leader_lastContact metric showed values below 100 ms for every agent.
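For reference, this is roughly how the same "?cached" query looks through the Go client (UseCache maps to the cached query parameter; the follower address and service name are placeholders). The returned QueryMeta exposes the same LastContact value we saw in the HTTP response headers:

```go
package main

import (
	"fmt"
	"log"

	"github.com/hashicorp/consul/api"
)

func main() {
	cfg := api.DefaultConfig()
	cfg.Address = "follower1:8500" // hypothetical follower address
	client, err := api.NewClient(cfg)
	if err != nil {
		log.Fatal(err)
	}

	// UseCache corresponds to passing "?cached" on the HTTP API.
	opts := &api.QueryOptions{UseCache: true}
	_, meta, err := client.Health().Service("my-external-service", "", false, opts)
	if err != nil {
		log.Fatal(err)
	}

	// LastContact is how long ago the answering server last heard
	// from the leader; CacheHit/CacheAge describe the agent cache.
	fmt.Println("LastContact:", meta.LastContact)
	fmt.Println("CacheHit:", meta.CacheHit, "CacheAge:", meta.CacheAge)
}
```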
Isn’t this service health status also agreed upon by the consensus algorithm? What could be the reason for this behaviour?
We deregistered and re-registered one service, and the problem seems to have been resolved for that service; still, it is unexpected.
One thing I noticed is that the servers disagree on the CreateIndex and ModifyIndex of the health check.
For example in one of the follower servers:
"CreateIndex": 2623095,
"ModifyIndex": 2623095
In the leader:
"CreateIndex": 2623293,
"ModifyIndex": 3902282
I obtained these values by querying with the "stale" consistency mode, so that each server returns its local value. Are these values supposed to be in sync? Since ESM uses CAS semantics, I think this could explain why the health is updated on the leader but not on the followers.
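This is the same per-server loop as in my first post, now printing the indexes each server reports locally (addresses and service name are again placeholders):

```go
package main

import (
	"fmt"
	"log"

	"github.com/hashicorp/consul/api"
)

func main() {
	// Hypothetical server addresses; replace with your own.
	for _, addr := range []string{"server1:8500", "server2:8500", "server3:8500"} {
		cfg := api.DefaultConfig()
		cfg.Address = addr
		client, err := api.NewClient(cfg)
		if err != nil {
			log.Fatal(err)
		}

		// Stale consistency: each server answers from its local state,
		// so a divergent follower reports its own indexes.
		checks, _, err := client.Health().Checks("my-external-service",
			&api.QueryOptions{AllowStale: true})
		if err != nil {
			log.Fatal(err)
		}
		for _, c := range checks {
			fmt.Printf("%s: %s CreateIndex=%d ModifyIndex=%d\n",
				addr, c.CheckID, c.CreateIndex, c.ModifyIndex)
		}
	}
}
```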
Thanks for the help, but this does not seem like the problem we’re having.
In our case, ESM successfully updates the health check, but the change is not propagated to the followers; it only lands on the leader. The reason seems to be that the servers disagree on the ModifyIndex of that resource, and since ESM uses CAS semantics, the value is only updated on the leader, whose ModifyIndex matches the index ESM sent in the transaction.
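To illustrate the CAS part: as far as I understand, ESM updates checks through the transaction endpoint, and a CAS-style check update with the Go client looks roughly like this (node name, check ID and index are placeholders; the operation only applies if the supplied ModifyIndex matches the one in the server's state store):

```go
package main

import (
	"fmt"
	"log"

	"github.com/hashicorp/consul/api"
)

func main() {
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	// Hypothetical values: the CAS only applies if ModifyIndex matches
	// the check's current index in the state store.
	ops := api.TxnOps{
		&api.TxnOp{
			Check: &api.CheckTxnOp{
				Verb: api.CheckCAS,
				Check: api.HealthCheck{
					Node:        "external-node",
					CheckID:     "external-check",
					Status:      api.HealthCritical,
					ModifyIndex: 3902282, // the leader's value from above
				},
			},
		},
	}

	ok, resp, _, err := client.Txn().Txn(ops, nil)
	if err != nil {
		log.Fatal(err)
	}
	if !ok {
		// A mismatched index rolls the transaction back.
		fmt.Println("CAS failed:", resp.Errors)
	}
}
```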