We have two Consul clusters (a testing cluster running 1.8 and a production cluster running 1.7.4) and I’ve observed the same unexpected behaviour on both of them. Both clusters have 3 servers and several other agents, and we use Consul ESM to monitor external services.
We noticed that only the leader of the cluster had a correct view of the service health when it changed. We caused the external service to fail its health check and saw ESM update the service health in the catalog. However, looking at the consul_catalog_service_node_healthy metric, we could see that only the leader considered the service unhealthy.
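For context, this is roughly how we compared the views across the servers (the addresses and the service name are placeholders, not our real ones):

```shell
# Query each Consul server's HTTP API directly and compare its view of the
# health checks for the externally monitored service.
for server in 10.0.0.1 10.0.0.2 10.0.0.3; do
  echo "== $server =="
  curl -s "http://$server:8500/v1/health/service/my-external-service" |
    jq '.[].Checks[] | {CheckID, Status}'
done
```

Only the leader reported the check as critical; the followers kept returning passing.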
We looked at the raft properties of the cluster and all the servers seemed to be in sync, with the correct leader/follower configuration.
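For example, the peer set can be inspected like this (output shape from memory, not a verbatim paste):

```shell
# List the raft peer set as seen by the local agent; this showed one leader
# and two up-to-date followers, as expected.
consul operator raft list-peers
```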
Querying the service status over the HTTP API, we noticed that followers would not answer correctly when we passed “?cached”, even after hours had passed. We also noticed a high lastContact value in the response to that query, yet the consul_raft_leader_lastContact metric showed values below 100 ms for every agent.
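This is the kind of query we used against a follower (address and service name are placeholders); `-i` prints the response headers, which is where the cache-related values show up:

```shell
# Query a follower with agent caching enabled and inspect the response
# headers (e.g. X-Cache and X-Consul-LastContact) alongside the body.
curl -si "http://10.0.0.2:8500/v1/health/service/my-external-service?cached"
```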
Isn’t this service health status also agreed upon by the consensus algorithm? What could be the reason for this behaviour?
We deregistered and re-registered one service and the problem seems to be resolved for that service, but the behaviour is still unexpected.
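Since the service is an external one registered by ESM, we did this through the catalog API. A sketch of what we ran (node name, address, and service ID are hypothetical):

```shell
# Remove the external service registration from the catalog...
curl -s -X PUT "http://127.0.0.1:8500/v1/catalog/deregister" \
  -d '{"Node": "external-node", "ServiceID": "my-external-service"}'

# ...and register it again so ESM picks it back up.
curl -s -X PUT "http://127.0.0.1:8500/v1/catalog/register" \
  -d '{"Node": "external-node", "Address": "203.0.113.10",
       "Service": {"ID": "my-external-service",
                   "Service": "my-external-service", "Port": 443}}'
```

After that, all servers agreed on the health status of that service again.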