Consul servers disagree regarding service health

edevil · July 22, 2020, 9:23am

Hello.

We have two Consul clusters (a testing cluster running 1.8 and a production cluster running 1.7.4), and I’ve observed unexpected behaviour on both of them. Each cluster has 3 servers and several other agents, and we use ESM to monitor external services.

We noticed that only the leader of the cluster had a correct view of the service health when it changed. We caused the external service to fail its health check and saw ESM update the service health in the catalog. However, by looking at the consul_catalog_service_node_healthy metric, we could see that only the leader considered the service unhealthy.

We looked at the Raft properties of the cluster and all the servers seemed to be in sync, with the correct leader/follower configuration.

By querying the service status over the HTTP API we noticed that followers would not return the correct status when we passed “?cached”, even though hours had passed. We also noticed a high LastContact value in the response to that query:

"LastContact": 43768357,

but the consul_raft_leader_lastContact metric showed values < 100 for every agent.
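
To put that LastContact number in perspective: if it is expressed in milliseconds (an assumption here, based on how the X-Consul-LastContact response header is documented), it works out to a little over 12 hours, which would fit the "hours had passed" observation. A quick check, with a hypothetical helper name:

```python
from datetime import timedelta

# Value copied from the response quoted above. The unit is ASSUMED to be
# milliseconds, as documented for the X-Consul-LastContact header.
last_contact_ms = 43768357


def describe_last_contact(ms: int) -> str:
    """Render a millisecond count as H:MM:SS.ffffff."""
    return str(timedelta(milliseconds=ms))


print(describe_last_contact(last_contact_ms))  # 12:09:28.357000
```

The Raft metric reporting values under 100 at the same time suggests the two numbers are measuring different things, so the comparison should be read with that caveat.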

Isn’t this service health status also agreed upon by the consensus algorithm? What could be the reason for this behaviour?

We deregistered and re-registered one service, and the problem seems to be resolved for that service. Still, the behaviour is unexpected.

Best regards.

edevil · July 22, 2020, 3:05pm

One thing that I noticed is that servers disagree regarding the CreateIndex and ModifyIndex of the health check.

For example in one of the follower servers:

"CreateIndex": 2623095,
"ModifyIndex": 2623095

In the leader:

"CreateIndex": 2623293,
"ModifyIndex": 3902282

I obtained these values by querying each server with the “stale” consistency mode, so that each server returns its local value. Are these values supposed to be in sync? Since ESM uses CAS semantics, I think this can explain why the health is updated on the leader but not on the followers.
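
The comparison described above can be scripted. The following is a minimal sketch, not a tested tool: it assumes each server's HTTP API is reachable directly, and the server labels and the CheckID `example-check` are placeholders. Only the index values in the sample data come from this thread.

```python
import json
import urllib.request


def fetch_checks(server_url: str, service: str) -> list:
    """Ask one server for its local view of a service's checks (?stale)."""
    url = f"{server_url}/v1/health/checks/{service}?stale"
    with urllib.request.urlopen(url, timeout=5) as resp:
        return json.load(resp)


def diverging_checks(checks_by_server: dict) -> dict:
    """Return {CheckID: {server: (CreateIndex, ModifyIndex)}} for checks
    whose indexes differ between servers or that some server lacks."""
    views = {}
    for server, checks in checks_by_server.items():
        for check in checks:
            views.setdefault(check["CheckID"], {})[server] = (
                check["CreateIndex"],
                check["ModifyIndex"],
            )
    return {
        check_id: per_server
        for check_id, per_server in views.items()
        if len(set(per_server.values())) > 1
        or len(per_server) < len(checks_by_server)
    }


# Sample data using the index values quoted in this thread.
# "follower", "leader" and "example-check" are placeholder names.
sample = {
    "follower": [
        {"CheckID": "example-check", "CreateIndex": 2623095, "ModifyIndex": 2623095}
    ],
    "leader": [
        {"CheckID": "example-check", "CreateIndex": 2623293, "ModifyIndex": 3902282}
    ],
}
print(diverging_checks(sample))
```

Against a live cluster, the input would be built by calling `fetch_checks` once per server address; an empty result would mean all servers agree on the indexes for that service.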

npearce · July 23, 2020, 7:28pm

Hi! I’m wondering if this matches the experience you’re having?

Requires ‘consistent’ consistency mode, documented here:

To test you could build a new ESM binary from https://github.com/hashicorp/consul-esm/tree/master

Thoughts?

edevil · July 24, 2020, 10:17am

Thanks for the help, but this does not seem like the problem we’re having.

In our case ESM succeeds in updating the health check, but the change is not propagated to the followers; it stays on the leader only. The reason seems to be that the servers disagree on the ModifyIndex value of that resource. Since ESM uses CAS semantics, the value is only updated on the leader, whose ModifyIndex equals the CAS index that ESM sent in the transaction.
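
The hypothesis in this post can be illustrated with a toy model. This is not Consul's implementation; it only sketches how a check-and-set write would behave if each replica compared the CAS index against its own local ModifyIndex. The follower index is the one quoted earlier in the thread; the leader's pre-update index, the new index, and all names are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Check:
    status: str
    modify_index: int


def apply_cas(store: dict, check_id: str, cas_index: int,
              new_status: str, new_index: int) -> bool:
    """Apply a check-and-set write to one replica's local store.

    The write only lands if the stored ModifyIndex equals cas_index.
    """
    current = store.get(check_id)
    if current is None or current.modify_index != cas_index:
        return False
    store[check_id] = Check(new_status, new_index)
    return True


# Hypothetical starting state: the replicas already disagree on ModifyIndex.
leader = {"example-check": Check("passing", 2623293)}
follower = {"example-check": Check("passing", 2623095)}

# The writer reads the index from the leader and sends it as the CAS value.
cas = leader["example-check"].modify_index
results = {
    name: apply_cas(store, "example-check", cas, "critical", 3902282)
    for name, store in (("leader", leader), ("follower", follower))
}
print(results)  # {'leader': True, 'follower': False}
```

In this model the leader ends up with the check marked critical while the follower keeps the old status, which is the symptom described. It does not explain how the indexes diverged in the first place.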
