False Consul check messages for a service check after VM reboot

Hi,

We are using Consul v1.14.3.

We are seeing false health check messages from the Consul agent's service check after a VM reboot.

All of our agents register their service using the JSON template below:
{
  "name": "Service Name",
  "tags": ["Tag"],
  "address": "container IP",
  "checks": [
    {
      "args": ["sh", "example.sh"],
      "interval": "5s",
      "timeout": "55s"
    }
  ]
}

The expectation is that after a VM reboot, the FQDN will not resolve to that particular container IP until the check script "example.sh" passes. However, every time after a VM reboot, before the application service on that agent has come up, we can see a service synced message in the Consul log. During that same window one of our clients connected to this agent and its request failed, since the application service itself had not come up completely.
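For reference, the status the local agent itself records for this check at any point can be read back with something like the following (a sketch; it assumes the agent's HTTP API is on the default 127.0.0.1:8500):

# Local agent's own view of every registered check, including its current status
curl -s http://127.0.0.1:8500/v1/agent/checks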

Below are the agent logs from Consul:

After the VM reboot, the Consul agent on the application node started at < 2023-08-11T11:33:21 >:
2023-08-11T11:33:21.500+0530 [INFO] agent: Consul agent running!
2023-08-11T11:33:39.100+0530 [INFO] agent.client.serf.lan: serf: EventMemberJoin: Agent1 IP
2023-08-11T11:33:40.187+0530 [WARN] agent: Check is now critical: check=service:Service Name
2023-08-11T11:33:53.721+0530 [WARN] agent: Check is now critical: check=service:Service Name
2023-08-11T11:34:14.054+0530 [INFO] agent: Synced check: check=service:Service Name << Health script has not passed at this time stamp >>
2023-08-11T11:34:24.181+0530 [WARN] agent: Check is now critical: check=service:Service Name
2023-08-11T11:34:29.873+0530 [WARN] agent: Check is now critical: check=service:Service Name
2023-08-11T11:34:35.641+0530 [WARN] agent: Check is now critical: check=service:Service Name
2023-08-11T11:36:44.867+0530 [WARN] agent: Check is now critical: check=service:Service Name
2023-08-11T11:36:47.065+0530 [INFO] agent: Synced check: check=service:Service Name << Health script has not passed at this time stamp >>
2023-08-11T11:37:07.916+0530 [WARN] agent: Check is now critical: check=service:Service Name
2023-08-11T11:37:14.048+0530 [INFO] agent: Synced check: check=service:Service Name << Application check script passed only at this time stamp >>

From the above logs we can see two false positive messages from Consul. When the application queried the agent FQDN, it resolved to this node, and the client got a 500 response code since the application itself had not started.
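One way to confirm what the servers considered passing at those timestamps is a catalog query like the one below (a sketch; the service name is a placeholder, and ?passing restricts the result to instances whose checks are passing):

# Instances of the service that the catalog currently reports as passing
curl -s "http://127.0.0.1:8500/v1/health/service/<service-name>?passing"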

Below is the application exception during this time stamp:
Aug 11 11:36:56.699925 10.7.1.29 [AMQP Connection 10.7.1.16:5671, INFO, KnRMQConnListener, shutdownCompleted(ShutdownSignalException), Received shutdown completed event. ShutdownSignalException is , com.rabbitmq.client.ShutdownSignalException: connection error; protocol method: #method<connection.close>(reply-code=541, reply-text=INTERNAL_ERROR - Cannot declare a queue 'queue '' in vhost ''' on node 'Agent Node': {'EXIT',{aborted,{no_exists,[rabbit_vhost,<<"*****">>]}}}, class-id=50, method-id=10)]…

Note:

  • The same issue is observed on all agent nodes.
  • There is no Consul leader fluctuation during this time frame.

Hi @prasanthkumar2531,

How are your applications querying Consul? Are they querying services over DNS or the HTTP interface?

Hi @blake

Our applications are querying the services registered in Consul over DNS.
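The lookups take roughly this form (a sketch; the service and domain names are placeholders, and the agent serves DNS on 127.0.0.1:8600):

# Resolve the service through the local agent's DNS interface
dig @127.0.0.1 -p 8600 <service-name>.service.<domain name>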

Hi @blake

Any update on this issue? Please let me know if any additional information is required from my side.

Can you share the agent configuration for a node that is returning services in the critical health state?

That will be helpful in debugging this issue.

Below is our agent configuration:

{
  "disable_update_check": true,
  "data_dir": "/opt/consul",
  "enable_local_script_checks": true,
  "log_level": "INFO",
  "datacenter": "<DATACENTER>",
  "bind_addr": "<CONTAINERIP>",
  "client_addr": "127.0.0.1",
  "domain": "<domain name>",
  "node_name": "<Node Name>",
  "server": false,
  "dns_config": {
    "udp_answer_limit": 1,
    "allow_stale": true,
    "node_ttl": "10s",
    "only_passing": true,
    "max_stale": "168h",
    "service_ttl": {
      "*": "10s",
      "sample_service1": "0s",
      "sample_service2": "0s",
      "sample_service3": "0s",
      "sample_service4": "0s",
      "sample_service5": "0s",
      "sample_service6": "0s"
    }
  },
  "recursors": [],
  "ports": {
    "dns": 8600
  },
  "addresses": {
    "dns": "127.0.0.1"
  },
  "limits": {
    "http_max_conns_per_client": 1000
  },
  "skip_leave_on_interrupt": false,
  "leave_on_terminate": true,
  "retry_join": ["remoteip1","remoteip2","remoteip3"]
}

Our agents register their service using the JSON template below:

{
  "name": "Service Name",
  "tags": ["Tag"],
  "address": "container IP",
  "checks": [
    {
      "args": ["sh", "example.sh"],
      "interval": "5s",
      "timeout": "55s"
    }
  ]
}
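For completeness, a minimal sketch of how such a definition can be pushed to the local agent over its HTTP API (the file name service.json and the address are illustrative, not necessarily what our automation uses):

# Register the service definition with the local agent
curl --request PUT --data @service.json http://127.0.0.1:8500/v1/agent/service/register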

Consul TLS file:

{
    "tls": {
        "defaults": {
            "verify_incoming": true,
            "verify_outgoing": true,
            "cert_file": "crt path",
            "key_file": "key path",
            "ca_file": "ca-bundle.crt path"
        },
        "internal_rpc": {
            "verify_server_hostname": false
        }
    }
}

Hi @blake

Could you please let me know if there is any update on this issue?