False Consul check messages for a service check after VM reboot

Hi,

We are using Consul v1.14.3.

We are seeing false health check messages from the Consul agent's service check after a VM reboot.

All of our agents register their service using the JSON template below:
{
  "name": "Service Name",
  "tags": ["Tag"],
  "address": "container IP",
  "checks": [
    {
      "args": ["sh", "example.sh"],
      "interval": "5s",
      "timeout": "55s"
    }
  ]
}

The expectation is that after a VM reboot, the FQDN will not resolve to that particular container IP until the check script "example.sh" passes. However, every time after a VM reboot, before the application service on that agent has come up, we can see a service synced message in the Consul log. During that same window one of our clients connected to this agent and its request failed, since the application service itself had not come up completely.
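For reference, the status the local agent itself records for this check at any point can be read back with something like the following (a sketch; it assumes the agent's HTTP API is on the default 127.0.0.1:8500):

# Local agent's own view of every registered check, including its current status
curl -s http://127.0.0.1:8500/v1/agent/checks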

Below are the agent logs from Consul:

After the VM reboot, the Consul agent on the application node started at < 2023-08-11T11:33:21 >:
2023-08-11T11:33:21.500+0530 [INFO] agent: Consul agent running!
2023-08-11T11:33:39.100+0530 [INFO] agent.client.serf.lan: serf: EventMemberJoin: Agent1 IP
2023-08-11T11:33:40.187+0530 [WARN] agent: Check is now critical: check=service:Service Name
2023-08-11T11:33:53.721+0530 [WARN] agent: Check is now critical: check=service:Service Name
2023-08-11T11:34:14.054+0530 [INFO] agent: Synced check: check=service:Service Name << Health script has not passed at this time stamp >>
2023-08-11T11:34:24.181+0530 [WARN] agent: Check is now critical: check=service:Service Name
2023-08-11T11:34:29.873+0530 [WARN] agent: Check is now critical: check=service:Service Name
2023-08-11T11:34:35.641+0530 [WARN] agent: Check is now critical: check=service:Service Name
2023-08-11T11:36:44.867+0530 [WARN] agent: Check is now critical: check=service:Service Name
2023-08-11T11:36:47.065+0530 [INFO] agent: Synced check: check=service:Service Name << Health script has not passed at this time stamp >>
2023-08-11T11:37:07.916+0530 [WARN] agent: Check is now critical: check=service:Service Name
2023-08-11T11:37:14.048+0530 [INFO] agent: Synced check: check=service:Service Name << Application check script passed only at this time stamp >>

From the above logs we can see two false positive messages from Consul. When the application queried the agent FQDN, it resolved to this node, and the client got a 500 response code since the application itself had not started.
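One way to confirm what the servers considered passing at those timestamps is a catalog query like the one below (a sketch; the service name is a placeholder, and ?passing restricts the result to instances whose checks are passing):

# Instances of the service that the catalog currently reports as passing
curl -s "http://127.0.0.1:8500/v1/health/service/<service-name>?passing"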

Below is the application exception during this time stamp:
Aug 11 11:36:56.699925 10.7.1.29 [AMQP Connection 10.7.1.16:5671, INFO, KnRMQConnListener, shutdownCompleted(ShutdownSignalException), Received shutdown completed event. ShutdownSignalException is , com.rabbitmq.client.ShutdownSignalException: connection error; protocol method: #method<connection.close>(reply-code=541, reply-text=INTERNAL_ERROR - Cannot declare a queue 'queue '' in vhost ''' on node 'Agent Node': {'EXIT',{aborted,{no_exists,[rabbit_vhost,<<"*****">>]}}}, class-id=50, method-id=10)]…

Note:

  • The same issue is observed on all agent nodes.
  • There is no Consul leader fluctuation during this time frame.

Hi @prasanthkumar2531,

How are your applications querying Consul? Are they querying services over DNS or the HTTP interface?

Hi @blake

Our applications are querying the services registered in Consul over DNS.
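The lookups take roughly this form (a sketch; the service and domain names are placeholders, and the agent serves DNS on 127.0.0.1:8600):

# Resolve the service through the local agent's DNS interface
dig @127.0.0.1 -p 8600 <service-name>.service.<domain name>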

Hi @blake

Any update on this issue? Please let me know if any additional information is required from my side.

Can you share the agent configuration for a node that is returning services in the critical health state?

That will be helpful in debugging this issue.

Below is our agent configuration:

{
  "disable_update_check": true,
  "data_dir": "/opt/consul",
  "enable_local_script_checks": true,
  "log_level": "INFO",
  "datacenter": "<DATACENTER>",
  "bind_addr": "<CONTAINERIP>",
  "client_addr": "127.0.0.1",
  "domain": "<domain name>",
  "node_name": "<Node Name>",
  "server": false,
  "dns_config": {
    "udp_answer_limit": 1,
    "allow_stale": true,
    "node_ttl": "10s",
    "only_passing": true,
    "max_stale": "168h",
    "service_ttl": {
      "*": "10s",
      "sample_service1": "0s",
      "sample_service2": "0s",
      "sample_service3": "0s",
      "sample_service4": "0s",
      "sample_service5": "0s",
      "sample_service6": "0s"
    }
  },
  "recursors": [],
  "ports": {
    "dns": 8600
  },
  "addresses": {
    "dns": "127.0.0.1"
  },
  "limits": {
    "http_max_conns_per_client": 1000
  },
  "skip_leave_on_interrupt": false,
  "leave_on_terminate": true,
  "retry_join": ["remoteip1","remoteip2","remoteip3"]
}

Our agents register their service using the JSON template below:

{
  "name": "Service Name",
  "tags": ["Tag"],
  "address": "container IP",
  "checks": [
    {
      "args": ["sh", "example.sh"],
      "interval": "5s",
      "timeout": "55s"
    }
  ]
}
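For completeness, a minimal sketch of how such a definition can be pushed to the local agent over its HTTP API (the file name service.json and the address are illustrative, not necessarily what our automation uses):

# Register the service definition with the local agent
curl --request PUT --data @service.json http://127.0.0.1:8500/v1/agent/service/register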

Consul TLS file:

{
    "tls": {
        "defaults": {
            "verify_incoming": true,
            "verify_outgoing": true,
            "cert_file": "crt path",
            "key_file": "key path",
            "ca_file": "ca-bundle.crt path"
        },
        "internal_rpc": {
            "verify_server_hostname": false
        }
    }
}

Hi @blake

Could you please let me know if there is any update on this issue?