Consul Client Experiences Hiccups After Rejoining Consul Cluster

First of all, thank you for your support. We appreciate it.

Summary of the problem

The problem started to happen after we upgraded Consul from v1.8.9 to v1.20.x.

Our Consul client agent runs on a Virtual Machine (VM). Right before the VM is destroyed, the agent gracefully leaves the Consul cluster (the Consul servers). A new VM is then created and the client agent rejoins the cluster with the very same configuration (e.g., the same node name). On this rejoin the agent has a “hiccup”: it rejoins, is marked failed, is forced to leave, and then rejoins again. The forced leave results in an undesirable window during which the Consul node’s hostname cannot be resolved.

We are trying to both understand and avoid this behavior.

Behavior Flow

Note: We have a Consul cluster running with three Consul server agents.

t0 - VM created with Consul client agent (Consul client node)
t1 - Consul client node joins the Consul cluster for the first time - (Status alive)
t2 - Consul client node leaves intentionally; consul leave is performed before stopping the VM (see the command sketch after this timeline) - (Status leaving → left)
t3 - VM destroyed - (Status left)
t4 - New VM created with Consul client node - (Status left)
t5 - Consul client node rejoins - (Status alive)
t6 - The Consul servers receive an EventMemberFailed for the node (Status failed)
t7 - [downtime] The Consul servers receive an EventMemberLeave (forced) for the node (Status left)
t8 - Consul client node rejoins (Status alive)
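
For reference, a minimal sketch of the commands behind t2 and t5, using the Consul binary path from our environment (the config file path below is illustrative, not necessarily the exact path the agent reads):

# t2: graceful leave right before the VM is stopped and destroyed
/var/vcap/packages/consul/bin/consul leave

# t5: on the newly created VM, start the agent again with the same configuration;
# retry_join in the configuration makes it rejoin the three servers
/var/vcap/packages/consul/bin/consul agent -config-file=/var/vcap/jobs/consul/consul/config.json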

Notes

  • This behavior did not occur on v1.8.9.
  • Without steps t3 and t4 (destroying and recreating the VM), we do not see this issue.
  • The Consul client node uses the same configuration throughout this process, so it has the same Consul node name. It does not have the same node ID, because we do not set one explicitly and Consul generates it (see the sketch after these notes). See the Consul Node Configuration section below.
  • t7 does not always happen; it occurs intermittently and seems to follow the EventMemberFailed.
  • t6 does not happen when the newly created VM has a different IP address.
  • Between t0 and t3 we have no problems.
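
If the changing node ID turns out to matter, one workaround we are considering is deriving a stable node ID from the node name and passing it explicitly, so the recreated VM reuses the same ID. This is only a sketch under our assumptions (name-based UUID via util-linux uuidgen, illustrative config path); we have not confirmed it avoids the issue:

NODE_NAME="redis-unit-tests-dev-cdca3e9-1738784160-redis-0"
# UUIDv5 derived from the node name, so the same name always yields the same node ID
NODE_ID=$(uuidgen --sha1 --namespace @dns --name "$NODE_NAME")
/var/vcap/packages/consul/bin/consul agent \
  -config-file=/var/vcap/jobs/consul/consul/config.json \
  -node-id="$NODE_ID"

The same value could also be set through the node_id field in the JSON configuration instead of the command line.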

Questions

1- Is this a known issue?
2- Has something changed between versions 1.8.9 and 1.20.x that could cause this behavior? If so, do you know what changed and why it has this effect?
3- Could the EventMemberFailed be related to an inconsistency between what the servers still hold about the node and the new information the Consul client node presents after rejoining?
4- Could the EventMemberLeave (forced) be related to Autopilot forcing the Consul client node to leave because it is in a failed status?

More Information and Context

We were using Consul version 1.8.9 and had to jump to Consul version 1.20.x. After the upgrade we observed a new behavior: our test suites started to fail because of a period during which some Consul node hostnames could not be resolved.

To summarize our scenario, we have a Consul cluster running with three Consul servers (three VMs). The tests are executed by provisioning a VM with a Consul client node, which then runs the tests. One of the tests attempts to resolve the hostname of the Consul client node, and once the tests complete, the VM is deleted. This process runs many times, and the issue arises between runs of this kind of test, specifically when the Consul client node rejoins the Consul cluster.

Note that this may happen not only in the tests but also in production environments that follow a similar workflow.
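
For context, the lookup the test performs is roughly the following, with the domain, datacenter, and DNS port taken from the configurations below (the exact query our test issues may differ slightly):

dig @127.0.0.1 -p 53 redis-unit-tests-dev-cdca3e9-1738784160-redis-0.node.dc1.dsf1.abcsvc +short

During the forced-leave window (t7) this query returns no address, which is the downtime we observe.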

Log Information

Consul Node Agent (client) Log Fragment

==> Starting Consul agent...
               Version: '1.20.1'
            Build Date: '2024-10-29 19:04:05 +0000 UTC'
               Node ID: '7e83f806-3195-fb52-b7f1-e9eb811fa7f3'
             Node name: 'redis-unit-tests-dev-cdca3e9-1738784160-redis-0'
            Datacenter: 'dc1' (Segment: '')
                Server: false (Bootstrap: false)
           Client Addr: [127.0.0.1] (HTTP: 8500, HTTPS: -1, gRPC: -1, gRPC-TLS: -1, DNS: 53)
          Cluster Addr: 10.200.162.51 (LAN: 8301, WAN: 8302)
     Gossip Encryption: true
      Auto-Encrypt-TLS: false
           ACL Enabled: false
    ACL Default Policy: allow
             HTTPS TLS: Verify Incoming: true, Verify Outgoing: true, Min Version: TLSv1_2
              gRPC TLS: Verify Incoming: true, Min Version: TLSv1_2
      Internal RPC TLS: Verify Incoming: true, Verify Outgoing: true (Verify Hostname: false), Min Version: TLSv1_2

==> Log data will now stream in as it occurs:

(...)
2025-02-12T23:13:34.635Z [INFO]  agent.client.serf.lan: serf: EventMemberJoin: redis-unit-tests-dev-cdca3e9-1738784160-redis-0 10.200.162.51
(...)
2025-02-12T23:14:56.329Z [WARN]  agent.client.memberlist.lan: memberlist: Refuting a dead message (from: redis-unit-tests-dev-cdca3e9-1738784160-redis-0)
(...)
2025-02-12T23:14:56.554Z [DEBUG] agent.client.serf.lan: serf: messageJoinType: redis-unit-tests-dev-cdca3e9-1738784160-redis-0
(...)
2025-02-12T23:14:57.025Z [DEBUG] agent.client.serf.lan: serf: messageJoinType: redis-unit-tests-dev-cdca3e9-1738784160-redis-0
(...)

Consul Server (Leader) Log Fragment

(...)
2025-02-12T23:04:58.958Z [INFO]  agent.server.serf.lan: serf: EventMemberJoin: redis-unit-tests-dev-cdca3e9-1738784160-redis-0 10.200.162.51
2025-02-12T23:04:58.959Z [INFO]  agent.server: member joined, marking health alive: member=redis-unit-tests-dev-cdca3e9-1738784160-redis-0 partition=default
2025-02-12T23:08:15.234Z [INFO]  agent.server.serf.lan: serf: EventMemberLeave: redis-unit-tests-dev-cdca3e9-1738784160-redis-0 10.200.162.51
2025-02-12T23:08:15.235Z [INFO]  agent.server: deregistering member: member=redis-unit-tests-dev-cdca3e9-1738784160-redis-0 partition=default reason=left
2025-02-12T23:13:34.642Z [INFO]  agent.server.serf.lan: serf: EventMemberJoin: redis-unit-tests-dev-cdca3e9-1738784160-redis-0 10.200.162.51
2025-02-12T23:13:34.643Z [INFO]  agent.server: member joined, marking health alive: member=redis-unit-tests-dev-cdca3e9-1738784160-redis-0 partition=default
2025-02-12T23:13:50.640Z [INFO]  agent.server.serf.lan: serf: EventMemberFailed: redis-unit-tests-dev-cdca3e9-1738784160-redis-0 10.200.162.51
**2025-02-12T23:13:50.640Z [INFO]  agent.server: member failed, marking health critical: member=redis-unit-tests-dev-cdca3e9-1738784160-redis-0 partition=default**
**2025-02-12T23:14:23.650Z [INFO]  agent.server.serf.lan: serf: EventMemberLeave (forced): redis-unit-tests-dev-cdca3e9-1738784160-redis-0 10.200.162.51**
2025-02-12T23:14:23.651Z [INFO]  agent.server: deregistering member: member=redis-unit-tests-dev-cdca3e9-1738784160-redis-0 partition=default reason=left
2025-02-12T23:14:26.729Z [INFO]  agent.server.serf.lan: serf: EventMemberJoin: redis-unit-tests-dev-cdca3e9-1738784160-redis-0 10.200.162.51
2025-02-12T23:14:26.729Z [INFO]  agent.server: member joined, marking health alive: member=redis-unit-tests-dev-cdca3e9-1738784160-redis-0 partition=default
2025-02-12T23:14:27.148Z [INFO]  agent.server.serf.lan: serf: EventMemberFailed: redis-unit-tests-dev-cdca3e9-1738784160-redis-0 10.200.162.51
2025-02-12T23:14:27.148Z [INFO]  agent.server: member failed, marking health critical: member=redis-unit-tests-dev-cdca3e9-1738784160-redis-0 partition=default
2025-02-12T23:14:46.802Z [INFO]  agent.server.serf.lan: serf: EventMemberLeave (forced): redis-unit-tests-dev-cdca3e9-1738784160-redis-0 10.200.162.51
2025-02-12T23:14:46.802Z [INFO]  agent.server: deregistering member: member=redis-unit-tests-dev-cdca3e9-1738784160-redis-0 partition=default reason=left
2025-02-12T23:14:56.528Z [INFO]  agent.server.serf.lan: serf: EventMemberJoin: redis-unit-tests-dev-cdca3e9-1738784160-redis-0 10.200.162.51
2025-02-12T23:14:56.528Z [INFO]  agent.server: member joined, marking health alive: member=redis-unit-tests-dev-cdca3e9-1738784160-redis-0 partition=default

(...)

Script Monitor Fragment

Note that this script was executed to monitor the behavior by repeatedly checking whether the Consul node hostname could be resolved and what the Consul node's status was.
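
A rough reconstruction of what the script does (the real script is not reproduced here, so the commands below are approximate):

NODE="redis-unit-tests-dev-cdca3e9-1738784160-redis-0"
while true; do
  # resolve the node hostname through the local Consul DNS interface
  IP=$(dig @127.0.0.1 -p 53 "${NODE}.node.dc1.dsf1.abcsvc" +short)
  # read the node status from the LAN member list
  STATUS=$(/var/vcap/packages/consul/bin/consul members | awk -v n="$NODE" '$1 == n {print $3}')
  echo "$(date -u) - IP: (${IP}) - Status ${STATUS}"
  sleep 1
done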

Wed Feb 12 11:07:57 PM UTC 2025 - IP: (10.200.162.51) - Status alive
(...)
Wed Feb 12 11:08:14 PM UTC 2025 - IP: (10.200.162.51) - Status alive
Wed Feb 12 11:08:15 PM UTC 2025 - IP: () - Status left
(...)
Wed Feb 12 11:13:33 PM UTC 2025 - IP: () - Status left
Wed Feb 12 11:13:34 PM UTC 2025 - IP: (10.200.162.51) - Status alive
(...)
Wed Feb 12 11:13:50 PM UTC 2025 - IP: (10.200.162.51) - Status alive
Wed Feb 12 11:13:51 PM UTC 2025 - IP: (10.200.162.51) - Status failed
(...)
Wed Feb 12 11:14:23 PM UTC 2025 - IP: (10.200.162.51) - Status failed
Wed Feb 12 11:14:24 PM UTC 2025 - IP: () - Status left
Wed Feb 12 11:14:25 PM UTC 2025 - IP: () - Status left
Wed Feb 12 11:14:26 PM UTC 2025 - IP: (10.200.162.51) - Status alive
Wed Feb 12 11:14:27 PM UTC 2025 - IP: (10.200.162.51) - Status failed
(...)
Wed Feb 12 11:14:45 PM UTC 2025 - IP: (10.200.162.51) - Status failed
Wed Feb 12 11:14:46 PM UTC 2025 - IP: (10.200.162.51) - Status left
Wed Feb 12 11:14:47 PM UTC 2025 - IP: () - Status left
(...)
Wed Feb 12 11:14:56 PM UTC 2025 - IP: () - Status left
Wed Feb 12 11:14:57 PM UTC 2025 - IP: (10.200.162.51) - Status alive
(...)
Wed Feb 12 11:15:07 PM UTC 2025 - IP: (10.200.162.51) - Status alive

Consul Node Configuration

{
  "data_dir": "/var/vcap/store/consul",
  "ui_config": {
    "enabled": false
  },
  "node_name": "redis-unit-tests-dev-cdca3e9-1738784160-redis-0",
  "bind_addr": "10.200.162.51",
  "client_addr": "127.0.0.1",
  "log_level": "TRACE",
  "datacenter": "dc1",
  "domain": "dsf1.abcsvc",
  "server": false,
  "rejoin_after_leave": true,
  "raft_protocol": 3,
  "dns_config": {
    "max_stale": "2h"
  },
  "service": {
    "name": "redis-unit-tests-dev-cdca3e9-1738784160",
    "tags": [
      "6d1ce0d1-96e7-4ba3-91b6-09138bd0b296"
    ]
  },
  "ports": {
    "dns": 53,
    "http": 8500
  },
  "tls": {
    "defaults": {
      "verify_incoming": true,
      "verify_outgoing": true,
      "ca_file": "/var/vcap/jobs/consul/consul/ca.cert",
      "cert_file": "/var/vcap/jobs/consul/consul/consul.cert",
      "key_file": "/var/vcap/jobs/consul/consul/consul.key"
    }
  },
  "encrypt": "naWJ5UGlzWGlnRk9OTTUwVAx=",
  "recursors": [
    "8.8.8.8"
  ],
  "retry_join": [
    "10.200.150.16",
    "10.200.151.16",
    "10.200.152.16"
  ]
}
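
Because node_id is not set in this configuration, Consul generates a node ID on first start and, as far as we understand, persists it in a node-id file under data_dir. Since data_dir lives on the VM's disk, the ID is lost when the VM is destroyed, which is why the recreated node comes back with the same name but a new ID. A quick way to check this on the client VM (paths from the configuration above):

cat /var/vcap/store/consul/node-id
/var/vcap/packages/consul/bin/consul catalog nodes -detailed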

Consul Server Configuration

{
  "data_dir": "/var/vcap/store/consul",
  "ui_config": {
    "enabled": false
  },
  "node_name": "consul-dns-consul-0",
  "bind_addr": "10.200.150.16",
  "client_addr": "127.0.0.1",
  "log_level": "TRACE",
  "datacenter": "dc1",
  "domain": "dsf1.abcsvc",
  "server": true,
  "rejoin_after_leave": true,
  "raft_protocol": 3,
  "dns_config": {
    "max_stale": "2h"
  },
  "service": {
    "name": "consul",
    "tags": [
      "279c064a-dc38-4708-ad41-243d86726e9b"
    ]
  },
  "ports": {
    "dns": 53,
    "http": 8500
  },
  "tls": {
    "defaults": {
      "verify_incoming": true,
      "verify_outgoing": true,
      "ca_file": "/var/vcap/jobs/consul/consul/ca.cert",
      "cert_file": "/var/vcap/jobs/consul/consul/consul.cert",
      "key_file": "/var/vcap/jobs/consul/consul/consul.key"
    }
  },
  "encrypt": "naWJ5UGlzWGlnRk9OTTUwVAx=",
  "recursors": [
    "8.8.8.8"
  ],
  "bootstrap_expect": 3
}

Consul Server Autopilot Configuration

$ /var/vcap/packages/consul/bin/consul operator autopilot get-config
CleanupDeadServers = true
LastContactThreshold = 200ms
MaxTrailingLogs = 250
MinQuorum = 0
ServerStabilizationTime = 10s
RedundancyZoneTag = ""
DisableUpgradeMigration = false
UpgradeVersionTag = ""