Ghost agent node - no serfHealth check but still registered in the catalog

We’re using Consul for our AWS ECS services, and things have been running pretty smoothly. We’ve got a Consul cluster with 5 nodes on EC2, and each ECS task fires up a main container with our business logic and a Consul sidecar. This sidecar registers the main service when it starts up. We’ve got a router with its own Consul sidecar using Consul DNS to find these services.
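
For reference, the router's sidecar resolves instances through its local agent's DNS interface on port 8600 (per the agent config further down, with domain "consul" and datacenter "us-east-1"); a lookup looks roughly like this:

dig @127.0.0.1 -p 8600 synthetic-app-123.service.us-east-1.consul

Only instances with passing health checks should come back from that query, which is why the ghost node described below is a problem.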

The issue is that we sometimes see ghost nodes whose services remain after they were meant to have deregistered from the cluster. A node receives a SIGTERM and then usually deregisters successfully. For the ghost node, the agent logs suggest it deregistered successfully, but when we look at the service in the catalog, the node is still registered to it. The service health check is reporting fine, but there is no serfHealth check.

curl http://127.0.0.1:8500/v1/health/node/synthetic-app-123-5144440f3efd4f2c | jq .

[
  {
    "Node": "synthetic-app-123-5144440f3efd4f2c",
    "CheckID": "process-check",
    "Name": "Cage Process Check",
    "Status": "passing",
    "Notes": "",
    "Output": "HTTP GET http://localhost:3032: 200 OK Output: {\"controlPlane\":{\"status\":\"Ok\",\"message\":\"Control plane is running\"},\"dataPlane\":{\"status\":\"Ok\",\"message\":null}}",
    "ServiceID": "synthetic-app-123 ",
    "ServiceName": "synthetic-app-123",
    "ServiceTags": [
      "synthetic-app-123-5144440f3efd4f2c"
    ],
    "Type": "http",
    "Interval": "3s",
    "Timeout": "1s",
    "ExposedPort": 0,
    "Definition": {},
    "CreateIndex": 715741,
    "ModifyIndex": 715741
  }
]

Note the missing serfHealth check.
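
For comparison, on a healthy node the same endpoint also returns a serfHealth entry roughly like this (node name is just one of ours for illustration, some fields trimmed):

  {
    "Node": "synthetic-app-123-5c2be7c522ac4c77",
    "CheckID": "serfHealth",
    "Name": "Serf Health Status",
    "Status": "passing",
    "Notes": "",
    "Output": "Agent alive and reachable",
    "ServiceID": "",
    "ServiceName": "",
    "Type": ""
  }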

When it happens, a portion of requests ends up going to a dead container because the DNS query returns the IP of the ghost node. Normally each service has two tasks, so two Consul sidecar nodes get registered. But while this issue is occurring, three nodes show up, like so:


[
    {
      "ID": "002769a4-514d-8492-bdcb-b64e927977d4",
      "Node": "synthetic-app-123-5144440f3efd4f2c",
      "Address": "10.1.175.70",
      "Datacenter": "us-east-1",
      "TaggedAddresses": {
        "lan": "10.1.175.70",
        "lan_ipv4": "10.1.175.70",
        "wan": "10.1.175.70",
        "wan_ipv4": "10.1.175.70"
      },
      "NodeMeta": {
        "consul-network-segment": ""
      },
      "ServiceKind": "",
      "ServiceID": "synthetic-app-123",
      "ServiceName": "synthetic-app-123",
      "ServiceTags": [
        "synthetic-app-123"
      ],
      "ServiceAddress": "",
      "ServiceWeights": {
        "Passing": 1,
        "Warning": 1
      },
      "ServiceMeta": {},
      "ServicePort": 443,
      "ServiceSocketPath": "",
      "ServiceEnableTagOverride": false,
      "ServiceProxy": {
        "Mode": "",
        "MeshGateway": {},
        "Expose": {}
      },
      "ServiceConnect": {},
      "CreateIndex": 715741,
      "ModifyIndex": 715741
    },
    {
      "ID": "b10b81b9-c26c-fa72-91d5-c917175f69e4",
      "Node": "synthetic-app-123-5c2be7c522ac4c77",
      "Address": "10.1.159.202",
      "Datacenter": "us-east-1",
      "TaggedAddresses": {
        "lan": "10.1.159.202",
        "lan_ipv4": "10.1.159.202",
        "wan": "10.1.159.202",
        "wan_ipv4": "10.1.159.202"
      },
      "NodeMeta": {
        "consul-network-segment": ""
      },
      "ServiceKind": "",
      "ServiceID": "synthetic-app-123",
      "ServiceName": "synthetic-app-123",
      "ServiceTags": [
        "synthetic-app-123"
      ],
      "ServiceAddress": "",
      "ServiceWeights": {
        "Passing": 1,
        "Warning": 1
      },
      "ServiceMeta": {},
      "ServicePort": 443,
      "ServiceSocketPath": "",
      "ServiceEnableTagOverride": false,
      "ServiceProxy": {
        "Mode": "",
        "MeshGateway": {},
        "Expose": {}
      },
      "ServiceConnect": {},
      "CreateIndex": 771968,
      "ModifyIndex": 771968
    },
    {
      "ID": "ed580aad-8562-2347-0113-5c252eb68d70",
      "Node": "synthetic-app-123-6f27a5a8f36e4185",
      "Address": "10.1.190.50",
      "Datacenter": "us-east-1",
      "TaggedAddresses": {
        "lan": "10.1.190.50",
        "lan_ipv4": "10.1.190.50",
        "wan": "10.1.190.50",
        "wan_ipv4": "10.1.190.50"
      },
      "NodeMeta": {
        "consul-network-segment": ""
      },
      "ServiceKind": "",
      "ServiceID": "synthetic-app-123",
      "ServiceName": "synthetic-app-123",
      "ServiceTags": [
        "synthetic-app-123"
      ],
      "ServiceAddress": "",
      "ServiceWeights": {
        "Passing": 1,
        "Warning": 1
      },
      "ServiceMeta": {},
      "ServicePort": 443,
      "ServiceSocketPath": "",
      "ServiceEnableTagOverride": false,
      "ServiceProxy": {
        "Mode": "",
        "MeshGateway": {},
        "Expose": {}
      },
      "ServiceConnect": {},
      "CreateIndex": 772011,
      "ModifyIndex": 772011
    }
  ]
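
As a stopgap when this happens, a stale catalog entry like the first one above can normally be removed by hand via the catalog deregister endpoint (run against any live agent or server; add an ACL token if ACLs are enabled):

curl -X PUT http://127.0.0.1:8500/v1/catalog/deregister \
  -d '{"Datacenter": "us-east-1", "Node": "synthetic-app-123-5144440f3efd4f2c"}'

Deregistering the node this way removes it and all of its services and checks from the catalog, but obviously we'd rather understand why the agent's own graceful leave isn't doing that.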

Here's the agent config:

{
    "advertise_reconnect_timeout": "2s",
    "data_dir": "/var/consul/",
    "datacenter": "us-east-1",
    "dns_config": {
       "a_record_limit": 1
    },
    "domain": "consul",
    "enable_central_service_config": true,
    "enable_script_checks": false,
    "encrypt": "[key]",
    "log_level": "INFO",
    "node_name": "synthetic-app-123-5c2be7c522ac4c77",
    "ports": {
       "dns": 8600,
       "http": 8500,
       "https": 8443
    },
    "retry_join": [
       "provider=aws tag_key=Name tag_value=consul"
    ],
    "retry_max": 2,
    "server": false,
    "service": [
       {
          "check": {
             "deregister_critical_service_after": "1s",
             "header": {
                "User-Agent": [
                   "ECS-HealthCheck"
                ]
             },
             "http": "http://localhost:3032",
             "id": "process-check",
             "interval": "3s",
             "method": "GET",
             "name": "Cage Process Check",
             "success_before_passing": 4,
             "timeout": "1s",
             "tls_skip_verify": true
          },
          "id": "synthetic-app-123",
          "name": "synthetic-app-123",
          "port": 443,
          "tags": [
            "synthetic-app-123"
          ]
       }
    ]
 }

Logs from the ghost node's agent when shutting down:

2023-11-17T13:11:24.915Z [INFO] agent: Caught: signal=terminated
2023-11-17T13:11:24.915Z [INFO] agent: Gracefully shutting down agent...
2023-11-17T13:11:24.915Z [INFO] agent.client: client starting leave
2023-11-17T13:11:25.126Z [INFO] agent.client.serf.lan: serf: EventMemberLeave: synthetic-app-123-5144440f3efd4f2c 10.1.175.70
2023-11-17T13:11:28.245Z [INFO] agent: Synced node info
2023-11-17T13:11:28.247Z [INFO] agent: Synced service: service=synthetic-app-123
2023-11-17T13:11:28.726Z [INFO] agent: Graceful exit completed
2023-11-17T13:11:28.726Z [INFO] agent: Requesting shutdown
2023-11-17T13:11:28.726Z [INFO] agent.client: shutting down client
2023-11-17T13:11:28.730Z [INFO] agent: consul client down
2023-11-17T13:11:28.730Z [INFO] agent: shutdown complete
2023-11-17T13:11:28.730Z [INFO] agent: Stopping server: protocol=DNS address=127.0.0.1:8600 network=tcp
2023-11-17T13:11:28.730Z [INFO] agent: Stopping server: protocol=DNS address=127.0.0.1:8600 network=udp
2023-11-17T13:11:28.730Z [INFO] agent: Stopping server: address=127.0.0.1:8443 network=tcp protocol=https
2023-11-17T13:11:28.730Z [INFO] agent: Stopping server: address=127.0.0.1:8500 network=tcp protocol=http
2023-11-17T13:11:28.730Z [INFO] agent: Waiting for endpoints to shut down
2023-11-17T13:11:28.730Z [INFO] agent: Endpoints down
2023-11-17T13:11:28.730Z [INFO] agent: Exit code: code=0
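
A quick way to check whether the node actually left the catalog after that shutdown (a null response means it deregistered cleanly; <server-ip> is a placeholder for one of the Consul servers):

curl http://<server-ip>:8500/v1/catalog/node/synthetic-app-123-5144440f3efd4f2c | jq .

In the ghost case, that query still returns the node with the service attached, despite the "Graceful exit completed" line above.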

Would anyone have any idea what’s going on here?

What version of Consul and Consul-ECS are you running? This sounds like a known bug that, I think, we have fixed.

Hi Jeff, thanks for getting back! Both the servers and the agents are running Consul 1.15.6.

Also, just to clarify, we're not using the official Consul-ECS image. The ECS Consul sidecar we use is just a Dockerfile that, on startup, generates a dynamic config (the agent config in the original post) and then runs the Consul agent with that generated config.
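
Roughly, the entrypoint is along these lines (a minimal sketch; the paths, env vars, and the metadata lookup here are illustrative, not our exact script):

#!/bin/sh
# Sketch of the sidecar entrypoint: derive a per-task node name from the
# ECS task metadata, render the agent config, then run the agent in the foreground.
TASK_ID="$(curl -s "${ECS_CONTAINER_METADATA_URI_V4}/task" | jq -r '.TaskARN' | awk -F/ '{print $NF}')"
NODE_NAME="${SERVICE_NAME}-${TASK_ID}"

cat > /consul/config/agent.json <<EOF
{
  "node_name": "${NODE_NAME}",
  "datacenter": "us-east-1",
  "data_dir": "/var/consul/",
  "server": false,
  "retry_join": ["provider=aws tag_key=Name tag_value=consul"]
}
EOF

exec consul agent -config-file=/consul/config/agent.json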

Just writing back here to confirm that upgrading the Consul version appears to have fixed the issue. We haven't encountered it since upgrading to 1.17.0. Thanks so much for the help @Jeff-Apple.