DNS records for external services stop resolving

For a while now I’ve been seeing an issue where the DNS record for an external service stops resolving after some time. I’m running a 3-node Consul cluster with 3 consul-esm instances deployed alongside it.

I created the external service with the following config:

resource "consul_node" "consul_node_vip" {
  address = local.vip_ip
  name    = "ha-lb-vip-${local.name_stub}"
  meta = {
    "external-node" : "true",
    "external-probe" : "true"
  }
}

resource "consul_service" "consul_service_vip_port" {
  name    = "ha-lb-vip-${local.name_stub}"
  address = local.vip_ip
  node    = consul_node.consul_node_vip.name
  port    = local.vip_port

  check {
    check_id = "ha-lb-vip-${local.name_stub}:${local.vip_port}"
    name     = "TCP on port ${local.vip_port}"
    tcp      = "${local.vip_ip}:${local.vip_port}"
    interval = "10s"
    timeout  = "2s"
  }
}
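
For reference, this is roughly how I check whether the record resolves, using Consul’s DNS interface (assuming a local agent with the default DNS port 8600):

dig @127.0.0.1 -p 8600 ha-lb-vip-k8s-mgmt.service.consul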

Once the address fails to resolve, running Terraform again fixes the issue; the plan shows the service being recreated:

Terraform will perform the following actions:

 # consul_service.consul_service_vip_port will be created
  + resource "consul_service" "consul_service_vip_port" {
      + address    = "172.25.0.64"
      + datacenter = (known after apply)
      + id         = (known after apply)
      + name       = "ha-lb-vip-k8s-mgmt"
      + node       = "ha-lb-vip-k8s-mgmt"
      + port       = 443
      + service_id = (known after apply)

      + check {
          + check_id                          = "ha-lb-vip-k8s-mgmt:443"
          + deregister_critical_service_after = "30s"
          + interval                          = "10s"
          + method                            = "GET"
          + name                              = "TCP on port 443"
          + status                            = (known after apply)
          + tcp                               = "172.25.0.64:443"
          + timeout                           = "2s"
          + tls_skip_verify                   = false
        }
    }

Plan: 2 to add, 0 to change, 0 to destroy.

I’m not overly familiar with Consul, but when I check the UI (before re-running Terraform) I can see the name and IP listed under Nodes, yet it is not listed under Services (as expected, since Terraform wants to recreate it).

What am I missing here?

Side note:

In my Consul logs I can see a fair bit of this:

2022-05-29T19:42:40.994Z [WARN]  agent: Coordinate update blocked by ACLs: accessorID=f69aab37-2cc6-2390-5f0f-19392bfc1e16
2022-05-29T19:42:47.184Z [WARN]  agent: Node info update blocked by ACLs: node=129b8f19-f14d-0153-9913-76ec745fc85f accessorID=f69aab37-2cc6-2390-5f0f-19392bfc1e16

Your plan output shows deregister_critical_service_after = "30s". This means that if the health check stays critical for that long, the service will automatically be removed from Consul … probably not what you want if you’re provisioning it from Terraform rather than having the service maintain its own registration.
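
If you want the Terraform-managed registration to survive brief check flaps, you could set that field explicitly in the check block. A sketch, where "8h" is just an illustrative value:

check {
  check_id = "ha-lb-vip-${local.name_stub}:${local.vip_port}"
  name     = "TCP on port ${local.vip_port}"
  tcp      = "${local.vip_ip}:${local.vip_port}"
  interval = "10s"
  timeout  = "2s"

  # If this is left unset, the provider defaults to "30s", as seen in
  # your plan output. "8h" is an arbitrary value chosen to tolerate
  # longer outages before Consul reaps the registration.
  deregister_critical_service_after = "8h"
}

Note that Consul enforces a minimum deregistration timeout of 1 minute, and the reaper that removes critical services runs every 30 seconds, so the effective window can be longer than the configured "30s".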

Part of setting up a fully functional Consul cluster with ACLs enabled is providing each Consul agent with an agent token that has node:write permission for its own node_name. (In less secure environments, it may be deemed sufficient to provide every agent with the same token, granting node:write permission for every node_name.) Doing that will resolve these messages.
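
As a minimal sketch, a per-agent ACL policy could look like this, where "consul-node-1" is a placeholder for each agent’s own node_name:

node "consul-node-1" {
  policy = "write"
}

# Less secure shared-token variant: write access to every node name.
# node_prefix "" {
#   policy = "write"
# }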

Ok, the service deregistering itself would make sense, but that would imply the check had failed for three consecutive attempts (given the 10s interval and the 30s window)? That service points to a floating IP managed by keepalived, and it’s never inaccessible for that long, otherwise I would see this in other services that connect to it.

The ACLs make sense - I thought I had that set up correctly but seemingly not.

I still think a glitch in health checking, leading to the service being deregistered, is the most likely issue here. Check your Consul logs around the time the service disappears - I’m fairly sure deregistrations are reported there.
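
You could also poll the check’s status over the HTTP API while waiting for it to recur (assuming a local agent on the default port 8500):

curl http://127.0.0.1:8500/v1/health/checks/ha-lb-vip-k8s-mgmt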