Services not deregistered after Nomad stops a job

Bonjour :wave:

Nomad 1.5
Consul 1.15.1

I am trying to install the HashiStack (Consul, Vault, Nomad) with ACLs and TLS. I'm close to success! :pensive:

But I have a problem with services in Consul.

When I stop a job, the service is not deregistered, and if I restart the tasks, I get additional instances in Consul.

I tried:

  • Deregistering with the Consul CLI, but very often I get a 404 response
  • Restarting Consul and Nomad
  • Calling an exorcist
  • nomad system gc
  • Reading other topics and trying their suggestions
  • Purging the job, but the service remains…

I use this simple job tester:

job "tester" {
  region = "global"
  datacenters = ["dc1"]
  type = "service"

  group "tester" {

    count = 1

    restart {
      attempts = 10
      interval = "5m"
      delay = "10s"
      mode = "delay"
    }

    network {
      mode = "bridge"
    }

    service {
      name = "mesh"

      connect {
        sidecar_service {
          proxy {
            upstreams {
              destination_name = "redis"
              local_bind_port  = 6379
            }
          }
        }
        sidecar_task {
          resources {
            cpu    = 16
            memory = 16
          }
        }
      }
    }

    task "tester" {
      driver = "docker"

      config {
        image = "alpine:latest"
        entrypoint = ["/bin/sleep", "3600"]
      }

      resources {
        cpu    = 128
        memory = 128
      }
    }
  }
}

I can reproduce this problem on different VMs.
So it's a problem with my configuration, but which configuration? Nomad? Consul? The World?

Nomad:

name = "pc"
region = "global"
datacenter = "dc1"

enable_debug = false
disable_update_check = false


bind_addr = "0.0.0.0"
advertise {
    http = "192.168.10.107:4646"
    rpc = "192.168.10.107:4647"
    serf = "192.168.10.107:4648"
}
ports {
    http = 4646
    rpc = 4647
    serf = 4648
}

consul {
    # The address to the Consul agent.
    address = "127.0.0.1:8501"
    grpc_address = "127.0.0.1:8502"
    ssl = true
    ca_file = "/etc/ssl/hashistack/hashistack-ca.pem"
    cert_file = "/etc/ssl/hashistack/dc1-server-consul.pem"
    key_file = "/etc/ssl/hashistack/dc1-server-consul.key"
    token = "zdczeczczc-49-6107-920a84d349d3"
    # The service name to register the server and client with Consul.
    server_service_name = "nomad-servers"
    client_service_name = "nomad-clients"
    tags = {}

    # Enables automatically registering the services.
    auto_advertise = true

    # Enabling the server and client to bootstrap using Consul.
    server_auto_join = true
    client_auto_join = true
}

data_dir = "/var/nomad"

log_level = "INFO"
enable_syslog = true

leave_on_terminate = true
leave_on_interrupt = false

tls {
    http = true
    rpc = true
    ca_file = "/etc/ssl/hashistack/hashistack-ca.pem"
    cert_file = "/etc/ssl/hashistack/dc1-server-nomad.pem"
    key_file = "/etc/ssl/hashistack/dc1-server-nomad.key"
    rpc_upgrade_mode = true
    verify_server_hostname = true
    verify_https_client = false
}

acl {
    enabled = true
    token_ttl = "30s"
    policy_ttl = "30s"
    replication_token = ""
}

telemetry {
    disable_hostname = false
    collection_interval = "1s"
    use_node_name = false
    publish_allocation_metrics = false
    publish_node_metrics = false
    filter_default = true
    prefix_filter = []
    disable_dispatched_job_summary_metrics = false
    statsite_address = ""
    statsd_address = ""
    datadog_address = ""
    datadog_tags = []
    prometheus_metrics = true
    circonus_api_token = ""
    circonus_api_app = "nomad"
    circonus_api_url = "https://api.circonus.com/v2"
    circonus_submission_interval = "10s"
    circonus_submission_url = ""
    circonus_check_id = ""
    circonus_check_force_metric_activation = false
    circonus_check_instance_id = ""
    circonus_check_search_tag = ""
    circonus_check_display_name = ""
    circonus_check_tags = ""
    circonus_broker_id = ""
    circonus_broker_select_tag = ""
}

autopilot {
    cleanup_dead_servers      = true
    last_contact_threshold    = "200ms"
    max_trailing_logs         = 250
    server_stabilization_time = "10s"
}

Consul:

{
    "acl": {
        "default_policy": "deny",
        "down_policy": "extend-cache",
        "enable_token_persistence": true,
        "enabled": true,
        "token_ttl": "30s",
        "tokens": {
            "initial_management": "698eeb8c-3137-54df-aa9a-19ab58f1e4e8",
            "replication": "7adcd89e-f121-571b-ba33-79c14235b17c"
        }
    },
    "addresses": {
        "dns": "0.0.0.0",
        "grpc": "0.0.0.0",
        "http": "0.0.0.0",
        "https": "0.0.0.0"
    },
    "advertise_addr": "192.168.10.107",
    "advertise_addr_wan": "192.168.10.107",
    "auto_encrypt": {},
    "autopilot": {
        "cleanup_dead_servers": false,
        "last_contact_threshold": "200ms",
        "max_trailing_logs": 250,
        "server_stabilization_time": "10s"
    },
    "bind_addr": "192.168.10.107",
    "bootstrap": false,
    "bootstrap_expect": 1,
    "client_addr": "127.0.0.1",
    "connect": {
        "enabled": true
    },
    "data_dir": "/opt/consul",
    "datacenter": "dc1",
    "disable_update_check": false,
    "domain": "consul",
    "enable_local_script_checks": false,
    "enable_script_checks": false,
    "encrypt": "LfPG/ZrttURHHihjqHxsPzPTSUPX9N45F4OALhPYwtQ=",
    "encrypt_verify_incoming": true,
    "encrypt_verify_outgoing": true,
    "log_file": "/var/log/consul/consul.log",
    "log_level": "INFO",
    "log_rotate_bytes": 0,
    "log_rotate_duration": "24h",
    "log_rotate_max_files": 0,
    "performance": {
        "leave_drain_time": "5s",
        "raft_multiplier": 1,
        "rpc_hold_timeout": "7s"
    },
    "ports": {
        "dns": 8600,
        "grpc": 8502,
        "grpc_tls": 8503,
        "http": -1,
        "https": 8501,
        "serf_lan": 8301,
        "serf_wan": 8302,
        "server": 8300
    },
    "primary_datacenter": "dc1",
    "raft_protocol": 3,
    "retry_interval": "30s",
    "retry_interval_wan": "30s",
    "retry_join": [
        "192.168.10.107"
    ],
    "retry_max": 0,
    "retry_max_wan": 0,
    "server": true,
    "tls": {
        "defaults": {
            "ca_file": "/etc/ssl/hashistack/hashistack-ca.pem",
            "cert_file": "/etc/ssl/hashistack/dc1-server-consul.pem",
            "key_file": "/etc/ssl/hashistack/dc1-server-consul.key",
            "tls_min_version": "TLSv1_2",
            "verify_incoming": true,
            "verify_outgoing": true
        },
        "https": {
            "verify_incoming": false
        },
        "internal_rpc": {
            "verify_incoming": true,
            "verify_server_hostname": true
        }
    },
    "translate_wan_addrs": false,
    "ui_config": {
        "enabled": true
    }
}

:ring_buoy:
Any advice?
Thanks

As I read in the documentation, there are some changes in Consul 1.14+.

I changed my Nomad configuration to:

consul {
    # The address to the Consul agent.
    address      = "127.0.0.1:8501"
    grpc_address = "127.0.0.1:8503"
    ssl = true
    grpc_ca_file = "/etc/ssl/hashistack/hashistack-ca.pem"
    ca_file = "/etc/ssl/hashistack/hashistack-ca.pem"
    cert_file = "/etc/ssl/hashistack/dc1-server-consul.pem"
    key_file = "/etc/ssl/hashistack/dc1-server-consul.key"
    token = "ebfb82e3-1d84-95d3-22d0-269b427136fb"
    # The service name to register the server and client with Consul.
    server_service_name = "nomad-servers"
    client_service_name = "nomad-clients"
    tags = {}

    # Enables automatically registering the services.
    auto_advertise = true

    # Enabling the server and client to bootstrap using Consul.
    server_auto_join = true
    client_auto_join = true
}

and Consul:

    "addresses": {
        "dns": "0.0.0.0",
        "grpc_tls": "0.0.0.0",
        "http": "0.0.0.0",
        "https": "0.0.0.0"
    },

[...]

    "ports": {
        "dns": 8600,
        "grpc": 8502,
        "grpc_tls": 8503,
        "http": -1,
        "https": 8501,
        "serf_lan": 8301,
        "serf_wan": 8302,
        "server": 8300
    },

And wonderful, a new error!

[2023-03-10 15:16:30.771][1][warning][config] [./source/common/config/grpc_stream.h:201] DeltaAggregatedResources gRPC config stream to local_agent closed since 1509s ago: 14, upstream connect error or disconnect/reset before headers. reset reason: connection failure, transport failure reason: TLS error: 268436498:SSL routines:OPENSSL_internal:SSLV3_ALERT_BAD_CERTIFICATE
[2023-03-10 15:16:47.606][1][warning][config] [./source/common/config/grpc_stream.h:201] DeltaAggregatedResources gRPC config stream to local_agent closed since 1526s ago: 14, upstream connect error or disconnect/reset before headers. reset reason: connection failure, transport failure reason: TLS error: 268436498:SSL routines:OPENSSL_internal:SSLV3_ALERT_BAD_CERTIFICATE

The certificates are valid. I don't understand.

And in the Consul log:

2023-03-10T16:19:44.610+0100 [WARN]  agent: Check deregistration blocked by ACLs: check=service:_nomad-task-c725a40d-916a-50fb-b91c-fc3f04c94f21-group-mosquitto-mqtt-1883-sidecar-proxy:2 accessorID="anonymous token"
2023-03-10T16:19:44.610+0100 [WARN]  agent: Check deregistration blocked by ACLs: check=service:_nomad-task-694fd9b8-4bb2-d4fa-b3ed-52423f68f653-group-mosquitto-mqtt-1883-sidecar-proxy:2 accessorID="anonymous token"

But it shouldn't be an anonymous token?!
The service is marked as registered via Nomad.

I’m lost… :disappointed_relieved:

Hi @fred-gb, sorry, I have no idea about the certificate problem; those errors are coming from the Envoy proxy that Consul uses under the hood, so you may have better luck asking in the Consul forum.

As for deregistrations blocked by ACLs - this reminds me of https://github.com/hashicorp/consul/issues/9577 where Consul would delete the service identity token before the service, causing further deregistration attempts to fail. The solution was/is to give the Consul anonymous token sufficient ACL permissions to remove services from itself, which Consul then uses as a fallback mechanism.
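For reference, the workaround described in that issue amounts to attaching an ACL policy roughly like this to Consul's anonymous token (a sketch; in practice you would narrow the service prefix to your Nomad-managed services rather than use the empty catch-all):

```hcl
# Hypothetical policy for the anonymous token, allowing it to
# deregister services as a fallback when the original service
# identity token has already been deleted.
service_prefix "" {
  policy = "write"
}
```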

1 Like

Hello @seth.hoenig

Thanks. I will continue this in another forum.

For the deregistration problem, it's weird that after improving the security level with ACLs and certs, I have to give the anonymous token write permissions.

So… Another problem.

Thanks again, and I will post the solution here, if one exists.

        "tokens": {
            "initial_management": "698eeb8c-3137-54df-aa9a-19ab58f1e4e8",
            "replication": "7adcd89e-f121-571b-ba33-79c14235b17c"
        }

I've had the same issue no later than today. With a default policy of deny, you need to specify the agent token for the Consul agent to perform tasks on the node, or else it will try to do so using the anonymous token.

What I think happens, if you check the Consul logs, is that the agent on the node where the service is running cannot deregister the service when Nomad stops the job, hence the duplicates.

        "tokens": {
            "initial_management": "698eeb8c-3137-54df-aa9a-19ab58f1e4e8",
            "replication": "7adcd89e-f121-571b-ba33-79c14235b17c",
            "agent": "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX"
        }

let me know if that works :slight_smile:
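For context, with a default policy of deny the agent token typically needs node write plus service write so the agent can maintain its own registration and (de)register the services running on it. A policy along these lines (a sketch; narrow the prefixes to your naming scheme):

```hcl
# Hypothetical agent-token policy: node registration/updates plus
# (de)registration of services running on this node.
node_prefix "" {
  policy = "write"
}
service_prefix "" {
  policy = "write"
}
```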

1 Like

Thanks @ednxzu !
:partying_face:
It works!

Before your message, I tested another solution, as @seth.hoenig suggested with the GitHub link:

  • Create a policy with service_prefix and write permission
  • Create a token associated with this policy
  • Set this new token as the default token

But your solution seems better, I think.
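Those steps can be sketched with the Consul CLI roughly like this (names are illustrative, and the commands assume a reachable Consul agent with a management token in CONSUL_HTTP_TOKEN):

```shell
# 1. Create a policy granting write on services
#    (narrow the prefix in a real setup)
consul acl policy create -name "service-write" \
  -rules 'service_prefix "" { policy = "write" }'

# 2. Create a token attached to that policy
consul acl token create -description "node default token" \
  -policy-name "service-write"

# 3. Set the new token's SecretID as the agent's default token
consul acl set-agent-token default "<secret-id>"
```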

I need to add this to Ansible now.

And my other problem has been posted on the Consul GitHub:

Thanks again to all!

1 Like