Nomad unexpectedly deregisters jobs from Consul

Hi. I'm trying to understand why my two client nodes keep deregistering all jobs. I expect all jobs to remain registered in Consul. Thanks for any help!

What happens:

  1. task 8085765b is registered in Consul by nomad-stage-02
  2. a few seconds pass
  3. task 8085765b is deregistered by nomad-stage-03

Cluster:

  • 1 server
  • 2 clients
# nomad server members
Name                 Address     Port  Status  Leader  Protocol  Build   Datacenter  Region
nomad-stage-01.lon  10.0.15.50  4648  alive   true    2         0.10.2  dc1         lon

# nomad node status
ID        DC   Name            Class   Drain  Eligibility  Status
0eb5e83c  dc1  nomad-stage-02  <none>  false  eligible     ready
2e0df0d5  dc1  nomad-stage-03  <none>  false  eligible     ready

job status

# nomad job status api
ID            = api
Name          = api
Submit Date   = 2019-12-14T22:36:22Z
Type          = service
Priority      = 50
Datacenters   = dc1
Status        = running
Periodic      = false
Parameterized = false

Summary
Task Group  Queued  Starting  Running  Failed  Complete  Lost
api       0       0         2        0       0         0

Latest Deployment
ID          = 899143f4
Status      = successful
Description = Deployment completed successfully

Deployed
Task Group  Desired  Placed  Healthy  Unhealthy  Progress Deadline
api       2        2       2        0          2019-12-14T22:46:34Z

Allocations
ID        Node ID   Task Group  Version  Desired  Status   Created    Modified
8085765b  0eb5e83c  api       0        run      running  8m29s ago  8m17s ago
8e256817  2e0df0d5  api       0        run      running  8m29s ago  8m18s ago

Consul server HTTP requests
I see requests to deregister the tasks.

# tshark -i 2 -T fields -e ip.src -e http.request.uri  -Y "http.request.uri contains \"/deregister/\""

10.0.15.2	/v1/agent/service/deregister/_nomad-task-8085765b-03bd-b8ce-ff85-3dea58b3a1fd-server-api-http
10.0.15.144	/v1/agent/service/deregister/_nomad-task-8e256817-9c4e-a486-9faa-fee36e56fe88-server-api-http
10.0.15.144	/v1/agent/service/deregister/_nomad-task-054371fd-8635-84ab-bab6-b9d6f634577b-redis-cache-redis-db
10.0.15.144	/v1/agent/check/deregister/_nomad-check-ace23129793db27d02688b0fe2e809600bb12a18
10.0.15.2	/v1/agent/service/deregister/_nomad-task-8085765b-03bd-b8ce-ff85-3dea58b3a1fd-server-api-http
10.0.15.144	/v1/agent/service/deregister/_nomad-task-054371fd-8635-84ab-bab6-b9d6f634577b-redis-cache-redis-db
10.0.15.144	/v1/agent/service/deregister/_nomad-task-8e256817-9c4e-a486-9faa-fee36e56fe88-server-api-http
10.0.15.144	/v1/agent/check/deregister/_nomad-check-ace23129793db27d02688b0fe2e809600bb12a18
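To see at a glance which source address deregisters which allocation, the capture above can be grouped by source IP and target. This is only a rough sketch: the `summarize` helper and the `CAPTURE` constant are hypothetical names, the input is the tshark output pasted verbatim, and the mapping from IP address to Nomad node is not known from the capture itself.

```python
from collections import Counter

# Lines copied from the tshark capture above: source IP, a tab, then the request URI.
CAPTURE = """\
10.0.15.2\t/v1/agent/service/deregister/_nomad-task-8085765b-03bd-b8ce-ff85-3dea58b3a1fd-server-api-http
10.0.15.144\t/v1/agent/service/deregister/_nomad-task-8e256817-9c4e-a486-9faa-fee36e56fe88-server-api-http
10.0.15.144\t/v1/agent/service/deregister/_nomad-task-054371fd-8635-84ab-bab6-b9d6f634577b-redis-cache-redis-db
10.0.15.144\t/v1/agent/check/deregister/_nomad-check-ace23129793db27d02688b0fe2e809600bb12a18
10.0.15.2\t/v1/agent/service/deregister/_nomad-task-8085765b-03bd-b8ce-ff85-3dea58b3a1fd-server-api-http
10.0.15.144\t/v1/agent/service/deregister/_nomad-task-054371fd-8635-84ab-bab6-b9d6f634577b-redis-cache-redis-db
10.0.15.144\t/v1/agent/service/deregister/_nomad-task-8e256817-9c4e-a486-9faa-fee36e56fe88-server-api-http
10.0.15.144\t/v1/agent/check/deregister/_nomad-check-ace23129793db27d02688b0fe2e809600bb12a18
"""


def summarize(capture):
    """Count deregister requests per (source IP, kind, short ID)."""
    counts = Counter()
    for line in capture.splitlines():
        if not line.strip():
            continue
        src, uri = line.split(None, 1)
        # URI layout: /v1/agent/<kind>/deregister/<id>
        kind = uri.split("/")[3]            # "service" or "check"
        target = uri.rsplit("/", 1)[1]
        for prefix in ("_nomad-task-", "_nomad-check-"):
            if target.startswith(prefix):
                # Keep the first 8 characters, as in the short alloc IDs.
                target = target[len(prefix):][:8]
        counts[(src, kind, target)] += 1
    return counts


if __name__ == "__main__":
    for (src, kind, target), n in sorted(summarize(CAPTURE).items()):
        print(f"{src:<12} {kind:<8} {target}  x{n}")
```

Comparing the source addresses against the addresses of nomad-stage-02 and nomad-stage-03 would then show whether a client is deregistering a task that belongs to the other node.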

Consul log

consul[137]:2019/12/14 22:47:56 [DEBUG] agent: Service "_nomad-client-y3ummobs4wezbb2uf2t5a6i24eaiecat" in sync
consul[137]:2019/12/14 22:47:56 [DEBUG] agent: Service "_nomad-server-naeump42jytincxz3m2tyan3lsoyrlpm" in sync
consul[137]:2019/12/14 22:47:56 [DEBUG] agent: Service "_nomad-client-sal345rz4h4ypmuqbrpfw6fi74yrbnr6" in sync
consul[137]:2019/12/14 22:47:56 [DEBUG] agent: Service "_nomad-server-v3oww4banlushdezlxku5g2ho5f24cir" in sync
consul[137]:2019/12/14 22:47:56 [DEBUG] agent: Service "_nomad-server-3au7gyp32cqshfolntomr4edckoxupij" in sync
consul[137]:2019/12/14 22:47:56 [DEBUG] agent: Check "_nomad-check-ae73c17743eda6d8176d4a3e6e984cf94027a392" in sync
consul[137]:2019/12/14 22:47:56 [DEBUG] agent: Check "_nomad-check-be4eefd46339cd5ce496d026621570bd4a49a9eb" in sync
consul[137]:2019/12/14 22:47:56 [DEBUG] agent: Check "_nomad-check-a709d1a775b05fb7bbc6dc6896f215c5dc63fd26" in sync
consul[137]:2019/12/14 22:47:56 [DEBUG] agent: Check "_nomad-check-c73d87a15cab159cd2d6cbc18de9e25dc238907c" in sync
consul[137]:2019/12/14 22:47:56 [DEBUG] agent: Check "_nomad-check-704a3311d007b397c42b3697997dea4cf848c64d" in sync
consul[137]:2019/12/14 22:47:56 [DEBUG] agent: Node info in sync
consul[137]:2019/12/14 22:47:56 [DEBUG] http: Request PUT /v1/agent/service/deregister/_nomad-task-8085765b-03bd-b8ce-ff85-3dea58b3a1fd-server-api-http (6.371574ms) from=10.0.15.2:48068

Nomad logs

nomad[14317]:     2019-12-14T22:49:56.055Z [DEBUG] consul.sync: sync complete: registered_services=1 deregistered_services=2 registered_checks=0 deregistered_checks=1
nomad[14317]: consul.sync: sync complete: registered_services=1 deregistered_services=2 registered_checks=0 deregistered_checks=1
nomad[14317]:     2019-12-14T22:50:00.881Z [DEBUG] http: request complete: method=GET path=/v1/agent/health?type=client duration=614.186µs
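If it helps to track these counters over time, the `sync complete` lines can be parsed into numbers. A minimal sketch, assuming the log format stays exactly as pasted above; `sync_counters` is a hypothetical helper name, not part of Nomad.

```python
import re

# One "sync complete" line copied from the Nomad log above.
LINE = (
    "nomad[14317]:     2019-12-14T22:49:56.055Z [DEBUG] consul.sync: "
    "sync complete: registered_services=1 deregistered_services=2 "
    "registered_checks=0 deregistered_checks=1"
)


def sync_counters(line):
    """Return the key=value counters of a consul.sync line, or None."""
    if "consul.sync: sync complete:" not in line:
        return None
    # Everything after the marker is a list of name=integer pairs.
    _, _, tail = line.partition("sync complete:")
    return {key: int(value) for key, value in re.findall(r"(\w+)=(\d+)", tail)}


if __name__ == "__main__":
    print(sync_counters(LINE))
```

A non-zero `deregistered_services` on every sync cycle, as in the line above, would match the repeated deregister requests seen in the capture.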

server.hcl

server {
    enabled = true

    bootstrap_expect = 1

    rejoin_after_leave = false

    enabled_schedulers = ["service","batch","system"]
    num_schedulers = 1

    node_gc_threshold = "24h"
    eval_gc_threshold = "1h"
    job_gc_threshold = "4h"

    encrypt = ""
}

client.hcl

client {
    enabled = true

    node_class = ""
    no_host_uuid = false

    max_kill_timeout = "30s"

    network_speed = 0
    cpu_total_compute = 0

    gc_interval = "1m"
    gc_disk_usage_threshold = 80
    gc_inode_usage_threshold = 70
    gc_parallel_destroys = 2

    reserved {
        cpu = 0
        memory = 0
        disk = 0
    }


    options = {
        "docker.auth.config" = "/root/.docker/config.json"
        "docker.cleanup.image" = "0"
        "driver.raw_exec.enable" = "1"
    }

}

base.hcl

name = "{HOSTNAME}"
region = "lon"
datacenter = "dc1"

enable_debug = true

bind_addr = "{BIND_ADDR}"
advertise {
    http = "{BIND_ADDR}:4646"
    rpc = "{BIND_ADDR}:4647"
    serf = "{BIND_ADDR}:4648"
}
ports {
    http = 4646
    rpc = 4647
    serf = 4648
}

consul {
    # The address to the Consul agent.
    address = "{CONSUL_ADDR}:8500"
    token = "myToken"
    # The service name to register the server and client with Consul.
    server_service_name = "nomad-servers"
    client_service_name = "nomad-clients"
    tags = {}

    # Enables automatically registering the services.
    auto_advertise = true

    # Enabling the server and client to bootstrap using Consul.
    server_auto_join = true
    client_auto_join = true
}

data_dir = "/var/nomad"

log_level = "DEBUG"

Hi @atomlab, there’s nothing in the client or server config that jumps out at me here. It might help if we could see the job spec. Also, which version of Nomad is this?

job.hcl

job "${app_name}" {

  datacenters = ["dc1"]
  region = "fsn1"
  type = "service"

  group "${app_name}" {
    count = 2

    reschedule {
      unlimited      = true
      delay          = "5s"
      delay_function = "constant"
    }

    restart {
      attempts = 3
      delay    = "30s"
    }

    update {
      max_parallel     = 1
      canary           = 1
      min_healthy_time = "30s"
      healthy_deadline = "1m"
      auto_revert      = true
      auto_promote     = true
      health_check     = "task_states"
    }

    task "server" {
      driver = "docker"
      config {
        image = "${image}"
        port_map {
          http = 3000
        }
      }
      env {
        MONGODB_HOST = "10.0.15.30"
        PORT = 3000
      }
      service {
        name = "${app_name}"
        port = "http"
        tags = [
          "${app_name}",
          "traefik.tags=service",
          "traefik.frontend.rule=Host:${app_fqdn}"
        ]
        canary_tags = ["canary"]
        check {
          type     = "http"
          path     = "/"
          interval = "10s"
          timeout  = "2s"
          
          check_restart {
            limit = 3
            grace = "30s"
            ignore_warnings = false
          }
        }
      }
      resources {
        network {
          mbits = 10
          port "http" {}
        }
      } # End resources
    } # End tasks
  } # End group
} # End job

nomad job inspect api

{
    "Job": {
        "Affinities": null,
        "AllAtOnce": false,
        "Constraints": null,
        "CreateIndex": 12582,
        "Datacenters": [
            "dc1"
        ],
        "Dispatched": false,
        "ID": "api",
        "JobModifyIndex": 12582,
        "Meta": null,
        "Migrate": null,
        "ModifyIndex": 12599,
        "Name": "api",
        "Namespace": "default",
        "ParameterizedJob": null,
        "ParentID": "",
        "Payload": null,
        "Periodic": null,
        "Priority": 50,
        "Region": "fsn1",
        "Reschedule": null,
        "Spreads": null,
        "Stable": true,
        "Status": "running",
        "StatusDescription": "",
        "Stop": false,
        "SubmitTime": 1577026854355624463,
        "TaskGroups": [
            {
                "Affinities": null,
                "Constraints": null,
                "Count": 2,
                "EphemeralDisk": {
                    "Migrate": false,
                    "SizeMB": 300,
                    "Sticky": false
                },
                "Meta": null,
                "Migrate": {
                    "HealthCheck": "checks",
                    "HealthyDeadline": 300000000000,
                    "MaxParallel": 1,
                    "MinHealthyTime": 10000000000
                },
                "Name": "api",
                "Networks": null,
                "ReschedulePolicy": {
                    "Attempts": 0,
                    "Delay": 5000000000,
                    "DelayFunction": "constant",
                    "Interval": 0,
                    "MaxDelay": 3600000000000,
                    "Unlimited": true
                },
                "RestartPolicy": {
                    "Attempts": 3,
                    "Delay": 30000000000,
                    "Interval": 1800000000000,
                    "Mode": "fail"
                },
                "Services": null,
                "Spreads": null,
                "Tasks": [
                    {
                        "Affinities": null,
                        "Artifacts": null,
                        "Config": {
                            "port_map": [
                                {
                                    "http": 3000.0
                                }
                            ],
                            "image": "hub:9090/api:16"
                        },
                        "Constraints": null,
                        "DispatchPayload": null,
                        "Driver": "docker",
                        "Env": {
                            "PORT": "3000",
                            "MONGODB_HOST": "10.0.15.30"
                        },
                        "KillSignal": "",
                        "KillTimeout": 5000000000,
                        "Kind": "",
                        "Leader": false,
                        "LogConfig": {
                            "MaxFileSizeMB": 10,
                            "MaxFiles": 10
                        },
                        "Meta": null,
                        "Name": "server",
                        "Resources": {
                            "CPU": 100,
                            "Devices": null,
                            "DiskMB": 0,
                            "IOPS": 0,
                            "MemoryMB": 300,
                            "Networks": [
                                {
                                    "CIDR": "",
                                    "Device": "",
                                    "DynamicPorts": [
                                        {
                                            "Label": "http",
                                            "To": 0,
                                            "Value": 0
                                        }
                                    ],
                                    "IP": "",
                                    "MBits": 10,
                                    "Mode": "",
                                    "ReservedPorts": null
                                }
                            ]
                        },
                        "Services": [
                            {
                                "AddressMode": "auto",
                                "CanaryTags": [
                                    "canary"
                                ],
                                "CheckRestart": null,
                                "Checks": [
                                    {
                                        "AddressMode": "",
                                        "Args": null,
                                        "CheckRestart": {
                                            "Grace": 30000000000,
                                            "IgnoreWarnings": false,
                                            "Limit": 3
                                        },
                                        "Command": "",
                                        "GRPCService": "",
                                        "GRPCUseTLS": false,
                                        "Header": null,
                                        "Id": "",
                                        "InitialStatus": "",
                                        "Interval": 10000000000,
                                        "Method": "",
                                        "Name": "service: \"api\" check",
                                        "Path": "/",
                                        "PortLabel": "",
                                        "Protocol": "",
                                        "TLSSkipVerify": false,
                                        "TaskName": "",
                                        "Timeout": 2000000000,
                                        "Type": "http"
                                    }
                                ],
                                "Connect": null,
                                "Id": "",
                                "Meta": null,
                                "Name": "api",
                                "PortLabel": "http",
                                "Tags": [
                                    "api",
                                    "traefik.tags=service",
                                    "traefik.frontend.rule=Host:api.example.com"
                                ]
                            }
                        ],
                        "ShutdownDelay": 0,
                        "Templates": null,
                        "User": "",
                        "Vault": null,
                        "VolumeMounts": null
                    }
                ],
                "Update": {
                    "AutoPromote": true,
                    "AutoRevert": true,
                    "Canary": 1,
                    "HealthCheck": "task_states",
                    "HealthyDeadline": 60000000000,
                    "MaxParallel": 1,
                    "MinHealthyTime": 30000000000,
                    "ProgressDeadline": 600000000000,
                    "Stagger": 30000000000
                },
                "Volumes": null
            }
        ],
        "Type": "service",
        "Update": {
            "AutoPromote": false,
            "AutoRevert": false,
            "Canary": 0,
            "HealthCheck": "",
            "HealthyDeadline": 0,
            "MaxParallel": 1,
            "MinHealthyTime": 0,
            "ProgressDeadline": 0,
            "Stagger": 30000000000
        },
        "VaultToken": "",
        "Version": 0
    }
}
~# nomad version
Nomad v0.10.2 (0d2d6e3dc5a171c21f8f31fa117c8a765eb4fc02)

Hi again @atomlab! I was away on vacation for a couple weeks but now I wanted to circle back to you. The jobspec looks ok.

From the HTTP requests you grabbed from Consul, I can see it’s not just that job that got deregistered though: we can see API tasks for alloc 8085765b and 8e256817, but also one for the redis job w/ alloc 054371fd. Are the tasks flapping?

If you run nomad alloc status -verbose 8085765b (or any of the other alloc IDs) that’ll output some more detailed information about what Nomad thought was going on with that alloc.
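To gather that output for all three allocations seen in the capture in one go, something like the following could work. It is a sketch under assumptions: the `nomad` binary is on the PATH and can reach the cluster (for example via `NOMAD_ADDR`), and `alloc_status_cmd` and `collect` are hypothetical helper names.

```python
import shutil
import subprocess

# Short alloc IDs taken from the deregister requests in the thread.
ALLOC_IDS = ["8085765b", "8e256817", "054371fd"]


def alloc_status_cmd(alloc_id):
    """Build the CLI invocation suggested above for one allocation."""
    return ["nomad", "alloc", "status", "-verbose", alloc_id]


def collect(alloc_ids):
    """Run the command per alloc; returns {} when nomad is not installed."""
    if shutil.which("nomad") is None:
        return {}
    outputs = {}
    for alloc_id in alloc_ids:
        result = subprocess.run(
            alloc_status_cmd(alloc_id), capture_output=True, text=True
        )
        # Keep stderr when the command fails so the error is not lost.
        outputs[alloc_id] = result.stdout or result.stderr
    return outputs


if __name__ == "__main__":
    for alloc_id, text in collect(ALLOC_IDS).items():
        print(f"=== {alloc_id} ===")
        print(text)
```

The recent events listed for each alloc should show whether the tasks were actually restarted around the time of the deregistrations.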


Today I learned something: the -verbose flag for alloc status.