Debugging Job Driver Failure

Hello,

I’m working on getting a minimal, working proof-of-concept Nomad job off the ground. I was able to get the job running within a docker-compose testbed, but I’m now having trouble getting an example running between multiple VMs. I removed my custom job command script and replaced it with a simple ‘sleep’ example, but I still run into an error:

Error creating rpc client for executor plugin: plugin exited before we could connect.

I’m not using any custom plugins. I’m wondering if anyone has run into this issue and has suggestions on how to debug further. I’m able to ping in both directions between the VMs, so the connection seems fine. I also inspected open ports using lsof -i -P -n, and the necessary ports (4646-4648) are in a listening state. Here’s the step-by-step setup and the log information that I know how to access. Are there any other logs I can look at?
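
For reference, these are roughly the checks I ran on each VM (the grep filter is just for illustration):

# From the client, confirm the server VM is reachable (and the reverse from the server)
ping -c 3 192.168.94.180

# Show listening sockets without name/port resolution; 4646 (HTTP), 4647 (RPC)
# and 4648 (Serf) all show up as LISTEN
lsof -i -P -n | grep -E ':(4646|4647|4648)'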

VMs: Ubuntu 18.04, 16 GB memory
Nomad: v0.8.7

I’m starting both agents in the foreground (not daemonized). On the server I run:

nomad agent -server -config=/home/user/server.d -bootstrap-expect=1 -bind=0.0.0.0

/home/user/server.d contains two files: nomad.hcl and server.hcl:

nomad.hcl:

datacenter = "dc1"
data_dir = "/tmp/nomad"

server.hcl:

server {
    enabled = true

    # encrypt gossip communication. 
    encrypt = "I9gHdpTTOsgW7qgHa5WXew=="
}

On the client, I run:

nomad agent -client -config=/home/user/client.d -bind=0.0.0.0

where /home/user/client.d contains the same nomad.hcl as the server, along with the following client.hcl file:

client {
    enabled = true
    options = {
        "driver.raw_exec.enable" = "1"
        "driver.raw_exec.no_cgroups" = "1"
    }
}

Server log on startup:

==> WARNING: Bootstrap mode enabled! Potentially unsafe operation.
==> Loaded configuration from /home/user/nomad.hcl, /home/user/server.d/server.hcl
==> Starting Nomad agent…
==> Nomad agent configuration:

   Advertise Addrs: HTTP: 192.168.94.180:4646; RPC: 192.168.94.180:4647; Serf: 192.168.94.180:4648
        Bind Addrs: HTTP: 0.0.0.0:4646; RPC: 0.0.0.0:4647; Serf: 0.0.0.0:4648
            Client: false
         Log Level: INFO
            Region: global (DC: dc1)
            Server: true
           Version: 0.8.7

==> Nomad agent started! Log data will stream in below:

2022/02/22 08:46:15 [INFO] raft: Initial configuration (index=1): [{Suffrage:Voter ID:192.168.94.180:4647 Address:192.168.94.180:4647}]
2022/02/22 08:46:15 [INFO] raft: Node at 192.168.94.180:4647 [Follower] entering Follower state (Leader: "")
2022/02/22 08:46:15 [INFO] serf: EventMemberJoin: ubuntu.global 192.168.94.180
2022/02/22 08:46:15.055092 [INFO] nomad: starting 8 scheduling worker(s) for [service batch system _core]
2022/02/22 08:46:15.057395 [ERR] consul: error looking up Nomad servers: server.nomad: unable to query Consul datacenters: Get http://127.0.0.1:8500/v1/catalog/datacenters: dial tcp 127.0.0.1:8500: connect: connection refused
2022/02/22 08:46:15.058341 [INFO] nomad: adding server ubuntu.global (Addr: 192.168.94.180:4647) (DC: dc1)
2022/02/22 08:46:16 [WARN] raft: Heartbeat timeout from "" reached, starting election
2022/02/22 08:46:16 [INFO] raft: Node at 192.168.94.180:4647 [Candidate] entering Candidate state in term 2
2022/02/22 08:46:16 [INFO] raft: Election won. Tally: 1
2022/02/22 08:46:16 [INFO] raft: Node at 192.168.94.180:4647 [Leader] entering Leader state
2022/02/22 08:46:16.423875 [INFO] nomad: cluster leadership acquired

On the client, I connect to the server with:

nomad node config -update-servers 192.168.94.180

And I receive the following log message:

2022/02/22 08:49:51.151119 [INFO] client: node registration complete
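
As a sanity check on the server side, I believe something like the following should list both the server member and the newly registered client node (I haven’t pasted that output here, and the exact columns may differ on 0.8.7):

# Run on the server
nomad server members    # gossip members of the server cluster
nomad node status       # client nodes registered with the servers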

I have the following job file:

job "testjob1" {
    datacenters = ["dc1"]
    update {
        stagger = "5s"
        max_parallel = 1
    }
    group "testgroup" {
        count = 1
        task "testtask" {
            driver = "raw_exec"

            config {
                command = "/bin/sleep"
                args    = ["600"]
            }
        }
    }
}

I then run the job from the server using:

nomad job run /path/to/job/job.nomad

Results in:

==> Monitoring evaluation "c87c6010"
    Evaluation triggered by job "testjob1"
    Evaluation within deployment: "e63e4f2f"
    Allocation "931c55b0" created: node "f8812a28", group "testgroup"
    Evaluation status changed: "pending" -> "complete"
==> Evaluation "c87c6010" finished with status "complete"

Checking status:

$ nomad job status

ID        Type     Priority  Status   Submit Date
testjob1  service  50        pending  2022-02-22T08:54:59-08:00

$ nomad job status testjob1

ID            = testjob1
Name          = testjob1
Submit Date   = 2022-02-22T08:54:59-08:00
Type          = service
Priority      = 50
Datacenters   = dc1
Status        = pending
Periodic      = false
Parameterized = false

Summary
Task Group  Queued  Starting  Running  Failed  Complete  Lost
testgroup   0       0         0        2       0         0

Future Rescheduling Attempts
Task Group  Eval ID   Eval Time
testgroup   edbb4812  40s from now

Latest Deployment
ID          = e63e4f2f
Status      = running
Description = Deployment is running

Deployed
Task Group  Desired  Placed  Healthy  Unhealthy  Progress Deadline
testgroup   1        2       0        2          2022-02-22T09:04:59-08:00

Allocations
ID        Node ID   Task Group  Version  Desired  Status  Created  Modified
8afd80e1  f8812a28  testgroup   0        run      failed  21s ago  19s ago
931c55b0  f8812a28  testgroup   0        run      failed  51s ago  49s ago

Looking at the allocation:

$ nomad alloc status 8afd

ID                   = 8afd80e1
Eval ID              = cf71fe9c
Name                 = testjob1.testgroup[0]
Node ID              = f8812a28
Job ID               = testjob1
Job Version          = 0
Client Status        = failed
Client Description   = <none>
Desired Status       = run
Desired Description  = <none>
Created              = 1m20s ago
Modified             = 1m18s ago
Deployment ID        = e63e4f2f
Deployment Health    = unhealthy
Replacement Alloc ID = 2cc4adb6

Task "testtask" is "dead"
Task Resources
CPU      Memory   Disk     IOPS  Addresses
100 MHz  300 MiB  300 MiB  0     

Task Events:
Started At     = N/A
Finished At    = 2022-02-22T16:55:29Z
Total Restarts = 0
Last Restart   = N/A

Recent Events:
Time                       Type             Description
2022-02-22T08:55:29-08:00  Alloc Unhealthy  Unhealthy because of failed task
2022-02-22T08:55:29-08:00  Not Restarting   Error was unrecoverable
2022-02-22T08:55:29-08:00  Driver Failure   failed to start task "testtask" for alloc "8afd80e1-a12a-29d0-5f2b-f76394d7c94b": error creating rpc client for executor plugin: plugin exited before we could connect
2022-02-22T08:55:29-08:00  Task Setup       Building Task Directory
2022-02-22T08:55:29-08:00  Received         Task received by client
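
Coming back to my question about additional logs: if it would help, I’m happy to re-run the client in the foreground with debug logging and grab the verbose allocation status, along the lines of:

# Restart the client agent with more verbose logging
nomad agent -client -config=/home/user/client.d -bind=0.0.0.0 -log-level=DEBUG

# Then re-run the job and pull the full event/status detail for the failed allocation
nomad alloc status -verbose 8afd

But I’m not sure what to look for in that output, so any pointers would be appreciated.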