Hello,
I’m working on getting a minimal, working proof-of-concept Nomad job off the ground. I was able to get the job running within a docker-compose testbed, but I’m now having trouble getting an example running between multiple VMs. I removed my custom job command script and replaced it with a simple ‘sleep’ example, but I still run into an error:
Error creating rpc client for executor plugin: plugin exited before we could connect.
I’m not using any custom plugins. Has anyone run into this issue, and does anyone have suggestions on how to debug further? I’m able to ping in both directions between the VMs, so basic connectivity seems fine. I also inspected open ports using lsof -i -P -n, and the necessary ports (4646-4648) appear to be in a listening state. Below is the step-by-step setup and all the log information I know how to access. Are there any other logs I can look at?
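For completeness, these are the connectivity checks I ran from each VM (the nc check is just an extra probe I added; substitute the other VM’s address as appropriate):

```shell
# Verify basic reachability between the two VMs (run from each side).
ping -c 3 192.168.94.180

# Confirm the Nomad ports are listening (HTTP 4646, RPC 4647, Serf 4648).
lsof -i -P -n | grep -E ':464[678]'

# Extra check that nothing is filtering the RPC port between the hosts.
nc -zv 192.168.94.180 4647
```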
VMs: Ubuntu 18.04, 16GB memory
Nomad: v0.8.7
I’m running both agents in the foreground (non-daemonized). On the server I run:
nomad agent -server -config=/home/user/server.d -bootstrap-expect=1 -bind=0.0.0.0
/home/user/server.d contains two files: nomad.hcl and server.hcl:
nomad.hcl:
datacenter = "dc1"
data_dir = "/tmp/nomad"
server.hcl:
server {
  enabled = true

  # Encrypt gossip communication.
  encrypt = "I9gHdpTTOsgW7qgHa5WXew=="
}
On the client, I run:
nomad agent -client -config=/home/user/client.d -bind=0.0.0.0
where /home/user/client.d contains the same nomad.hcl as the server as well as the following client.hcl file:
client {
  enabled = true

  options = {
    "driver.raw_exec.enable"     = "1"
    "driver.raw_exec.no_cgroups" = "1"
  }
}
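As a sanity check that the raw_exec driver was actually fingerprinted on the client, I ran the following from the server (the node ID is from my cluster; substitute your own):

```shell
# List client nodes to get the node ID.
nomad node status

# Show the node's fingerprinted attributes; driver.raw_exec should be "1".
nomad node status -verbose f8812a28 | grep raw_exec
```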
Server log on startup:
==> WARNING: Bootstrap mode enabled! Potentially unsafe operation.
==> Loaded configuration from /home/user/nomad.hcl, /home/user/server.d/server.hcl
==> Starting Nomad agent…
==> Nomad agent configuration:
Advertise Addrs: HTTP: 192.168.94.180:4646; RPC: 192.168.94.180:4647; Serf: 192.168.94.180:4648
Bind Addrs: HTTP: 0.0.0.0:4646; RPC: 0.0.0.0:4647; Serf: 0.0.0.0:4648
Client: false
Log Level: INFO
Region: global (DC: dc1)
Server: true
Version: 0.8.7
==> Nomad agent started! Log data will stream in below:
2022/02/22 08:46:15 [INFO] raft: Initial configuration (index=1): [{Suffrage:Voter ID:192.168.94.180:4647 Address:192.168.94.180:4647}]
2022/02/22 08:46:15 [INFO] raft: Node at 192.168.94.180:4647 [Follower] entering Follower state (Leader: "")
2022/02/22 08:46:15 [INFO] serf: EventMemberJoin: ubuntu.global 192.168.94.180
2022/02/22 08:46:15.055092 [INFO] nomad: starting 8 scheduling worker(s) for [service batch system _core]
2022/02/22 08:46:15.057395 [ERR] consul: error looking up Nomad servers: server.nomad: unable to query Consul datacenters: Get http://127.0.0.1:8500/v1/catalog/datacenters: dial tcp 127.0.0.1:8500: connect: connection refused
2022/02/22 08:46:15.058341 [INFO] nomad: adding server ubuntu.global (Addr: 192.168.94.180:4647) (DC: dc1)
2022/02/22 08:46:16 [WARN] raft: Heartbeat timeout from "" reached, starting election
2022/02/22 08:46:16 [INFO] raft: Node at 192.168.94.180:4647 [Candidate] entering Candidate state in term 2
2022/02/22 08:46:16 [INFO] raft: Election won. Tally: 1
2022/02/22 08:46:16 [INFO] raft: Node at 192.168.94.180:4647 [Leader] entering Leader state
2022/02/22 08:46:16.423875 [INFO] nomad: cluster leadership acquired
On the client, I point it at the server with:
nomad node config -update-servers 192.168.94.180
And I receive the following log message:
2022/02/22 08:49:51.151119 [INFO] client: node registration complete
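To double-check that registration actually took, I ran the following on the server, and the client shows up as ready:

```shell
# Confirm the server is up and holds leadership.
nomad server members

# Confirm the client node has registered and is in the "ready" state.
nomad node status
```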
I have the following job file:
job "testjob1" {
  datacenters = ["dc1"]

  update {
    stagger      = "5s"
    max_parallel = 1
  }

  group "testgroup" {
    count = 1

    task "testtask" {
      driver = "raw_exec"

      config {
        command = "/bin/sleep"
        args    = ["600"]
      }
    }
  }
}
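The job file passes validation, and a dry-run placement looks clean too:

```shell
# Validate the job file syntax before submitting.
nomad job validate /path/to/job/job.nomad

# Dry-run the scheduler to see the proposed placement without submitting.
nomad job plan /path/to/job/job.nomad
```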
Executing the job from the server with:
nomad job run /path/to/job/job.nomad
Results in:
==> Monitoring evaluation "c87c6010"
Evaluation triggered by job "testjob1"
Evaluation within deployment: "e63e4f2f"
Allocation "931c55b0" created: node "f8812a28", group "testgroup"
Evaluation status changed: "pending" -> "complete"
==> Evaluation "c87c6010" finished with status "complete"
Checking status:
$ nomad job status
ID Type Priority Status Submit Date
testjob1 service 50 pending 2022-02-22T08:54:59-08:00
$ nomad job status testjob1
ID = testjob1
Name = testjob1
Submit Date = 2022-02-22T08:54:59-08:00
Type = service
Priority = 50
Datacenters = dc1
Status = pending
Periodic = false
Parameterized = false
Summary
Task Group Queued Starting Running Failed Complete Lost
testgroup 0 0 0 2 0 0
Future Rescheduling Attempts
Task Group Eval ID Eval Time
testgroup edbb4812 40s from now
Latest Deployment
ID = e63e4f2f
Status = running
Description = Deployment is running
Deployed
Task Group Desired Placed Healthy Unhealthy Progress Deadline
testgroup 1 2 0 2 2022-02-22T09:04:59-08:00
Allocations
ID Node ID Task Group Version Desired Status Created Modified
8afd80e1 f8812a28 testgroup 0 run failed 21s ago 19s ago
931c55b0 f8812a28 testgroup 0 run failed 51s ago 49s ago
Looking at the allocation:
$ nomad alloc status 8afd
ID = 8afd80e1
Eval ID = cf71fe9c
Name = testjob1.testgroup[0]
Node ID = f8812a28
Job ID = testjob1
Job Version = 0
Client Status = failed
Client Description = <none>
Desired Status = run
Desired Description = <none>
Created = 1m20s ago
Modified = 1m18s ago
Deployment ID = e63e4f2f
Deployment Health = unhealthy
Replacement Alloc ID = 2cc4adb6
Task "testtask" is "dead"
Task Resources
CPU Memory Disk IOPS Addresses
100 MHz 300 MiB 300 MiB 0
Task Events:
Started At = N/A
Finished At = 2022-02-22T16:55:29Z
Total Restarts = 0
Last Restart = N/A
Recent Events:
Time Type Description
2022-02-22T08:55:29-08:00 Alloc Unhealthy Unhealthy because of failed task
2022-02-22T08:55:29-08:00 Not Restarting Error was unrecoverable
2022-02-22T08:55:29-08:00 Driver Failure failed to start task "testtask" for alloc "8afd80e1-a12a-29d0-5f2b-f76394d7c94b": error creating rpc client for executor plugin: plugin exited before we could connect
2022-02-22T08:55:29-08:00 Task Setup Building Task Directory
2022-02-22T08:55:29-08:00 Received Task received by client
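One thing I plan to try next is restarting the client with debug logging and then inspecting the failed allocation’s directory, in case the executor wrote anything there before exiting (I haven’t confirmed the executor failure surfaces this way):

```shell
# Restart the client agent with verbose logging to capture executor detail.
nomad agent -client -config=/home/user/client.d -bind=0.0.0.0 -log-level=DEBUG

# After the next failed allocation, inspect its allocation directory.
nomad alloc fs 8afd80e1
```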