Nomad silently failing

TL;DR: I created a job using the raw_exec driver and tried to run it without enabling the plugin first.

I had a job that I was trying to get running. I have been running jobs of all sorts for a while now and felt sure I new what I was doing. Yet this job would just immediately complete while Nomad would report that the job was happily running.

$ nomad job run -region="sfo2" -var-file=environment-dev.tfvars -var="facility=1234" "app-test.hcl"
==> 2021-08-27T10:37:17-06:00: Monitoring evaluation "1f172652"
    2021-08-27T10:37:17-06:00: Evaluation triggered by job "app-test"
    2021-08-27T10:37:17-06:00: Evaluation status changed: "pending" -> "complete"
==> 2021-08-27T10:37:17-06:00: Evaluation "1f172652" finished with status "complete"

I would watch the Client Monitor (Level:Trace) and see no indication of trouble:

2021-08-27T16:42:03.747Z [DEBUG] nomad: memberlist: Initiating push/pull sync with: dev-sfo2-controller-1.sfo2 1.2.3.4:4648
2021-08-27T16:42:04.897Z [DEBUG] nomad: memberlist: Stream connection from=127.0.0.1:53770
2021-08-27T16:42:07.719Z [DEBUG] http: request complete: method=GET path=/v1/agent/health?type=server duration=691.69µs
2021-08-27T16:42:14.230Z [TRACE] nomad.job: job mutate results: mutator=canonicalize warnings=[] error=<nil>
2021-08-27T16:42:14.230Z [TRACE] nomad.job: job mutate results: mutator=connect warnings=[] error=<nil>
2021-08-27T16:42:14.230Z [TRACE] nomad.job: job mutate results: mutator=expose-check warnings=[] error=<nil>
2021-08-27T16:42:14.230Z [TRACE] nomad.job: job mutate results: mutator=constraints warnings=[] error=<nil>
2021-08-27T16:42:14.230Z [TRACE] nomad.job: job validate results: validator=connect warnings=[] error=<nil>
2021-08-27T16:42:14.230Z [TRACE] nomad.job: job validate results: validator=expose-check warnings=[] error=<nil>
2021-08-27T16:42:14.230Z [TRACE] nomad.job: job validate results: validator=validate warnings=[] error=<nil>
2021-08-27T16:42:14.230Z [TRACE] nomad.job: job validate results: validator=memory_oversubscription warnings=[] error=<nil>
2021-08-27T16:42:14.235Z [DEBUG] http: request complete: method=GET path=/v1/job/test-app/evaluations?index=1225 duration=44.493227741s
2021-08-27T16:42:14.236Z [DEBUG] http: request complete: method=GET path=/v1/job/test-app/evaluations?index=1225 duration=2m31.273374659s
2021-08-27T16:42:14.237Z [DEBUG] worker: dequeued evaluation: eval_id=f26246a7-b249-b3a8-d10c-59a9c9e87a3b
2021-08-27T16:42:14.237Z [TRACE] worker.system_sched.binpack: NewBinPackIterator created: eval_id=f26246a7-b249-b3a8-d10c-59a9c9e87a3b job_id=test-app namespace=default algorithm=spread
2021-08-27T16:42:14.237Z [DEBUG] worker.system_sched: reconciled current state with desired state: eval_id=f26246a7-b249-b3a8-d10c-59a9c9e87a3b job_id=test-app namespace=default place=4 update=0 migrate=0 stop=0 ignore=0 lost=0
2021-08-27T16:42:14.237Z [DEBUG] worker.system_sched: setting eval status: eval_id=f26246a7-b249-b3a8-d10c-59a9c9e87a3b job_id=test-app namespace=default status=complete
2021-08-27T16:42:14.240Z [DEBUG] worker: updated evaluation: eval="<Eval "f26246a7-b249-b3a8-d10c-59a9c9e87a3b" JobID: "test-app" Namespace: "default">"
2021-08-27T16:42:14.240Z [DEBUG] worker: ack evaluation: eval_id=f26246a7-b249-b3a8-d10c-59a9c9e87a3b
2021-08-27T16:42:14.899Z [DEBUG] nomad: memberlist: Stream connection from=127.0.0.1:53776
2021-08-27T16:42:17.721Z [DEBUG] http: request complete: method=GET path=/v1/agent/health?type=server duration=274.628µs
2021-08-27T16:42:24.900Z [DEBUG] nomad: memberlist: Stream connection from=127.0.0.1:53782
2021-08-27T16:42:27.723Z [DEBUG] http: request complete: method=GET path=/v1/agent/health?type=server duration=339.799µs

Checking the eval status did not give me any further details:

$ nomad eval status f26246a7
ID                 = f26246a7
Create Time        = 10m51s ago
Modify Time        = 10m51s ago
Status             = complete
Status Description = complete
Type               = system
TriggeredBy        = job-register
Job ID             = operations-promtail
Priority           = 95
Placement Failures = false

I even tailed the system logs: journalctl --follow

It was only when I finally remembered reading that raw_exec was disabled by default that I was able to move forward.

I think that Nomad should not be silent about such things. One entry in the logs would have made my troubleshooting much quicker.

Hi @SunSparc,

Thanks for using Nomad!

I empathize with your frustration and am very sorry to hear you had a less than stellar experience, at least around this issue. I encourage you to raise an issue on Github so that this issue can get prioritized for road mapping.

@DerekStrickland and the Nomad team

1 Like