We have a job with 5 tasks:
- Prestart task: setup-task (working)
- Main task (working)
- Post start task (working)
- Post stop task (working)
- Post stop task: teardown-task (problematic)
All these tasks use template blocks that refer to the job’s corresponding Nomad variable, e.g., nomad/jobs/the-job-name.
Example job:
job "the-job-name" {
  group "the-job-group" {
    task "setup-task" {
      lifecycle {
        hook    = "prestart"
        sidecar = false
      }

      driver = "raw_exec"

      template {
        change_mode          = "noop" # this is a prestart task
        error_on_missing_key = true
        destination          = "$${NOMAD_SECRETS_DIR}/.setup"
        env                  = true
        data                 = <<EOT
{{ with nomadVar (printf "nomad/jobs/%v" (env "NOMAD_JOB_NAME")) }}
KEY1 = {{ .key1 }}
KEY2 = {{ .key2 }}
{{- end }}
EOT
      }

      config {
        command = "/path/to/setup/script.sh"
        args = [
          KEY1,
          KEY2
        ]
      }
    }

    // not showing all the tasks
    // they all have template blocks similar to the ones above and below

    task "teardown-task" {
      lifecycle {
        hook    = "poststop"
        sidecar = false
      }

      driver = "raw_exec"

      template {
        change_mode          = "noop" # this is a poststop task
        error_on_missing_key = true
        destination          = "$${NOMAD_SECRETS_DIR}/.teardown"
        env                  = true
        data                 = <<EOT
{{ with nomadVar (printf "nomad/jobs/%v" (env "NOMAD_JOB_NAME")) }}
KEY1 = {{ .key1 }}
KEY2 = {{ .key2 }}
{{- end }}
EOT
      }

      config {
        command = "/path/to/teardown/script.sh"
        args = [
          KEY1,
          KEY2
        ]
      }
    }
  }
}
The Nomad job runs as expected except for the last poststop task, teardown-task. When it is supposed to execute, it eventually results in a timeout after a couple of minutes (±12 attempts):
Nomad logs:
2023-10-26T08:12:18.266Z [ERROR] client.rpc: error performing RPC to server: error="rpc error: allocation is terminal" rpc=ACL.WhoAmI server=10.0.2.5:4647
2023-10-26T08:12:18.267Z [ERROR] client.rpc: error performing RPC to server which is not safe to automatically retry: error="rpc error: allocation is terminal" rpc=ACL.WhoAmI server=10.0.2.5:4647
2023-10-26T08:12:18.267Z [ERROR] http: error authenticating built API request: error="rpc error: allocation is terminal" url="/v1/var/nomad/jobs/the-job-name?namespace=default&stale=&wait=60000ms" method=GET
2023-10-26T08:12:18.268Z [WARN] agent: (view) nomad.var.block(nomad/jobs/the-job-name@default.global): Unexpected response code: 500 (Server error authenticating request) (retry attempt 11 after "1m0s")
2023-10-26T08:13:18.276Z [ERROR] client.rpc: error performing RPC to server: error="rpc error: allocation is terminal" rpc=ACL.WhoAmI server=10.0.2.3:4647
2023-10-26T08:13:18.276Z [ERROR] client.rpc: error performing RPC to server which is not safe to automatically retry: error="rpc error: allocation is terminal" rpc=ACL.WhoAmI server=10.0.2.3:4647
2023-10-26T08:13:18.277Z [ERROR] http: error authenticating built API request: error="rpc error: allocation is terminal" url="/v1/var/nomad/jobs/the-job-name?namespace=default&stale=&wait=60000ms" method=GET
2023-10-26T08:13:18.278Z [WARN] agent: (view) nomad.var.block(nomad/jobs/the-job-name@default.global): Unexpected response code: 500 (Server error authenticating request) (retry attempt 12 after "1m0s")
2023-10-26T08:14:18.284Z [ERROR] client.rpc: error performing RPC to server: error="rpc error: allocation is terminal" rpc=ACL.WhoAmI server=10.0.2.5:4647
2023-10-26T08:14:18.285Z [ERROR] client.rpc: error performing RPC to server which is not safe to automatically retry: error="rpc error: allocation is terminal" rpc=ACL.WhoAmI server=10.0.2.5:4647
2023-10-26T08:14:18.285Z [ERROR] http: error authenticating built API request: error="rpc error: allocation is terminal" url="/v1/var/nomad/jobs/the-job-name?namespace=default&stale=&wait=60000ms" method=GET
2023-10-26T08:14:18.286Z [ERROR] agent: (view) nomad.var.block(nomad/jobs/the-job-name@default.global): Unexpected response code: 500 (Server error authenticating request) (exceeded maximum retries)
2023-10-26T08:14:18.287Z [ERROR] agent: (runner) watcher reported error: nomad.var.block(nomad/jobs/the-job-name@default.global): Unexpected response code: 500 (Server error authenticating request)
2023-10-26T08:14:18.287Z [INFO] client.alloc_runner.task_runner: Task event: alloc_id=30764d6b-18b0-270f-fe97-33a301f9198c task=teardown-task type=Killing msg="Template failed: nomad.var.block(nomad/jobs/the-job-name@default.global): Unexpected response code: 500 (Server error authenticating request)" failed=true
2023-10-26T08:14:18.290Z [INFO] client.gc: marking allocation for GC: alloc_id=30764d6b-18b0-270f-fe97-33a301f9198c
2023-10-26T08:14:22.292Z [WARN] client.alloc_runner.task_runner.task_hook.logmon.nomad: timed out waiting for read-side of process output pipe to close: alloc_id=30764d6b-18b0-270f-fe97-33a301f9198c task=teardown-task @module=logmon timestamp=2023-10-26T08:14:22.291Z
2023-10-26T08:14:22.293Z [WARN] client.alloc_runner.task_runner.task_hook.logmon.nomad: timed out waiting for read-side of process output pipe to close: alloc_id=30764d6b-18b0-270f-fe97-33a301f9198c task=teardown-task @module=logmon timestamp=2023-10-26T08:14:22.292Z
2023-10-26T08:14:22.296Z [INFO] client.alloc_runner.task_runner.task_hook.logmon: plugin process exited: alloc_id=30764d6b-18b0-270f-fe97-33a301f9198c task=teardown-task path=/usr/bin/nomad pid=2570689
2023-10-26T08:14:22.297Z [INFO] agent: (runner) stopping
We also see this in the web UI: [screenshot]
The main issues seem to be:
- Error performing RPC to server: allocation is terminal
- (view) nomad.var.block(nomad/jobs/the-job-name@default.global): Unexpected response code: 500 (Server error authenticating request)
What do these mean and what can I do to resolve them?
I can assure you that the Nomad variable that it’s complaining about in the screenshot (whose name corresponds with the job name as per the example above) DOES exist.
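For reference, this is roughly the shape of the variable that the templates expect, written as a variable specification file. It is only a sketch: the path and namespace are taken from the logs above, the item names key1 and key2 from the template blocks, and the values are placeholders rather than the real contents.

```hcl
# Sketch of a Nomad variable specification (values are placeholders).
# Namespace and path match the request seen in the logs:
#   /v1/var/nomad/jobs/the-job-name?namespace=default
namespace = "default"
path      = "nomad/jobs/the-job-name"

# The item names match the keys read by the templates
# ({{ .key1 }} and {{ .key2 }}).
items {
  key1 = "placeholder-value-1"
  key2 = "placeholder-value-2"
}
```

Running nomad var get -namespace=default nomad/jobs/the-job-name with a suitably privileged token is one way to confirm that a variable exists at that path.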