Job with 3 native tasks fails on allocation, cannot get logs to troubleshoot

Hello Nomad users,

I have the following job definition:

job "native_benchmarks" {
  datacenters = ["dc1"]
  priority = 100
  type = "batch"
  constraint {
    attribute = "${attr.unique.hostname}"
    value = "myhost.company.com"
  }
  group "benchmarks" {
    task "multi_coremark" {
      driver = "exec"
      config {
        command = "/opt/coremark/multi_coremark.sh"
        no_cgroups = false
      }
      logs {
        max_files     = 1
        max_file_size = 10
      }
      resources {
        memory = 2000
      }
    }
    task "npb" {
      driver = "exec"
      config {
        command = "/opt/NPB3.0/NPB3.0-JAV/all_tests.sh"
        no_cgroups = false
      }
      logs {
        max_files     = 1
        max_file_size = 10
      }
      resources {
        memory = 3000
      }
    }
    task "ramsmp" {
      driver = "exec"
      config {
        command = "/opt/ramspeed/ramsmp_batch.sh"
        no_cgroups = false
      }
      logs {
        max_files     = 1
        max_file_size = 10
      }
      resources {
        memory = 2000
      }
    }
  }
}

The planning and running phases work but eventually the job fails:

[user@master nomad]$ nomad status native_benchmarks
ID            = native_benchmarks
Name          = native_benchmarks
Submit Date   = 2020-10-23T15:03:54-04:00
Type          = service
Priority      = 100
Datacenters   = dc1
Namespace     = default
Status        = dead
Periodic      = false
Parameterized = false

Summary
Task Group  Queued  Starting  Running  Failed  Complete  Lost
benchmarks  0       0         0        3       1         0

Latest Deployment
ID          = c23def65
Status      = failed
Description = Failed due to progress deadline

Deployed
Task Group  Desired  Placed  Healthy  Unhealthy  Progress Deadline
benchmarks  1        4       0        4          2020-10-23T15:13:54-04:00

Allocations
ID        Node ID   Task Group  Version  Desired  Status    Created     Modified
8f83ae7e  3ffa908d  benchmarks  0        run      failed    26m4s ago   23m19s ago
5e82f114  3ffa908d  benchmarks  0        stop     failed    29m44s ago  26m4s ago
e3cf560b  3ffa908d  benchmarks  0        stop     complete  30m45s ago  26m29s ago
d716cec3  3ffa908d  benchmarks  0        stop     failed    34m44s ago  30m45s ago

I was trying to get logs from any of the 3 tasks defined inside the group, but I cannot retrieve them (I assume it's because none of the tasks managed to run).
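For anyone hitting the same wall: even for a failed allocation you can usually inspect the task events and any captured output with the CLI. A rough sketch, using the allocation ID 8f83ae7e from the status output below (substitute your own; logs only exist if the task actually started):

```shell
# Show task events and driver errors for the failed allocation
nomad alloc status 8f83ae7e

# Fetch one task's stderr; use -stdout for stdout, -f to follow
nomad alloc logs -stderr 8f83ae7e multi_coremark
```

If the driver failed before the task ever launched, `nomad alloc status` (and the client's agent logs) is where the real error shows up, not `nomad alloc logs`.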

The nodes are there and look healthy (I have 2 nodes, besides the controller):

[user@master nomad]$ nomad node status
ID        DC   Name                                 Class   Drain  Eligibility  Status
3ffa908d  dc1  myhost.company.com  <none>  false  eligible     ready
98a815f7  dc1  myhost2.company.com  <none>  false  eligible     ready

I can run the commands specified in each task manually as root on myhost (the node-selection constraint works in Nomad too).

Any help is appreciated, I'm pretty new to Nomad.


So I found that running journalctl on the node where the task was allocated worked well and showed me the error (I run Nomad under systemd):
journalctl -xu nomad --follow

So I changed the HCL job file to use the raw_exec driver and removed the cgroup restrictions:

job "native_benchmarks" {
  datacenters = ["dc1"]
  priority = 100
  type = "batch"
  constraint {
    attribute = "${attr.unique.hostname}"
    value = "myhost.company.com"
  }
  group "benchmarks" {
    task "multi_coremark" {
      driver = "raw_exec"
      config {
        command = "/opt/coremark/multi_coremark.sh"
      }
      logs {
        max_files     = 1
        max_file_size = 10
      }
      resources {
        memory = 2000
      }
    }
    task "npb" {
      driver = "raw_exec"
      config {
        command = "/opt/NPB3.0/NPB3.0-JAV/all_tests.sh"
      }
      logs {
        max_files     = 1
        max_file_size = 10
      }
      resources {
        memory = 3000
      }
    }
    task "ramsmp" {
      driver = "raw_exec"
      config {
        command = "/opt/ramspeed/ramsmp_batch.sh"
      }
      logs {
        max_files     = 1
        max_file_size = 10
      }
      resources {
        memory = 2000
      }
    }
  }
}
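One caveat for anyone copying this workaround: raw_exec runs commands as the Nomad agent's user with no isolation, so it is disabled by default. It has to be enabled in the client's agent configuration, roughly like this (a minimal sketch; the config file path depends on your installation):

```hcl
# Client agent configuration: enable the raw_exec task driver.
# Only do this on clients where unisolated execution is acceptable.
plugin "raw_exec" {
  config {
    enabled = true
  }
}
```

After changing the agent configuration, restart the Nomad client for the plugin setting to take effect.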

Anyway, I'll take this as a fix :slight_smile: