Nomad job fails with "prestart hook \"task_dir\" failed: mount: operation not permitted"

I have created a brand new cluster on Linux, and all the agents and servers are communicating well and showing their status in the UI. Consul is also running fine. However, when I submit a job I get the following errors.

Please note I am running as root, so there should not be any permission issue as such.

> 2023-02-26T18:32:46.349-0500 [ERROR] client.alloc_runner.task_runner: prestart failed: alloc_id=6a633002-88b2-3f74-f8f1-ae13592072fd task=web error="prestart hook \"task_dir\" failed: mount: operation not permitted"
> 2023-02-26T18:32:46.349-0500 [INFO]  client.alloc_runner.task_runner: not restarting task: alloc_id=6a633002-88b2-3f74-f8f1-ae13592072fd task=web reason="Error was unrecoverable"
> 2023-02-26T18:32:46.414-0500 [INFO]  client.gc: marking allocation for GC: alloc_id=6a633002-88b2-3f74-f8f1-ae13592072fd

I have tried different kinds of jobs, including Java, a simple shell script, and a plain sleep command like the one below, and they all give the same error:

job "test" {
  datacenters = ["dc1"]
  group "group1" {
    task "sleep" {
      driver = "raw_exec"
      config {
        command = "sleep"
        args    = ["infinity"]
      }
      resources {
        cpu    = 10
        memory = 10
      }
    }
  }
}
When I look at "nomad alloc status 831f9a6b" it shows the following:

ID                     = 831f9a6b-61ba-9c5a-8b5e-b8d9a172c61f
Eval ID                = a10d4f4d
Name                   = test.group1[0]
Node ID                = 15ceea08
Node Name              = <redacted>
Job ID                 = test
Job Version            = 6
Client Status          = failed
Client Description     = Failed tasks
Desired Status         = run
Desired Description    = <none>
Created                = 1m28s ago
Modified               = 1m27s ago
Deployment ID          = 6b2df708
Deployment Health      = unhealthy
Reschedule Eligibility = 32s from now

Task "sleep" is "dead"
Task Resources
CPU     Memory  Disk     Addresses
10 MHz  10 MiB  300 MiB  

Task Events:
Started At     = N/A
Finished At    = 2023-02-26T23:56:34Z
Total Restarts = 0
Last Restart   = N/A

Recent Events:
Time                       Type              Description
2023-02-26T18:56:34-05:00  Not Restarting    Error was unrecoverable
2023-02-26T18:56:34-05:00  Task hook failed  task_dir: mount: operation not permitted
2023-02-26T18:56:34-05:00  Task Setup        Building Task Directory
2023-02-26T18:56:34-05:00  Received          Task received by client

Hi @sammy676776, can you describe what host OS you are running Nomad on? And can you post the output of stat <path> for the path you have set for client.alloc_dir?

I am running on Linux. I have only set data_dir on the Nomad nodes, and I do see an alloc dir inside. Is this supposed to be set on servers or clients?

stat /xxx/xxx/nomad_hostname/alloc
  File: '/xxx/xxx/nomad_hostname/alloc'
  Size: 10        	Blocks: 0          IO Block: 4096   directory
Device: 26h/38d	Inode: 27917335686  Links: 2
Access: (0711/drwx--x--x)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2023-02-27 03:42:03.464494997 -0500
Modify: 2023-02-26 20:03:50.233770092 -0500
Change: 2023-02-26 20:03:50.233770092 -0500

The alloc_dir only exists on Client nodes - it’ll inherit the path from data_dir if not explicitly set. From the output of your stat command that directory looks fine.
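For reference, a minimal client agent configuration showing that relationship might look like this (a sketch with placeholder paths; if alloc_dir is left unset, allocations land under <data_dir>/alloc):

```hcl
# Nomad client agent configuration (illustrative paths only)
data_dir = "/opt/nomad/data"

client {
  enabled = true

  # Optional: defaults to <data_dir>/alloc when not set
  # alloc_dir = "/opt/nomad/data/alloc"
}
```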

Are you sure your Client agents are running as root? You can check with something like

ps -ef | grep 'nomad agent'

I suspect it is the underlying call to mount that is failing - the raw_exec driver otherwise doesn't do much else with regard to the filesystem.

I opened "overhaul error handling in fs_linux.go" (hashicorp/nomad#16275 on GitHub) to help debug issues like this one in the future, but at the moment I'm not sure what else to check for - different filesystems will fail the mount syscall for any number of reasons.
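One quick way to rule out a filesystem that rejects bind mounts is to check which filesystem type backs the alloc directory. A small sketch (the path is a placeholder; substitute your own data_dir):

```shell
# Path to the Nomad alloc directory; adjust to your data_dir (placeholder).
ALLOC_DIR="${ALLOC_DIR:-/opt/nomad/data/alloc}"

# Fall back to the current directory if the placeholder doesn't exist,
# so the snippet runs anywhere.
[ -d "$ALLOC_DIR" ] || ALLOC_DIR=.

# Print the filesystem type backing the directory; some types (e.g. NFS
# or FUSE mounts in certain configurations) can refuse the bind mount
# that Nomad's task_dir prestart hook performs.
stat -f -c '%T' "$ALLOC_DIR"
```

If this reports a network or FUSE-backed filesystem, moving data_dir onto a local filesystem such as ext4 or xfs is worth trying.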

Thanks @seth.hoenig
It appears the user Nomad was running as was the problem. My coworker had installed it with a different user, and a couple of directories were owned by that user. I thought root should have access to all of it, but it looks like "nobody" is the default user and it perhaps had trouble writing, although most of the directories under there were rwx. Some weird issue, but it is resolved now, running as the local user.
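For anyone hitting the same symptom: the general fix is to pick one consistent user for the Nomad client agent, make the data_dir ownership match it, and pin that user in the service unit so it can't drift. A sketch of a systemd unit fragment, with placeholder paths and assuming a standard nomad.service layout:

```ini
# /etc/systemd/system/nomad.service (fragment; paths are placeholders)
[Service]
# The task_dir prestart hook performs a mount, which generally requires
# root (or equivalent capabilities) on the client. If you run the agent
# as another user, data_dir must be owned by that same user.
User=root
Group=root
ExecStart=/usr/local/bin/nomad agent -config=/etc/nomad.d
```

After changing the unit, reload and restart with `systemctl daemon-reload && systemctl restart nomad`, then confirm the effective user with `ps -ef | grep 'nomad agent'` as suggested above.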