"failed to setup alloc:" error during Job Deployment When Nomad runs as Root?

Hey all,

I’m getting started with running standalone Nomad and Consul instances on each of my Raspberry Pis, bare-metal, and I’ve gotten to the point where I believe they’re healthy. However, when I try to run my test job on Nomad, the deployment fails with the error below.

failed to setup alloc: pre-run hook "alloc_dir" failed: Couldn't change owner/group of /opt/nomad/alloc/82f22c66-9d7a-fb69-97fe-923abd473dac/alloc to (uid: 65534, gid: 65534): chown /opt/nomad/alloc/82f22c66-9d7a-fb69-97fe-923abd473dac/alloc: operation not permitted

Initially I started out with a separate Nomad user (nomad:nomad), but since I’m mainly planning to run the Docker driver, which I saw needs root, I chown’d the /opt/nomad and /opt/consul directories recursively to root and edited my service files so that root actually runs the processes.
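
For reference, what I did was roughly along these lines (service and path names as in my setup):

sudo chown -R root:root /opt/nomad /opt/consul
# commented out the User=/Group= lines in the unit files, then:
sudo systemctl daemon-reload
sudo systemctl restart consul nomad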

I was wondering where that permission issue may lie, and/or whether there’s an artifact left over from when it was first run by the nomad user that I need to purge to get this working. Any help would be appreciated. Here are my Nomad details:

My pared-down nomad.hcl (Nomad config) file

data_dir  = "/opt/nomad/"
bind_addr = "0.0.0.0"
server {
  enabled          = true
  bootstrap_expect = 1
}
client {
  enabled = true
  servers = ["127.0.0.1:4647"]
}
datacenter = "dc1"

ports {
  http = 4646
  rpc  = 4647
  serf = 4648
}
tls {
  http = false
  rpc  = false
}
log_level = "INFO"
log_file  = "/etc/nomad.d/nomad.log"

My nomad.service (Systemd service) file

[Unit]
Description="HashiCorp Nomad - A tool for managing workloads"
Documentation=https://developer.hashicorp.com/nomad/docs
Requires=network-online.target
After=network-online.target 
ConditionFileNotEmpty=/etc/nomad.d/nomad.hcl
StartLimitIntervalSec=60
StartLimitBurst=3

[Service]
#User=nomad
#Group=nomad
ProtectSystem=true
ProtectHome=read-only
PrivateTmp=yes
Environment=
PrivateDevices=yes
SecureBits=keep-caps
AmbientCapabilities=CAP_IPC_LOCK
#Capabilities=CAP_IPC_LOCK+ep
CapabilityBoundingSet=CAP_SYSLOG CAP_IPC_LOCK
NoNewPrivileges=yes
ExecStart=nomad agent -config /etc/nomad.d/
ExecReload=/bin/kill --signal HUP $MAINPID
KillMode=process
KillSignal=SIGINT
Restart=on-abnormal
RestartSec=5
TimeoutStopSec=30
StartLimitInterval=60
#StartLimitIntervalSec=60
StartLimitBurst=3
LimitNOFILE=65536
LimitMEMLOCK=infinity

[Install]
WantedBy=multi-user.target

And lastly, my test.hcl (Nomad test jobspec):

job "test" {
    datacenters = ["dc1"]
    type = "service"
    update {
        stagger = "30s"
        max_parallel = "2"
    }

    group "example" {
        count = 1
        network {
            port "http" {
                to = 8080
            }
        }

        service {
            port = "http"
        }

        task "whoami" {
            driver = "docker"
            config {
                image = "traefik/whoami"
                ports = ["http"]
            }
            resources {
                cpu = 200
                memory = 200
            }
        }
    }
}

Hi, these look like the restrictions that would block it:

AmbientCapabilities=CAP_IPC_LOCK 
#Capabilities=CAP_IPC_LOCK+ep 
CapabilityBoundingSet=CAP_SYSLOG CAP_IPC_LOCK 
NoNewPrivileges=yes

And

ProtectSystem=true
ProtectHome=read-only
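
If you keep running the agent as root, a minimal sketch of a relaxed [Service] section could look like the one below (ExecStart and limits copied from your unit; keep whatever hardening you still want). In particular, a CapabilityBoundingSet that doesn’t include CAP_CHOWN means even a root process can’t chown the alloc directory.

[Service]
# Sketch only: hardening directives dropped so the root agent keeps full
# capabilities (including CAP_CHOWN, which the alloc_dir hook needs).
User=root
Group=root
ExecStart=nomad agent -config /etc/nomad.d/
ExecReload=/bin/kill --signal HUP $MAINPID
KillMode=process
KillSignal=SIGINT
Restart=on-abnormal
RestartSec=5
TimeoutStopSec=30
LimitNOFILE=65536
LimitMEMLOCK=infinity

Then systemctl daemon-reload and restart the service.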

Ahh, I think I found the Vault service file first, built the Consul and Nomad services from that, and kept those directives in. Thank you! Unfortunately I can’t quite test it yet, as a separate issue I thought I had fixed before has reappeared. I’m getting

Constraint ${attr.consul.version} semver >= 1.8.0 filtered 1 node

as I try to place the test job. I thought I had fixed it by removing the consul block from my Nomad config, since I kept things at the defaults, but apparently not. I’ll have to troubleshoot that some more. Thank you for your help though; I’ll report back if that does indeed do it once I sort this mess out!

That did it, thank you for your help! I was able to sort out the other issue, and the job is successfully deploying and looks like it’s running well!

A service {} block is a Consul service by default, so it pulls in Consul as a dependency. You can remove the service registration or register a Nomad service instead. Cheers.
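
For example, if you’re on Nomad 1.3 or newer, switching the group-level service block to Nomad’s native service registration would look roughly like this:

        service {
            provider = "nomad"  # register with Nomad instead of Consul
            port     = "http"
        }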