Having trouble running jobs on a cluster

Hey guys,

I am running on Ubuntu 20.04
Nomad v1.0.4

just created a simple local cluster,
3 servers and 3 clients.

output of command: nomad server members
Name            Address         Port  Status  Leader  Protocol  Build  Datacenter  Region
nomadm1.global  192.168.14.151  4648  alive   false   2         1.0.4  dc1         global
nomadm2.global  192.168.14.152  4648  alive   true    2         1.0.4  dc1         global
nomadm3.global  192.168.14.153  4648  alive   false   2         1.0.4  dc1         global

cat /etc/nomad.d/nomad.hcl (on a client node)
data_dir = "/opt/nomad/data"
bind_addr = "192.168.14.155"

client {
  enabled = true
  servers = ["192.168.14.151:4647","192.168.14.152:4647","192.168.14.153:4647"]
}

cat /etc/nomad.d/nomad.hcl (on a server node, each server has the other servers in 'retry_join')
data_dir = "/opt/nomad/data"
bind_addr = "192.168.14.152"

server {
  enabled = true
  bootstrap_expect = 3  
  server_join {
    retry_join = ["192.168.14.151:4648","192.168.14.153:4648"]
  }
}

I used bind_addr because when trying to join nodes without it, they would join via the Docker bridge interface.
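
(Side note: bind_addr also accepts a go-sockaddr template, so something like the line below should pin the agent to a specific interface without hard-coding an IP per node; "eth0" here is just a placeholder for the actual private interface name:)

bind_addr = "{{ GetInterfaceIP \"eth0\" }}"   # eth0 is an assumption; use the host's real interface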

Now I can see all nodes in the management interface.
Trying to run any job results in this error:
Failed to start container 3db72093caeb1b654e5d3d543b77edc113700e733a75e873f24ce33dd038fcf5: API error (500): error while creating mount source path '/opt/nomad/data/alloc/ebd04c8d-02b5-1006-75ff-7611cb38a903/alloc': mkdir /opt/nomad: read-only file system

My goal is to create an offline cluster in which I use local images for my jobs, so I have another web server that I can point the artifact source at.

Currently my job HCL looks like this:

job "test1" {
  # Run the tasks in this job in the "dc1" datacenter.
  datacenters = ["dc1"]

  # Run this job as a "service" type. Each job type has different
  # properties. See the documentation below for more examples.
  type = "service"

  # Specify this job to have rolling updates, two-at-a-time, with
  # 30 second intervals.
  update {
    stagger      = "30s"
    max_parallel = 2
  }


  # A group defines a series of tasks that should be co-located
  # on the same client (host). All tasks within a group will be
  # placed on the same host.
  group "webs" {
    # Specify the number of these tasks we want.
    count = 3

    network {
      # This requests a dynamic port named "http". This will
      # be something like "46283", but we refer to it via the
      # label "http".
      port "http" {}
    }

    # The service block tells Nomad how to register this service
    # with Consul for service discovery and monitoring.
    service {
      # This tells Consul to monitor the service on the port
      # labelled "http". Since Nomad allocates high dynamic port
      # numbers, we use labels to refer to them.
      port = "http"

      check {
        type     = "http"
        path     = "/health"
        interval = "10s"
        timeout  = "2s"
      }
    }

    # Create an individual task (unit of work). This particular
    # task utilizes a Docker container to front a web application.
    task "frontend" {
      # Specify the driver to be "docker". Nomad supports
      # multiple drivers.
      driver = "docker"
      
      artifact {
        source = "http://192.168.14.250/webappv1.0.1.tar"
      }

      # Configuration is specific to each driver.
      config {
        load  = "webappv1.0.1.tar"
        image = "webappv1.0.1"
      }

      resources {
        cpu    = 500 # MHz
        memory = 128 # MB
      }
    }
  }
}

Any help would be greatly appreciated!

  • As which user are you running Nomad on your client nodes?
  • Who is the owner of /opt/nomad/data?

I have a user called nomad which is the owner of /opt/nomad/data.

I added the line
User=root
to /etc/systemd/system/nomad.service,
restarted the service, and made sure it runs as root
on all clients,

then changed the owner of /opt/nomad to root:
sudo chown -R root /opt/nomad
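
(Roughly what I changed in the unit file, in case it matters, followed by a reload and restart:)

[Service]
User=root

sudo systemctl daemon-reload
sudo systemctl restart nomad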

tried again but have the same issue

edit:

Sorry, it was a job configuration issue!
I can now run jobs smoothly.

Thanks for your help!

Glad to hear it’s working now @fisher.shai 😄

Moving forward, make sure your Nomad clients are running as root. This is needed so they are able to manage things like chroot environments etc.

Your servers can run as non-root users; they only need write access to the data_dir.
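
For example, on a server-only node something like this is enough (adapted from the deployment guide, assuming the agent runs as a dedicated nomad user and data_dir is /opt/nomad/data):

sudo useradd --system --home /etc/nomad.d --shell /bin/false nomad   # skip if the user already exists
sudo mkdir -p /opt/nomad/data
sudo chown -R nomad:nomad /opt/nomad/data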

For more information, please check our deployment guide and security model.


Could you describe the issue with the job config? I seem to be getting this error now.


@lgfa29 Hello!

We’ve been hitting this issue since an unplanned shutdown of the machines. We have upgraded the instances to the latest version and also ensured the agent is running as root, but we still have the issue.

    2022-10-07T14:24:58.276Z [DEBUG] client.driver_mgr.docker: allocated static port: driver=docker task_name=server ip=10.XXX.XX.XXX port=25883 label=http
    2022-10-07T14:24:58.276Z [DEBUG] client.driver_mgr.docker: exposed port: driver=docker task_name=server port=25883 label=http
    2022-10-07T14:24:58.276Z [DEBUG] client.driver_mgr.docker: applied labels on the container: driver=docker task_name=server labels=map[com.hashicorp.nomad.alloc_id:31703f99-21a3-ab90-0806-f8c5d02afc26]
    2022-10-07T14:24:58.276Z [DEBUG] client.driver_mgr.docker: setting container name: driver=docker task_name=server container_name=server-31703f99-21a3-ab90-0806-f8c5d02afc26
    2022-10-07T14:24:58.315Z [INFO]  client.driver_mgr.docker: created container: driver=docker container_id=6af7138d2caff52e63eef6dc1ee34b083d0b74255dbd0537d8d3b0e1f9a28f61
    2022-10-07T14:24:58.520Z [DEBUG] client.driver_mgr.docker: failed to start container: driver=docker container_id=6af7138d2caff52e63eef6dc1ee34b083d0b74255dbd0537d8d3b0e1f9a28f61 attempt=1 error="API error (500): error while creating mount source path '/opt/nomad/data/alloc/31703f99-21a3-ab90-0806-f8c5d02afc26/alloc': mkdir /opt/nomad: read-only file system"
    2022-10-07T14:24:58.824Z [DEBUG] client.driver_mgr.docker: failed to start container: driver=docker container_id=6af7138d2caff52e63eef6dc1ee34b083d0b74255dbd0537d8d3b0e1f9a28f61 attempt=2 error="API error (500): error while creating mount source path '/opt/nomad/data/alloc/31703f99-21a3-ab90-0806-f8c5d02afc26/alloc': mkdir /opt/nomad: read-only file system"
    2022-10-07T14:24:59.744Z [DEBUG] client.driver_mgr.docker: failed to start container: driver=docker container_id=6af7138d2caff52e63eef6dc1ee34b083d0b74255dbd0537d8d3b0e1f9a28f61 attempt=3 error="API error (500): error while creating mount source path '/opt/nomad/data/alloc/31703f99-21a3-ab90-0806-f8c5d02afc26/alloc': mkdir /opt/nomad: read-only file system"
    2022-10-07T14:25:03.073Z [DEBUG] client.driver_mgr.docker: failed to start container: driver=docker container_id=6af7138d2caff52e63eef6dc1ee34b083d0b74255dbd0537d8d3b0e1f9a28f61 attempt=4 error="API error (500): error while creating mount source path '/opt/nomad/data/alloc/31703f99-21a3-ab90-0806-f8c5d02afc26/server/local': mkdir /opt/nomad: read-only file system"
    2022-10-07T14:25:12.827Z [DEBUG] http: request complete: method=POST path=/v1/deployment/fail/2c4f0de1-0e74-3992-a631-bd5ad0dfea0e duration=3.82315ms
    2022-10-07T14:25:12.827Z [DEBUG] http: request complete: method=GET path=/v1/job/whoami/evaluations?index=5273 duration=14.081975561s
    2022-10-07T14:25:12.827Z [DEBUG] http: request complete: method=GET path=/v1/job/whoami/deployment?index=5272 duration=16.084586948s
    2022-10-07T14:25:13.173Z [DEBUG] http: request complete: method=GET path=/v1/job/whoami/evaluations?index=5277 duration=309.350697ms
    2022-10-07T14:25:15.709Z [DEBUG] http: request complete: method=GET path=/v1/job/whoami/evaluations?index=5278 duration=844.510995ms
    2022-10-07T14:25:15.709Z [DEBUG] http: request complete: method=DELETE path=/v1/job/whoami duration=4.548948ms
    2022-10-07T14:25:15.709Z [DEBUG] http: request complete: method=GET path=/v1/job/whoami?index=5272 duration=18.967337523s
    2022-10-07T14:25:15.713Z [DEBUG] http: request complete: method=GET path=/v1/job/whoami/allocations?index=5275 duration=18.785341566s
    2022-10-07T14:25:15.716Z [DEBUG] client: updated allocations: index=5280 total=1 pulled=1 filtered=0
    2022-10-07T14:25:15.716Z [DEBUG] client: allocation updates: added=0 removed=0 updated=1 ignored=0
    2022-10-07T14:25:15.717Z [DEBUG] client: allocation updates applied: added=0 removed=0 updated=1 ignored=0 errors=0
    2022-10-07T14:25:15.756Z [DEBUG] http: request complete: method=GET path=/v1/job/whoami?index=5279 duration="899.087µs"
    2022-10-07T14:25:15.818Z [DEBUG] http: request complete: method=GET path=/v1/job/whoami/allocations?index=5280 duration=59.02424ms
    2022-10-07T14:25:15.915Z [DEBUG] client: updated allocations: index=5283 total=1 pulled=0 filtered=1
    2022-10-07T14:25:15.915Z [DEBUG] client: allocation updates: added=0 removed=0 updated=0 ignored=1
    2022-10-07T14:25:15.915Z [DEBUG] client: allocation updates applied: added=0 removed=0 updated=0 ignored=1 errors=0
    2022-10-07T14:25:15.998Z [DEBUG] client.driver_mgr.docker: failed to start container: driver=docker container_id=6af7138d2caff52e63eef6dc1ee34b083d0b74255dbd0537d8d3b0e1f9a28f61 attempt=5 error="API error (500): error while creating mount source path '/opt/nomad/data/alloc/31703f99-21a3-ab90-0806-f8c5d02afc26/alloc': mkdir /opt/nomad: read-only file system"
    2022-10-07T14:25:16.868Z [DEBUG] http: request complete: method=GET path=/v1/job/whoami/evaluations?index=5279 duration=1.105729ms

We have no idea how to resolve this without deleting all of nomad and then starting from scratch, at which point we’ll no doubt hit the same issue.

Our setup is 4 Ubuntu VMs, all joined via IP, each running both a client and a server from the same config. So fairly simple.
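
For context, each node runs a combined agent with a config roughly along these lines (values trimmed; bootstrap_expect is whatever number of servers we expect):

data_dir  = "/opt/nomad/data"
bind_addr = "0.0.0.0"

server {
  enabled          = true
  bootstrap_expect = 3   # adjust to the number of servers expected
}

client {
  enabled = true
}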

We have also had problems with Docker, where I always need to run sudo chmod 666 /var/run/docker.sock to get Docker to register, which… shouldn’t really be needed.
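
(The usual alternative to chmod 666 on the socket is adding the user running the Nomad client to the docker group, something like:)

sudo usermod -aG docker nomad   # "nomad" here is whichever user runs the Nomad client
sudo systemctl restart nomad

Though if the client ends up running as root anyway, it can talk to the socket without any group changes.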

I am running the pre-compiled binaries for Docker + Nomad.

Hello @mnomitch, not sure if this is something you can help with - this is somewhat blocking us, so it would be good to get an idea/hand on this. I would re-create the cluster, but I’d like to avoid running into the same problem again if possible!

Hey @Edstub207, I’m not sure I can fully diagnose this, but it looks like a permissions issue based on the error message mkdir /opt/nomad: read-only file system. Since you’re running the agent as a client (as well as a server), you’ll have to run it as root. I would try that and see if the issue resolves.

Also, unrelated to the main issue: you won’t get any benefit from having 4 Nomad servers versus 3. This is due to how the Raft algorithm works. So I might do 3 mixed server/client nodes and then 1 client-only node. As it stands, you’ll be replicating to the 4th server without getting any benefit.
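
(For reference: Raft quorum is floor(N/2) + 1, so 3 servers need 2 votes and 4 servers need 3; either way only one server can fail, so the fourth server adds replication work without adding fault tolerance.)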


Yeah, we tried running as the nomad user and the root user. The permissions table looks fine as well, so we’re quite puzzled on this… There are references in this thread to a broken config, but nothing has changed in our config either. If there is anything I can do to help troubleshoot this issue, do let me know!

Thanks for the heads-up re: 3 vs 4. We can scale back to three clients (or three client/servers, depending on what our solution for this read/write issue is).

Hrm… well, I’m stumped, unfortunately. I would dig more into that specific error and why it might appear even when running with sudo, e.g. this Ask Ubuntu thread: mount - How to fix read-only file-system on 18.04 - Ask Ubuntu
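
Something like this on the affected node would confirm whether the filesystem itself was remounted read-only (e.g. after the unplanned shutdown), which no amount of chown/chmod would fix:

mount | grep ' / '                      # look for "ro" in the mount options
dmesg | grep -i 'remount\|read-only'    # kernel messages about the filesystem
sudo mount -o remount,rw /              # temporary; run fsck if errors keep coming back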

Hello @mnomitch, I think I found the problem, potentially. We were re-investigating Nomad today, and when I initially set up the cluster I had to run the Nomad process with sudo, because the nomad user alone couldn’t create some of the dependent directories. I haven’t dug into it much, but it looks like the nomad user maybe doesn’t get the correct permissions when following the step-by-step guide, or maybe I missed a step? As a result, I imagine that when the service stopped, it didn’t have the correct permissions everywhere. I’ve also installed Docker manually rather than via Snap, as that caused problems with other things. Will see how things go.

This can also happen if you have multiple installations of Docker (snap + apt). The issue might not appear when you first install or use Nomad, but it can show up after a reboot when both Docker daemons get started.
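
A quick way to check for a duplicate install is something like:

snap list docker                      # snap-installed Docker, if any
dpkg -l | grep -i docker              # apt/deb-installed Docker packages
systemctl status docker snap.docker.dockerd

and then remove the one you don’t intend to use (e.g. sudo snap remove docker).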