Nomad with rootless Podman cannot start container

Hello folks! I’ve been trying to run Podman in rootless mode with Nomad on a single node that serves as both server and client. Despite researching documentation and blog posts, I haven’t found a solution yet. Below is my Nomad configuration:

datacenter              = "dc1"
data_dir                = "/var/lib/nomad"
plugin_dir              = "/opt/nomad/plugins"

server {
  enabled = true
  bootstrap_expect = 1
}

client {
  enabled        = true
  servers        = ["127.0.0.1"]
}

bind_addr = "0.0.0.0"

ui {
  enabled = true
}

plugin "nomad-driver-podman" {
  config {
    socket_path = "unix:///run/user/1001/podman/podman.sock"
    disable_log_collection = false
  }
}

enable_syslog = true
log_level     = "INFO"
log_file      = "/var/log/nomad-server.log"
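
For completeness, the socket_path above points at the per-user Podman API socket for UID 1001. I enabled it with the standard systemd user units (exact commands from memory, so treat this as a sketch):

loginctl enable-linger pod
systemctl --user enable --now podman.socket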

Podman info:

host:
  arch: amd64
  buildahVersion: 1.31.3
  cgroupControllers: []
  cgroupManager: systemd
  cgroupVersion: v2
  conmon:
    package: conmon-2.1.6-1.module+el8.8.0+18098+9b44df5f.x86_64
    path: /usr/bin/conmon
    version: 'conmon version 2.1.6, commit: 8c4ab5a095127ecc96ef8a9c885e0e1b14aeb11b'
  cpuUtilization:
    idlePercent: 98.22
    systemPercent: 0.77
    userPercent: 1.01
  cpus: 1
  databaseBackend: boltdb
  distribution:
    distribution: '"rhel"'
    version: "8.8"
  eventLogger: file
  freeLocks: 2032
  hostname: podman1.dvlpmike.lab
  idMappings:
    gidmap:
    - container_id: 0
      host_id: 1001
      size: 1
    - container_id: 1
      host_id: 100000
      size: 65536
    - container_id: 65537
      host_id: 165536
      size: 65536
    uidmap:
    - container_id: 0
      host_id: 1001
      size: 1
    - container_id: 1
      host_id: 100000
      size: 65536
    - container_id: 65537
      host_id: 165536
      size: 65536
  kernel: 4.18.0-477.10.1.el8_8.x86_64
  linkmode: dynamic
  logDriver: k8s-file
  memFree: 654635008
  memTotal: 1864368128
  networkBackend: cni
  networkBackendInfo:
    backend: cni
    dns:
      package: podman-plugins-4.6.1-4.module+el8.9.0+20326+387084d0.x86_64
      path: /usr/libexec/cni/dnsname
      version: |-
        CNI dnsname plugin
        version: 1.3.1
        commit: unknown
    package: containernetworking-plugins-1.2.0-1.module+el8.8.0+18060+3f21f2cc.x86_64
    path: /usr/libexec/cni
  ociRuntime:
    name: runc
    package: runc-1.1.4-1.module+el8.8.0+18060+3f21f2cc.x86_64
    path: /usr/bin/runc
    version: |-
      runc version 1.1.4
      spec: 1.0.2-dev
      go: go1.19.4
      libseccomp: 2.5.2
  os: linux
  pasta:
    executable: ""
    package: ""
    version: ""
  remoteSocket:
    exists: true
    path: /run/user/1001/podman/podman.sock
  security:
    apparmorEnabled: false
    capabilities: CAP_NET_RAW,CAP_CHOWN,CAP_DAC_OVERRIDE,CAP_FOWNER,CAP_FSETID,CAP_KILL,CAP_NET_BIND_SERVICE,CAP_SETFCAP,CAP_SETGID,CAP_SETPCAP,CAP_SETUID
    rootless: true
    seccompEnabled: true
    seccompProfilePath: /usr/share/containers/seccomp.json
    selinuxEnabled: true
  serviceIsRemote: false
  slirp4netns:
    executable: /usr/bin/slirp4netns
    package: slirp4netns-1.2.1-1.module+el8.9.0+20326+387084d0.x86_64
    version: |-
      slirp4netns version 1.2.1
      commit: 09e31e92fa3d2a1d3ca261adaeb012c8d75a8194
      libslirp: 4.4.0
      SLIRP_CONFIG_VERSION_MAX: 3
      libseccomp: 2.5.2
  swapFree: 1719660544
  swapTotal: 1719660544
  uptime: 8h 42m 9.00s (Approximately 0.33 days)
plugins:
  authorization: null
  log:
  - k8s-file
  - none
  - passthrough
  - journald
  network:
  - bridge
  - macvlan
  - ipvlan
  volume:
  - local
registries:
  search:
  - registry.access.redhat.com
  - registry.redhat.io
  - docker.io
store:
  configFile: /home/pod/.config/containers/storage.conf
  containerStore:
    number: 11
    paused: 0
    running: 0
    stopped: 11
  graphDriverName: overlay
  graphOptions: {}
  graphRoot: /home/pod/.local/share/containers/storage
  graphRootAllocated: 13742637056
  graphRootUsed: 4519514112
  graphStatus:
    Backing Filesystem: xfs
    Native Overlay Diff: "true"
    Supports d_type: "true"
    Using metacopy: "false"
  imageCopyTmpDir: /var/tmp
  imageStore:
    number: 4
  runRoot: /run/user/1001/containers
  transientStore: false
  volumePath: /home/pod/.local/share/containers/storage/volumes
version:
  APIVersion: 4.6.1
  Built: 1696868155
  BuiltTime: Mon Oct  9 12:15:55 2023
  GitCommit: ""
  GoVersion: go1.20.6
  Os: linux
  OsArch: linux/amd64
  Version: 4.6.1

And the Nomad job (based on a template):

job "hello-world" {
  datacenters = ["*"]

  meta {
    foo = "bar"
  }
  group "servers" {
    count = 1

    network {
      port "www" {
        to = 8001
      }
    }

    service {
      provider = "nomad"
      port     = "www"
    }
    task "web" {
      driver = "podman"

      config {
        image   = "busybox:1"
        command = "httpd"
        args    = ["-v", "-f", "-p", "${NOMAD_PORT_www}", "-h", "/local"]
        ports   = ["www"]
      }
      template {
        data        = <<-EOF
                      <h1>Hello, Nomad!</h1>
                      <ul>
                        <li>Task: {{env "NOMAD_TASK_NAME"}}</li>
                        <li>Group: {{env "NOMAD_GROUP_NAME"}}</li>
                        <li>Job: {{env "NOMAD_JOB_NAME"}}</li>
                        <li>Metadata value for foo: {{env "NOMAD_META_foo"}}</li>
                        <li>Currently running on port: {{env "NOMAD_PORT_www"}}</li>
                      </ul>
                      EOF
        destination = "local/index.html"
      }
      resources {
        cpu    = 50
        memory = 64
      }
    }
  }
}

I can successfully run containers manually as the non-root pod user; for example, a plain run like the one below works fine (illustrative command, not the exact one I used):
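
podman run --rm docker.io/library/busybox:1 echo "rootless podman works"

However, when I attempt to run the Nomad job, the allocation fails with the following error: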

ID                     = a0435ff0-2a62-4c5b-9d62-53ea16e1246b
Eval ID                = 6ee4134c
Name                   = hello-world.servers[0]
Node ID                = ed911883
Node Name              = podman1.dvlpmike.lab
Job ID                 = hello-world
Job Version            = 0
Client Status          = failed
Client Description     = Failed tasks
Desired Status         = run
Desired Description    = <none>
Created                = 1m51s ago
Modified               = 1m45s ago
Deployment ID          = d17462f6
Deployment Health      = unhealthy
Reschedule Eligibility = 2m10s from now

Allocation Addresses:
Label  Dynamic  Address
*www   yes      192.168.1.204:21754 -> 8001

Task "web" is "dead"
Task Resources:
CPU     Memory  Disk     Addresses
50 MHz  64 MiB  300 MiB

Task Events:
Started At     = N/A
Finished At    = 2024-01-15T08:08:02Z
Total Restarts = 0
Last Restart   = N/A

Recent Events:
Time                       Type            Description
2024-01-15T03:08:02-05:00  Not Restarting  Error was unrecoverable
2024-01-15T03:08:02-05:00  Driver Failure  rpc error: code = Unknown desc = failed to start task, could not start container: cannot start container, status code: 500: {"cause":"exit status 1","message":"exit status 1","response":500}
2024-01-15T03:08:01-05:00  Task Setup      Building Task Directory
2024-01-15T03:08:01-05:00  Received        Task received by client

Has anyone encountered a similar issue and knows what the problem might be? 🙂

Hi @dvlpmike,

Could you share the DEBUG-level logs from your agent?

Run:

nomad monitor -log-level debug | tee nomad-logs.txt

and then submit the job so that we can see why it is failing.

@Ranjandas thanks for the response. I’m attaching the logs from Nomad.
nomad-logs.txt (21.6 KB)

Hi @dvlpmike,

I tried your config and ran the job you shared, and I ran into issues as well. While I don’t know the root cause, I got it working by setting disable_log_collection = true and restarting the agent.
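
With that change, the plugin stanza from your config becomes:

plugin "nomad-driver-podman" {
  config {
    socket_path = "unix:///run/user/1001/podman/podman.sock"
    disable_log_collection = true
  }
}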

Could you try the same and see if it works? Hopefully, someone with expertise in this integration can help you with the root cause and a proper fix.

Now I’m getting another error:

2024-01-15T07:07:32-05:00  Not Restarting  Error was unrecoverable
2024-01-15T07:07:32-05:00  Driver Failure  rpc error: code = Unknown desc = failed to start task, could not start container: cannot start container, status code: 500: {"cause":"OCI runtime attempted to invoke a command that was not found","message":"runc: runc create failed: unable to start container process: error during container init: error setting cgroup config for procHooks process: openat2 /sys/fs/cgroup/user.slice/user-1001.slice/user@1001.service/nomad.slice/libpod-44409b8e69e6f876e6ad3127457ab84938b5a0188e8b1b2ad3817991050a8958.scope/memory.swap.max: no such file or directory: OCI runtime attempted to invoke a command that was not found","response":500}
2024-01-15T07:07:31-05:00  Task Setup      Building Task Directory
2024-01-15T07:07:31-05:00  Received        Task received by client

I could reproduce the issue and managed to find the fix. You have to perform the step linked below so that the cgroup controllers are delegated to non-root users. Note that your podman info shows cgroupControllers: [], which confirms that no controllers are currently delegated to your user.

ref: [Optional] cgroup v2 | Rootless Containers

After adding the delegate.conf and running systemctl daemon-reload, I resubmitted the Nomad job and it worked.
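
For anyone who lands here later, the delegation step from the linked guide boils down to a systemd drop-in (contents copied from the Rootless Containers guide):

sudo mkdir -p /etc/systemd/system/user@.service.d
cat <<EOF | sudo tee /etc/systemd/system/user@.service.d/delegate.conf
[Service]
Delegate=cpu cpuset io memory pids
EOF
sudo systemctl daemon-reload

You can verify the delegation afterwards with:

cat /sys/fs/cgroup/user.slice/user-1001.slice/user@1001.service/cgroup.controllers

which should now list the delegated controllers (e.g. memory and pids) instead of being empty.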

I hope this helps!

It’s better, but after a lot of testing I assume the issue is related to the OS version; jobs are still not functioning. I’ll try another OS version that has cgroup v2 enabled by default.
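
For reference, a quick way to check which cgroup version a host is using:

stat -fc %T /sys/fs/cgroup/

This prints cgroup2fs on a cgroup v2 (unified) host and tmpfs on cgroup v1.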