Resources exhausted on fresh deployment

Hello all:
First-time poster, and I’m evaluating Nomad (v1.8.1).
I am using 5 VMware Fusion (v13.5.2) VMs on macOS Sonoma to mimic a small cluster with three servers and two clients.
For the container engine, I am using Podman (v5.1.1) on the client nodes, and all nodes run Ubuntu 24.04.

The servers and clients are all up and report alive/ready status:

$ nomad server members
Name               Address        Port  Status  Leader  Raft Version  Build  Datacenter  Region
nomad-srv1.global  192.168.20.43  4648  alive   false   3             1.8.1  lab         global
nomad-srv2.global  192.168.20.59  4648  alive   false   3             1.8.1  lab         global
nomad-srv3.global  192.168.20.32  4648  alive   true    3             1.8.1  lab         global
$
$ nomad node status
ID        Node Pool  DC   Name           Class   Drain  Eligibility  Status
bb0a9563  default    lab  nomad-client2  <none>  false  eligible     ready
35712c79  default    lab  nomad-client1  <none>  false  eligible     ready

Podman integration looks ok:

$ nomad node status -verbose 35712c79 | grep -i "podman"
podman    true      true     ready                               2024-06-29T17:07:15Z
driver.podman                   = 1
driver.podman.cgroupVersion     = v2
driver.podman.rootless          = false
driver.podman.version           = 5.1.1

I am following this HashiCorp Podman guide, but when I attempt to run a job, it hangs and complains about a lack of resources:

$ nomad job run --verbose nginx.nomad
==> 2024-06-29T17:27:29Z: Monitoring evaluation "712a7306-cbba-d799-718d-1c5e128c967a"
    2024-06-29T17:27:29Z: Evaluation triggered by job "nginx-podman-job"
    2024-06-29T17:27:30Z: Evaluation within deployment: "3189840b-ec6e-efde-b9c4-60a6b7c9fe6e"
    2024-06-29T17:27:30Z: Evaluation status changed: "pending" -> "complete"
==> 2024-06-29T17:27:30Z: Evaluation "712a7306-cbba-d799-718d-1c5e128c967a" finished with status "complete" but failed to place all allocations:
    2024-06-29T17:27:30Z: Task Group "nginx-group" (failed to place 1 allocation):
      * Resources exhausted on 2 nodes
      * Dimension "cpu" exhausted on 2 nodes
    2024-06-29T17:27:30Z: Evaluation "a0d60e5c-d964-5fb1-f4e6-eba9b8a93ccd" waiting for additional capacity to place remainder
==> 2024-06-29T17:27:30Z: Monitoring deployment "3189840b-ec6e-efde-b9c4-60a6b7c9fe6e"
  ⠧ Deployment "3189840b-ec6e-efde-b9c4-60a6b7c9fe6e" in progress...

    2024-06-29T17:42:55Z
    ID          = 3189840b-ec6e-efde-b9c4-60a6b7c9fe6e
    Job ID      = nginx-podman-job
    Job Version = 0
    Status      = running
    Description = Deployment is running

    Deployed
    Task Group   Desired  Placed  Healthy  Unhealthy  Progress Deadline
    nginx-group  1        0       0        0          N/A

    Allocations
    No allocations placed^C

I validated that CPU and memory usage are minimal:

$ nomad node status -self
ID              = bb0a9563-c3c8-9557-0931-09578192389d
Name            = nomad-client2
Node Pool       = default
Class           = <none>
DC              = lab
Drain           = false
Eligibility     = eligible
Status          = ready
CSI Controllers = <none>
CSI Drivers     = <none>
Uptime          = 42m21s
Host Volumes    = <none>
Host Networks   = <none>
CSI Volumes     = <none>
Driver Status   = exec,podman

Node Events
Time                  Subsystem  Message
2024-06-29T17:05:36Z  Cluster    Node reregistered by heartbeat
2024-06-29T17:05:03Z  Cluster    Node heartbeat missed
2024-06-29T15:14:35Z  Cluster    Node registered

Allocated Resources
CPU      Memory       Disk
0/0 MHz  0 B/7.7 GiB  0 B/5.3 GiB

Allocation Resource Utilization
CPU      Memory
0/0 MHz  0 B/7.7 GiB

Host Resource Utilization
CPU      Memory           Disk
0/0 MHz  216 MiB/7.7 GiB  (/dev/mapper/ubuntu--vg-ubuntu--lv)

Allocations
No allocations placed

Here is the job I attempted to deploy:

job "nginx-podman-job" {
  datacenters = ["lab"]
  type = "service"

  group "nginx-group" {
    count = 1

    task "nginx-task" {
      driver = "podman"

      config {
        image = "docker.io/library/nginx:latest"
      }

      resources {
        cpu = 500
        memory = 256
      }
    }
  }
}

The client VMs are configured with 4 CPU cores and 8 GB of memory, so I am not sure what the issue could be.
The --verbose flag on the job run doesn’t point me to the cause either.
Here is what the configuration of a client node looks like:

$ cat /etc/nomad.d/nomad.hcl
datacenter = "lab"
data_dir   = "/opt/nomad/data"
plugin_dir = "/opt/nomad/plugins"
plugin "nomad-driver-podman" {
  config {
      socket_path = "unix:///run/podman/podman.sock"
      # Customize other Podman driver plugin options here if needed
  }
}

$ cat /etc/nomad.d/client.hcl
client {
  enabled = true
  servers = ["192.168.20.43", "192.168.20.59", "192.168.20.32"]
}

Can you provide any feedback on anything I may have overlooked?

Thank you

Hi. Nomad detected 0/0 MHz of CPU available; see your node status output.

Are you running Nomad as root? Do you have cgroups v2? Check and post your Nomad client logs, and enable debug logging there. There should be an error in the Nomad client logs saying that it can’t find the CPU MHz, or something similar.
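
If it helps, the fingerprinted CPU capacity can also be checked directly from the node attributes (node ID taken from your earlier output; attribute names from memory, so treat this as a sketch):

$ nomad node status -verbose 35712c79 | grep -i "cpu"
# look for attributes such as cpu.numcores, cpu.frequency and cpu.totalcompute;
# if cpu.totalcompute is 0, the scheduler has no CPU capacity to place allocations against.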

Hello @Kamilcuk
Thanks for responding. That is weird…

I was not interactively running nomad as root. I tried again as root but received the same results.
I’ve made two changes since my original post.

  1. I downgraded Podman from 5.1.1 to 4.9.3. (I had built 5.1.1 from source, since v5 is unavailable as a package in Ubuntu 24.04.) The same issue still occurs.
  2. On the client nodes, the configuration has been changed to capture logs (hopefully, the log_level directive is what you referred to):
$ cat nomad.hcl
datacenter = "lab"
data_dir   = "/opt/nomad/data"
log_level  = "DEBUG"
plugin_dir = "/opt/nomad/plugins"
plugin "nomad-driver-podman" {
  config {
      socket_path = "unix:///run/podman/podman.sock"
      # Customize other Podman driver plugin options here if needed
  }
}

I can confirm that cgroups v2 are available:

$ cat /sys/fs/cgroup/cgroup.controllers
cpuset cpu io memory hugetlb pids rdma misc

After changing the client configuration file and restarting Nomad, I can confirm that it detects the CPU cores, but I didn’t see anything about CPU frequency.

$ sudo journalctl -xeu nomad -g "CPU core count"
Jun 30 10:33:16 nomad-client1 nomad[11395]:     2024-06-30T10:33:06.402Z [DEBUG] client.fingerprint_mgr.cpu: detected CPU core count: cores=4
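
If the frequency genuinely can’t be fingerprinted inside the VM, my understanding from the client docs is that the total compute can be pinned manually in the client stanza. A rough sketch of what I would try next (untested on my side; 4 cores x ~2000 MHz is only an estimate for these VMs):

client {
  enabled = true
  servers = ["192.168.20.43", "192.168.20.59", "192.168.20.32"]

  # Override the fingerprinted CPU capacity when detection reports 0 MHz.
  # Estimated as 4 cores x 2000 MHz for this VM.
  cpu_total_compute = 8000
}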

While reviewing the logs, I did come across this warning:

Jun 30 10:33:16 nomad-client1 nomad[11395]:     2024-06-30T10:33:06.400Z [WARN]  client.fingerprint_mgr: failed to detect bridge kernel module, bridge network mode disabled:
Jun 30 10:33:16 nomad-client1 nomad[11395]:   error=
Jun 30 10:33:16 nomad-client1 nomad[11395]:   | 4 errors occurred:
Jun 30 10:33:16 nomad-client1 nomad[11395]:   | \t* failed to find /sys/module/bridge: stat /sys/module/bridge: no such file or directory
Jun 30 10:33:16 nomad-client1 nomad[11395]:   | \t* module bridge not in /proc/modules
Jun 30 10:33:16 nomad-client1 nomad[11395]:   | \t* module bridge not in /lib/modules/6.8.0-36-generic/modules.builtin
Jun 30 10:33:16 nomad-client1 nomad[11395]:   | \t* module bridge not in /lib/modules/6.8.0-36-generic/modules.dep
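
In case that warning is only a missing kernel module, I assume loading it manually and persisting it across reboots would be enough; something like this (the file name under modules-load.d is arbitrary, and I haven’t tried it on these VMs yet):

$ sudo modprobe bridge                                        # load the module now
$ echo "bridge" | sudo tee /etc/modules-load.d/bridge.conf    # load it automatically at boot
$ lsmod | grep bridge                                         # confirm it is loaded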

I did find a similar discussion here.
I’m unsure whether this is a Nomad-specific issue, but let me try installing Nomad on actual hardware. If the issue persists, I’ll come back here.

Thanks

Check whether the file referenced here exists: nomad/client/lib/numalib/detect_linux.go at bbd1bb3485858acc0ce0de9dcb32c9aeca091d3a · hashicorp/nomad · GitHub.
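
If I read that detection code correctly, it derives the frequency from the cpufreq entries in sysfs, so a quick check on a client VM would be something like this (the exact paths are my assumption about what the fingerprinter reads):

$ ls /sys/devices/system/cpu/cpu0/cpufreq/ 2>/dev/null || echo "no cpufreq directory"
$ cat /sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_max_freq 2>/dev/null
# on many hypervisors the cpufreq directory does not exist, which would explain the 0 MHz.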

@Kamilcuk I have not tried your suggestion but did complete a physical hardware deployment.

I believe my error was caused by installing older CNI plugins (v1.0.1, to be exact; I don’t remember where I sourced them).
Once I installed v1.5.0, everything fell into place.
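
For anyone who hits the same thing, the CNI install on my clients now looks roughly like this (version and architecture match my setup, so adjust as needed; /opt/cni/bin is Nomad’s default cni_path as far as I know):

$ curl -L -o cni-plugins.tgz \
    https://github.com/containernetworking/plugins/releases/download/v1.5.0/cni-plugins-linux-amd64-v1.5.0.tgz
$ sudo mkdir -p /opt/cni/bin
$ sudo tar -C /opt/cni/bin -xzf cni-plugins.tgz
$ ls /opt/cni/bin | head   # bridge, loopback, portmap, etc. should be present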

So for now, let’s consider this matter closed.

Thanks for the assistance!