Tips for running nomad in resource-constrained environments?

My question is similar to the following, but I’d like to get into the details of the failure modes of Nomad, how they would impact a running cluster, and the subsequent recovery:

Context: I’m trying to set up Nomad in a production environment that is budget-constrained. The primary reason for using Nomad is to have an auto-scaling cloud setup, so that we pay only for the compute capacity that we actually need. This also means that we don’t have the budget for 3x 8GB cloud instances just for the Nomad servers. Consider this “cloud-bill golfing” at its finest (just like “code golfing”).

Questions:

  • Can nomad work on a 2vCPU/2GB/20GB single cloud server for extended periods of time? (I have already set this up and the basic tests are working, but I’m not sure of nomad’s resource requirements over longer periods of time).
  • For a setup that is expected to auto-scale between 5-10 cloud servers, how much RAM and disk would the single nomad server consume?
  • What are the typical failure reasons for a nomad server?
  • When the single nomad server fails, do existing jobs/services continue to run?
  • Is there any way to restart the single nomad server such that it re-builds the current state of the cluster by querying all the clients?
  • Can a nomad client work on a 2vCPU/2GB/20GB machine alongside a docker daemon and the actual jobs/services? How much memory, CPU, and disk does the nomad client consume?
  • What are the typical failure reasons for a nomad client?
  • When a nomad client fails, do the jobs/services running on that machine continue to run?

Hi @saurabhnanda :wave:

I will try to answer your questions, but please keep in mind that we don’t have any long-running tests of Nomad in resource-constrained environments, so anything outside of our production recommendations is not considered supported.

In theory yes, but it depends on your expected load, more specifically the number of jobs and allocations you expect to have.

In order to achieve high scheduling throughput, Nomad keeps its state in an in-memory database, which means that, as your cluster size grows, so will your expected memory usage.

Again, this is hard to say. I think the number of clients won’t actually impact memory usage that much. Number of jobs and allocations would be more impactful.

Good question :thinking:

I think that memory pressure would be an issue due to the in-memory database. Running out of disk space would also be problematic since the servers also create snapshots on disk to recover the in-memory state.

Network latency between server and clients could also cause heartbeats to be missed, and so the server would think the client is lost.
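
If network latency does turn out to be a problem, you can make the server a bit more tolerant of missed heartbeats. A rough sketch of the relevant server options (the values here are just illustrative, not recommendations):

server {
  enabled = true

  # Extra grace period beyond a client's heartbeat TTL before the
  # server marks it as down (the default is 10s).
  heartbeat_grace = "30s"

  # Lower bound on how frequently clients are asked to heartbeat.
  min_heartbeat_ttl = "10s"
}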

These are the ones I could think of, but I will check with the rest of the team for others.

Yes, the tasks on the client will keep running, but they may be stopped if the server doesn’t come back with the same state. So, for example, if the server VM crashes, you need to make sure the new VM will have the same server data.

You could use the nomad operator snapshot commands to save and restore the data, but you will need to make sure the clients don’t connect to the server before the snapshot is loaded. Otherwise the clients will stop their tasks, since those tasks won’t be present in the server’s state; they will come back after the snapshot is restored.

I think the snapshot that I mentioned before is what you are looking for here?

This will depend on your own workloads. In client mode, a Nomad agent doesn’t consume as many resources as a server, since it doesn’t need to keep track of cluster state.

I don’t have a good number to give you in terms of how many resources it consumes, but I would say, in a very casual and unscientific way, that it consumes very little.
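
One thing that can help on a small machine is telling the scheduler to set aside some capacity for the Nomad agent, the Docker daemon, and the OS, so tasks don’t get packed into it. A minimal sketch of the client’s reserved block (the numbers are just illustrative):

client {
  enabled = true

  # Capacity the scheduler will not hand out to tasks, leaving
  # headroom for the agent, the Docker daemon, and the OS.
  reserved {
    cpu    = 500  # MHz
    memory = 256  # MB
    disk   = 1024 # MB
  }
}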

Again, very good question :slightly_smiling_face:

I think for clients, disk space could be an issue, since there can be a lot to write, like application logs. So making sure your data_dir path has enough space is important.
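
The client also garbage collects terminal allocation directories on its own, and you can tune when that happens. A rough sketch (the data_dir path and thresholds are just examples):

# Agent-level data directory; make sure the disk backing it has
# enough space for allocation data and task logs.
data_dir = "/opt/nomad/data"

client {
  enabled = true

  # Start collecting terminal allocations once disk usage crosses
  # this percentage (the default is 80).
  gc_disk_usage_threshold = 70

  # Also cap how many allocation directories are kept around.
  gc_max_allocs = 50
}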

Making sure your task dependencies are healthy is important as well. For example, if you are using the Docker task driver, you will need to make sure the Docker daemon is healthy.

Hum…it depends on the failure. In general, the Nomad client will try to clean up any tasks it is responsible for before exiting. If it’s an unexpected and sudden crash, it may not have the opportunity to do so, and the tasks will be left running.

If the client data_dir is still available it will try to reconnect to those tasks so they are not left orphaned, but there are no guarantees.

I hope these answer your questions :slightly_smiling_face:

Let us know if you have anything else that you would like to know.

Thanks for your detailed reply @lgfa29

please keep in mind that we don’t have any long-running tests of Nomad in resource-constrained environments, so anything outside of our production recommendations is not considered supported.

I completely understand, which is why I’m trying to gather more information so that I know the trade-offs that I’m making and the additional risks that I’m introducing in the system.

In theory yes, but it depends on your expected load, more specifically the number of jobs and allocations you expect to have. In order to achieve high scheduling throughput, Nomad keeps its state in an in-memory database, which means that, as your cluster size grows, so will your expected memory usage.

Again, this is hard to say. I think the number of clients won’t actually impact memory usage that much. Number of jobs and allocations would be more impactful.

Would it be possible to point me to the in-memory DB’s schema (perhaps on GitHub), so that I can estimate memory consumption for my use-case?

Do you think the memory consumption depends on the number of currently running jobs OR the total number of job allocations made in the history of the cluster? I presume it’s the former; otherwise all clusters would eventually run out of memory.

Running out of disk space would also be problematic since the servers also create snapshots on disk to recover the in-memory state.

On the disk usage side, is the complete log (the Raft log, is it?) stored on disk forever? If it is, then wouldn’t all clusters eventually run out of disk space? On the other hand, if the disk is used to store a snapshot of the in-memory DB (and we have control over how many historical snapshots are maintained on disk before being deleted), then I don’t foresee disk usage being a problem.

Yes, the tasks on the client will keep running, but they may be stopped if the server doesn’t come back with the same state. So, for example, if the server VM crashes, you need to make sure the new VM will have the same server data.

Actually my question about re-building a server was the opposite. Say the single-node cluster is completely lost (including disk snapshots), but my clients are still running. Can I make the clients join a fresh (blank) Nomad server, where the Nomad server initially rebuilds its in-memory DB based on each client’s current state?

Sure, the database we use is called go-memdb and here are the schemas for Nomad.

Oh yeah, that’s something that I forgot to mention. Nomad has a garbage collector that will remove old Raft log entries, so memory usage mostly reflects the currently running jobs, plus some historical ones until the GC runs.

Since you are in a resource-constrained environment, it may be good to fine-tune the GC using the *_gc configuration options.
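
For example, a rough sketch of what that tuning could look like in the server block (the values are just illustrative, not recommendations):

server {
  enabled = true

  # How often the server garbage collector runs (default 5m).
  job_gc_interval = "5m"

  # How long dead jobs, evaluations, and deployments must be
  # terminal before they become eligible for collection.
  job_gc_threshold        = "1h"
  eval_gc_threshold       = "1h"
  deployment_gc_threshold = "1h"

  # How long down nodes are kept before being purged (default 24h).
  node_gc_threshold = "12h"
}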

Snapshots of the Raft log are stored to disk, so when the GC runs, the snapshot will usually decrease in size as well. I believe old snapshots do get removed from time to time. But snapshots themselves can grow arbitrarily large depending on how much you have in your cluster, so it’s always good to keep an eye on disk.

Ahh got it. The answer is no. The servers are the source of truth for the cluster (to be more precise, the server leader, but you only have one so it doesn’t matter in your case). If a client joins an empty server, it will interpret that as an instruction to stop all of its running allocations.

Here are some results from my unscientific experiment:

  • Single server of 2vCPU / 2GB / 40GB
  • Single client of 2vCPU / 2GB / 40GB
  • I kept a periodic batch job running every 20s for approx 15h 30m (exact job spec given at the end of the post)
  • Growth in memory usage of nomad agent on the server:
    • VIRT: 1275 MB => 1375 MB
    • RES: 85 MB => 103 MB
    • SHR: 53 MB => 65 MB
  • Growth in disk usage of the server: ~2 MB => 33 MB

I tried the nomad system gc command, expecting it to bring memory and disk usage back close to the starting values, but I was surprised to observe the following:

  1. Memory usage (on the server) did not reduce after GC
  2. Disk usage (on the server) did not reduce after GC. The data/server/raft folder kept growing throughout my experiment, but did not seem to have been GC-ed.
  3. The number of dead entries in nomad job status dropped to zero after GC

The last observation is expected, whereas the first two observations are unexpected.

Job spec for my experiment

job "batch" {   
  datacenters = ["dc1"]

  type = "batch"

  periodic {    
    // Launch every 20 seconds
    cron = "*/20 * * * * * *"

    // Do not allow overlapping runs.
    prohibit_overlap = true
  }             

  group "batch" {
    count = 1

    task "foobar" {
        driver = "docker"
        config {
          image = "ubuntu:18.04"
          command = "/bin/bash"
          args = ["-l", "-c", "sleep $(( $RANDOM % 30 )); echo 'done' | logger -T -n REDACTED -P REDACTED"]
        }
    }
  }
}

I’m not too familiar with that part of the code, but I think that’s the expected result. nomad system gc will call forceGC, which will remove old jobs, evals, deployments, etc. from memdb.

The memory and disk usage reductions would come later, from Go’s own GC and Raft’s log management, and I don’t think there’s a way to control them.

I left it running for another 24 hours, and the memory usage grew to approx 220 MB. No amount of nomad system gc would bring down the memory usage. However, restarting the nomad server got me to ~100MB usage.

I’m concerned about the monotonically increasing disk and memory usage. Eventually every system is going to run out of memory and disk irrespective of its capacity, right?

Is there no way to prune the raft logs and to force a GC for the nomad server process?

Sorry @saurabhnanda, I missed your reply here.

I don’t think there’s a way to force a GC here. The process memory is handled by the Go runtime. There are several documents on the Internet about Go’s GC, though I can’t think of any in particular to recommend to you :thinking:

Also, out of curiosity, how are you measuring memory usage? Nomad can emit its own metrics, which would be handy to plot to see their progress over time.
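
For reference, a minimal sketch of the agent telemetry block that publishes metrics (Prometheus is just one of the supported sinks):

telemetry {
  collection_interval        = "10s"
  publish_allocation_metrics = true
  publish_node_metrics       = true
  prometheus_metrics         = true
}

The nomad.runtime.* metrics include the agent’s own memory usage (for example nomad.runtime.alloc_bytes).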

Not exactly related to the query, but to reduce the noise of periodic (cron) jobs, I run a raw_exec periodic job which does a system gc and a system reconcile summaries :stuck_out_tongue:
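
In case it is useful, a sketch of that kind of job looks roughly like this (it assumes the raw_exec driver is enabled on the client and that the nomad binary is on the PATH):

job "nomad-gc" {
  datacenters = ["dc1"]

  type = "batch"

  periodic {
    // Run the cleanup once an hour.
    cron             = "0 * * * *"
    prohibit_overlap = true
  }

  group "gc" {
    task "system-gc" {
      driver = "raw_exec"

      config {
        command = "/bin/sh"
        args    = ["-c", "nomad system gc && nomad system reconcile summaries"]
      }
    }
  }
}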