Sudden increase in Nomad server memory

I had a production server issue; the following is the memory profile of the servers:

The compute profile is mostly periodic jobs (100+).

Description in brief:

  • the same server and compute configuration had been running without incident for 6+ months.
  • I happened to update the servers to Nomad 1.3.0 on May 13th.
  • the servers started hitting their 8 GB memory limit around May 18th, causing loss of the leader.
  • rebooting the servers helped, but only for a few minutes.
  • so we manually changed the server instance type from large to 4xlarge.

My questions are:

  • What would be the possible reason for the memory increase?
    What metric to keep an eye on? And set an alert for the future?

  • How to debug this if/when this occurs again?

  • We already have basic alerts on memory, cluster leader changes, etc.; I was wondering what logs can be captured in the future.

We plan to scale down to 2xlarge, maybe after monitoring this for a week.

Hey @shantanugadgil!

What would be the possible reason for the memory increase?

It is very hard to say, as 1.3.0 introduced a number of changes, although I wouldn’t expect any of them to have caused such a significant change. If something within Nomad did, we should investigate and figure out what can be improved.

1.3.0 did migrate to a new BoltDB storage engine; however, I would expect this migration to cause a single initial increase in memory which would then go away once the migration steps have completed. Is this problem still ongoing, or has memory usage fallen back to a typical level?

I’d like to note for other readers that memory increases within Nomad servers can also be caused by normal operations, such as a large influx of evaluations or leadership elections. In these cases, memory usage should return to normal levels quickly once the spike of work has been completed.

What metric to keep an eye on? And set an alert for the future?

I would say that what your graph is showing is a good indicator to track and keep an eye on. Ideally the alert would monitor 95th-percentile changes over a time range, to accommodate fluctuations that may occur during normal Nomad activity.
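
As a concrete starting point, the servers’ own runtime gauges (for example nomad.runtime.alloc_bytes) can be pulled from the metrics API and fed into whatever alerting stack you already use. A minimal sketch, assuming the default HTTP port and that jq is installed:

```
# Pull current telemetry from a server and filter the Go runtime memory gauges,
# e.g. nomad.runtime.alloc_bytes (bytes currently allocated by the server process).
curl -s localhost:4646/v1/metrics | \
  jq '.Gauges[] | select(.Name | startswith("nomad.runtime"))'
```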

How to debug this if/when this occurs again?

Increases in Nomad memory usage are tricky to understand, particularly when they are tied to a potential version change like this. The operator debug command is useful for capturing details about the cluster state. Ideally in this situation, multiple bundles would be captured, both before and after you restart the servers; this would help identify any clear differences or growth in a particular subsystem's resource usage. To run this command, enable_debug needs to be set to true in the agent configuration.
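
A sketch of what that capture could look like, assuming default addresses; the durations and output paths below are illustrative, and nomad operator debug -h lists the full set of flags:

```
# Capture a debug bundle from all servers before restarting them...
nomad operator debug -duration=2m -interval=30s -server-id=all -output=./nomad-debug-before
# ...then restart the servers and capture a second bundle to compare against.
nomad operator debug -duration=2m -interval=30s -server-id=all -output=./nomad-debug-after
```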

When enable_debug is set to true, Nomad exposes a number of pprof endpoints which can also be called independently. We don’t document these; however, they can be found under /debug/pprof/, and an example heap dump can be collected via curl localhost:4646/debug/pprof/heap > nomad-debug/heap.out.
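
Once you have a heap profile, the standard Go tooling can summarise it. A sketch, assuming the Go toolchain is available on the machine doing the analysis:

```
# Show which functions retain the most memory in the captured heap profile.
go tool pprof -top nomad-debug/heap.out
# Or explore it interactively in a browser (the port choice is arbitrary).
go tool pprof -http=:8081 nomad-debug/heap.out
```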

We already have basic alerts on memory, cluster leader changes, etc.; I was wondering what logs can be captured in the future.

Logs are potentially of limited use for identifying problems in this situation. That being said, having logs from the servers and clients available is extremely useful when debugging cluster problems. Having them available in a tool like Grafana, with a dashboard already filtering on warn- and error-level log lines, is a great place to start.
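
If you don’t have log aggregation in place yet, even a quick filter on the host covers the basics. A sketch, assuming Nomad runs under systemd as a unit named nomad:

```
# Surface warn/error level lines from the Nomad agent for the last hour.
journalctl -u nomad --since "1 hour ago" | grep -E '\[(WARN|ERROR)\]'
```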

If this is still happening and you’re able to capture debug information and bundles, please let me know, and send large datasets via nomad-oss-debug@hashicorp.com.

Thanks,
jrasell and the Nomad team


Sorry for necromancing this thread, but the eventual discovery was that I had Prometheus metrics enabled on the servers.

Disabling the Prometheus metrics on the servers has brought the memory consumption down to a much saner value.
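
For reference, disabling it amounts to something like the following in the server agent configuration; the file path and restart command are assumptions about the deployment, so adjust for your setup:

```
# Illustrative only: turn off Prometheus-format metrics on the server agents.
cat > /etc/nomad.d/telemetry.hcl <<'EOF'
telemetry {
  prometheus_metrics = false
}
EOF
sudo systemctl restart nomad
```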

At the moment I do not remember whether I enabled Prometheus for the servers along with the 1.3 upgrade.

Updating comments in case anyone stumbles upon this thread.

I’ve disabled Prometheus, but the RAM usage still shows an increasing trend. Additionally, I’ve implemented garbage collection (GC) as a measure. Currently, I’m running a batch job involving 4000 groups with a single task.
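
For anyone following along, one way to trigger garbage collection manually is the system gc command (just a sketch; tuning the servers’ GC thresholds in the agent configuration is another route):

```
# Force a cluster-wide garbage collection of dead jobs, evaluations, and nodes.
nomad system gc
```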

4000 is much more than what I have. But the profile of “large no. of batch jobs” seems similar.


So I have written a little sub-scheduler with Nextflow and Python to submit batches of 100 groups; this seems to have helped.
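
As a rough illustration of the batching idea only (the actual sub-scheduler is Nextflow/Python; the parameterized job name "worker", the input file, and the meta key below are hypothetical):

```
# Split the work items into chunks of 100 and dispatch one parameterized job per
# chunk, instead of submitting a single batch job with 4000 groups.
split -l 100 all_inputs.txt chunk_
for f in chunk_*; do
  nomad job dispatch -meta inputs="$f" worker
done
```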