Feature #4946

art memory use: multiple subruns

Added by Andrei Gaponenko about 6 years ago. Updated over 4 years ago.

Status: Closed
Priority: High
Assignee:
Category: I/O
Target version:
Start date:
Due date:
% Done: 100%
Estimated time: 40.00 h
Spent time:
Scope: Internal
Experiment: Mu2e
SSI Package: art
Duration:

Description

Hello,

Some Mu2e jobs need to process input datasets containing many
subruns (10,000 or so). These jobs show excessive memory consumption
and get killed on the grid. Chris suggested setting

services.scheduler.fileMode: MERGE

in the fcl file. (Our subruns are atomic - we never split them.) That
greatly reduced the memory footprint, but still did not solve the
problem completely.

The attached plot memgrowth.pdf shows the memory growth of a simplified
job. The horizontal axis of the plot is time, the vertical axis is
memory use in KiB. The data points were obtained by sampling "ps"
output once per second, so the true value of the peak was
likely missed. The sampling began somewhere in the middle of the job.
This job contains only a single producer that reads a data product
from subRun and records a copy of this product in subRun. All other
data products are dropped on input. Still, there is a steady memory
growth through the duration of the job, and a large spike at the end.
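
For reference, a sampling loop of the kind described above could look like the following minimal Python sketch. It is not the script that produced the plot. The helper names (parse_rss_kib, sample_rss) are made up for illustration, and the sketch assumes a "ps" that accepts "-o rss= -p PID" and reports RSS in KiB, as on Linux.

```python
import subprocess
import time


def parse_rss_kib(ps_output):
    """Convert the text printed by `ps -o rss= -p PID` to an integer (KiB)."""
    return int(ps_output.strip())


def sample_rss(pid, n_samples, interval_s=1.0):
    """Poll the resident set size of `pid` with `ps`, one reading per interval.

    Sampling stops early once the process has exited. A peak shorter than
    the interval can fall between two readings and be missed.
    """
    samples = []
    for _ in range(n_samples):
        result = subprocess.run(
            ["ps", "-o", "rss=", "-p", str(pid)],
            capture_output=True,
            text=True,
        )
        if result.returncode != 0 or not result.stdout.strip():
            break  # the process is gone
        samples.append(parse_rss_kib(result.stdout))
        time.sleep(interval_s)
    return samples
```

Because the readings are one second apart, max(samples) is only a lower bound on the true peak.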

The size of the subrun payload does not exceed 10 kB. There is no
justification for this job to require more than 400 MiB of RSS memory.

To reproduce:

- check out Mu2e Offline v4_0_6

- put the attached file TestSRProducer_module.cc into e.g. Analyses/src

- compile

- get the attached subRunTestMerge.fcl

- run the job:

mu2e -c subRunTestMerge.fcl /mu2e/data/tdr/beam/g4s3/tdr.beam.g4s3.ds23.1006a_1025a_1025c.12192325/good/*/beamFlashOnMothers.root

Andrei

Files

- memgrowth.pdf (37.1 KB): memory use graph for the test case. Andrei Gaponenko, 11/12/2013 07:18 PM
- TestSRProducer_module.cc (2.96 KB): test producer source. Andrei Gaponenko, 11/12/2013 07:18 PM
- subRunTestMerge.fcl (596 Bytes): test job configuration. Andrei Gaponenko, 11/12/2013 07:18 PM

Related issues

Related to art - Necessary Maintenance #8427: Remove unnecessary data member from art::EventID (Closed, 04/24/2015)

Related to art - Feature #8426: Understand gradual memory growth and determine if it can be mitigated (Accepted, 04/24/2015)

Associated revisions

Revision 3998a8ec (diff)
Added by Kyle Knoepfel over 4 years ago

Implement issue #4946 - address memory spike at end of process
Move FileIndex from metadata branch to its own tree
Change file format version to 7

History

#2 Updated by Christopher Green almost 6 years ago

  • Category set to I/O
  • Status changed from New to Accepted
  • Target version set to 493
  • Start date deleted (11/12/2013)
  • Estimated time set to 8.00 h
  • SSI Package art added
  • SSI Package deleted ()

The time estimate is for the initial analysis and diagnosis only.

#3 Updated by Christopher Green over 5 years ago

  • Target version changed from 493 to 1.09.00

#4 Updated by Christopher Green over 5 years ago

  • Target version changed from 1.09.00 to 1.13.00

#5 Updated by Christopher Green almost 5 years ago

  • Target version changed from 1.13.00 to 1.14.00

Due to the elevated priority of the upcoming release, the target version for this issue has been adjusted.

#6 Updated by Kyle Knoepfel over 4 years ago

  • Tracker changed from Bug to Feature
  • Assignee set to Kyle Knoepfel
  • Priority changed from Normal to High

#7 Updated by Kyle Knoepfel over 4 years ago

  • % Done changed from 0 to 20
  • Estimated time changed from 8.00 h to 40.00 h

We have investigated this issue at length, and we believe we understand what is happening. According to an Allinea MAP profile, the memory grows linearly at a gradual rate until the end of the job, where a spike of roughly 140 MB is seen. The memory growth results from calls to TTree::Fill() and TBranch::Fill(). The default ROOT behavior, which art uses, is to impose no memory limit on any trees that are being filled. After each call to Fill(), the data are copied to the TBasket buffer, whose memory is not deallocated until the end of the process.

Gradual memory growth

[ Update: The reason for the gradual memory growth as stated here is now known to be incorrect. See #4946-9 below. ]

The gradual memory growth results from calls to TTree::Fill() at the per-event level. As there is no memory limit on the TTree objects in art, the memory grows until the end of the process. There are ways to flush/delete the cached TBasket buffers, but the current ROOT interface requires setting a maximum virtual memory size per TTree. The appropriate value for this limit depends on the expected size of the stored objects in the TTree and can thus be difficult to estimate ahead of time. Ideally we would allow only one TBasket in memory per input/output branch. However, until such a feature is available (or devisable by us), the user can specify a hard-coded limit in the FHiCL file:

outputs: {
  out1: {
    module_type : RootOutput
    fileName : "out1.root" 
    treeMaxVirtualSize : 50000000 # 50e6-byte limit
  }
}

When the virtual memory (as measured by ROOT) of the TBasket buffers reaches that limit, the buffers are flushed, and the cached buffers and associated memory are deallocated.

140 MB spike at end

The spike at the end results from writing the art::FileIndex metadata to the output file. The FileIndex contains a vector of indices that is used for accessing the events in a downstream process. We estimate that the stored number of art::FileIndex bytes per event is on the order of 50 bytes. Running over the set of files you pointed us to, that amounts to roughly 66 MB. Before the data are compressed, ROOT copies the data, bringing it up to 132 MB or so. Accounting for the compression map then brings things in line with the observed 140 MB spike. The eventual compression on the art::FileIndex, however, is a factor of 8 or so, resulting in an on-disk art::FileIndex of roughly 8 MB (or 6 bytes/event).

Note that the number of subruns is largely immaterial. This spike at the end is primarily a linear function of the number of events that are stored in the final output.
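
The numbers in the spike estimate can be cross-checked with a few lines of arithmetic. The Python sketch below uses only the figures quoted above. The event count is inferred from them rather than measured, and a decimal megabyte is assumed for round numbers.

```python
MB = 1_000_000  # decimal megabyte (assumption, for round numbers)

bytes_per_event_in_memory = 50   # estimate quoted above
file_index_in_memory = 66 * MB   # uncompressed FileIndex for this dataset
file_index_on_disk = 8 * MB      # FileIndex after compression

# Event count implied by the two in-memory figures (inferred, not measured).
n_events = file_index_in_memory // bytes_per_event_in_memory

# ROOT copies the data before compressing, so two copies coexist briefly.
spike_before_overhead = 2 * file_index_in_memory

compression_factor = file_index_in_memory / file_index_on_disk
bytes_per_event_on_disk = file_index_on_disk / n_events

print(n_events)                           # 1320000
print(spike_before_overhead // MB)        # 132
print(compression_factor)                 # 8.25
print(round(bytes_per_event_on_disk, 1))  # 6.1
```

The figures are mutually consistent: about 1.3 million events, a 132 MB double copy before the remaining overhead, and roughly 6 bytes per event on disk.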

The reason for the spike is that the entire art::FileIndex is written to disk at one time during one call to TBranch::Fill(). This is consistent with how the other metadata are stored. However, in order to suppress such a memory spike at the end, we are willing to create a new TTree object corresponding to the art::FileIndex, whose branch is of the type of entry stored in the art::FileIndex. We can then fill the branch one entry at a time, and flush/deallocate cached buffers whenever the size exceeds the expected basket size: on the order of 16 KB. The spike should then be immaterial compared to the other memory requirements of the job.
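
The effect of filling one entry at a time can be illustrated with a toy model. The Python sketch below does not use the ROOT API: BufferedWriter and both helper functions are hypothetical stand-ins that only track how many bytes are buffered before a flush. The 50-byte entry size and 16 KB basket size are the estimates quoted in this note.

```python
class BufferedWriter:
    """Toy stand-in for a branch buffer; this is not the ROOT API."""

    def __init__(self, flush_threshold):
        self.flush_threshold = flush_threshold
        self.buffered = 0  # bytes currently held in memory
        self.peak = 0      # largest value `buffered` has reached
        self.written = 0   # bytes handed off to "disk"

    def fill(self, n_bytes):
        self.buffered += n_bytes
        self.peak = max(self.peak, self.buffered)
        if self.buffered >= self.flush_threshold:
            self.flush()

    def flush(self):
        self.written += self.buffered
        self.buffered = 0


def peak_single_fill(n_entries, entry_size):
    """Whole index written in one fill: everything is buffered at once."""
    writer = BufferedWriter(flush_threshold=float("inf"))
    writer.fill(n_entries * entry_size)
    writer.flush()
    return writer.peak


def peak_per_entry_fill(n_entries, entry_size, basket_size=16 * 1024):
    """One fill per entry, flushing whenever the basket size is reached."""
    writer = BufferedWriter(flush_threshold=basket_size)
    for _ in range(n_entries):
        writer.fill(entry_size)
    writer.flush()
    return writer.peak
```

With 50-byte entries, peak_single_fill grows linearly with the number of entries, while peak_per_entry_fill stays near the basket size however many entries are written.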

The time estimate of 40 hours corresponds only to the time estimated to address the memory spike at the end. Addressing the gradual memory growth issue will take more effort.

#8 Updated by Kyle Knoepfel over 4 years ago

  • % Done changed from 20 to 70

The memory spike at the end of the process has been removed by placing the art::FileIndex inside a TTree object with TBranch type of art::FileIndex::Element. A new fileFormatVersion was necessary for this change, bumping the version from 6 to 7. art will correctly load the FileIndex in old files -- i.e. this change is backwards compatible. Note, however, that files produced with file format version 7 are not readable by older versions of art.

The total virtual memory usage for this job decreased from 462 MB to 329 MB.

We are continuing to explore the reason behind the gradual memory growth, which is partially understood at the moment.

Implemented with art:3998a8ec3ce0a78ebb3548fb9a3de4ada4444a36.

#9 Updated by Kyle Knoepfel over 4 years ago

  • Status changed from Accepted to Resolved
  • % Done changed from 70 to 100

As discussed at the stakeholders meeting yesterday, the removal of the memory spike is sufficient to consider this issue resolved. A separate issue will be opened specifically for the gradual memory growth, part of which has been difficult to assess. The analysis of that behavior up to this point is discussed in issue #8426.

#10 Updated by Christopher Green over 4 years ago

#11 Updated by Kyle Knoepfel over 4 years ago

  • Status changed from Resolved to Closed

#12 Updated by Kyle Knoepfel over 4 years ago

  • Related to Feature #8426: Understand gradual memory growth and determine if it can be mitigated added

