Feature #4946
art memory use: multiple subruns
Description
Hello,
Some Mu2e jobs need to process input datasets containing multiple
subruns (10,000 or so). These jobs show excessive memory consumption
and get killed on the grid. Chris suggested setting
services.scheduler.fileMode: MERGE
in the fcl file. (Our subruns are atomic - we never split them.) That
greatly reduced the memory footprint, but did not solve the problem
completely.
The attached plot memgrowth.pdf shows the memory growth of a simplified
job. The horizontal axis of the plot is time, the vertical axis is
memory use in KiB. The data points were obtained by sampling "ps"
output once per second, so the true peak value was likely missed. The
sampling began somewhere in the middle of the job.
This job contains only a single producer that reads a data product
from subRun and records a copy of this product in subRun. All other
data products are dropped on input. Still, there is a steady memory
growth through the duration of the job, and a large spike at the end.
The size of the subrun payload does not exceed 10 kB. There is no
justification for this job to require more than 400 MiB of RSS memory.
To reproduce:
- check out Mu2e Offline v4_0_6
- put the attached file TestSRProducer_module.cc into e.g. Analyses/src
- compile
- get the attached subRunTestMerge.fcl
- run the job:
mu2e -c subRunTestMerge.fcl /mu2e/data/tdr/beam/g4s3/tdr.beam.g4s3.ds23.1006a_1025a_1025c.12192325/good/*/beamFlashOnMothers.root
Andrei
Related issues
Associated revisions
History
#1 Updated by Andrei Gaponenko about 7 years ago
- File subRunTestMerge.fcl subRunTestMerge.fcl added
#2 Updated by Christopher Green about 7 years ago
- Category set to I/O
- Status changed from New to Accepted
- Target version set to 493
- Start date deleted (11/12/2013)
- Estimated time set to 8.00 h
- SSI Package art added
- SSI Package deleted
The time estimate is for the initial analysis and diagnosis only.
#3 Updated by Christopher Green almost 7 years ago
- Target version changed from 493 to 1.09.00
#4 Updated by Christopher Green over 6 years ago
- Target version changed from 1.09.00 to 1.13.00
#5 Updated by Christopher Green almost 6 years ago
- Target version changed from 1.13.00 to 1.14.00
Due to the elevated priority of the upcoming release, the target version for this issue has been adjusted.
#6 Updated by Kyle Knoepfel almost 6 years ago
- Tracker changed from Bug to Feature
- Assignee set to Kyle Knoepfel
- Priority changed from Normal to High
#7 Updated by Kyle Knoepfel almost 6 years ago
- % Done changed from 0 to 20
- Estimated time changed from 8.00 h to 40.00 h
We have investigated this issue at length, and we believe we understand what
is happening. According to an Allinea MAP profile, the memory grows linearly
at a gradual rate until the end of the job, where a spike of roughly 140 MB
is seen. The memory growth results from calls to TTree::Fill() and
TBranch::Fill(). The default ROOT behavior, which art uses, is to impose no
memory limit on any trees that are being filled. After each call to Fill(),
a copy of the data is made to the TBasket buffer, the memory of which is not
deallocated until the end of the process.
Gradual memory growth
[ Update: The reason for the gradual memory growth as stated here is now known to be incorrect. See #4946-9 below. ]
The gradual memory growth results from calls to TTree::Fill() at the
per-event level. As there is no memory limit on the TTree objects in art,
the memory grows until the end of the process. There are ways to
flush/delete the cached TBasket buffers, but the current ROOT interface
requires setting a maximum virtual memory size per TTree. The appropriate
value for this limit depends on the expected size of the stored objects in
the TTree and can thus be difficult to estimate ahead of time. Ideally we
would allow only one TBasket in memory per input/output branch. However,
until such a feature is available (or devisable by us), the user can specify
a hard-coded limit in the FHiCL file:
outputs: {
  out1: {
    module_type        : RootOutput
    fileName           : "out1.root"
    treeMaxVirtualSize : 50000000  # 50e6-byte limit
  }
}
When the virtual memory (as measured by ROOT) of the TBasket buffers
reaches that limit, the buffers are flushed and the cached buffers and
associated memory are deallocated.
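For illustration, here is a minimal standalone ROOT sketch (not art code) of
the underlying mechanism: TTree::SetMaxVirtualSize() imposes the kind of
per-tree memory cap that treeMaxVirtualSize presumably maps onto, so that
baskets are flushed to disk once the cap is exceeded. The file, tree, and
branch names and the payload are invented for the example.

// Standalone ROOT sketch of a per-tree memory cap; not the art implementation.
#include "TFile.h"
#include "TTree.h"

int main() {
  TFile f("cap_sketch.root", "RECREATE");
  TTree tree("Events", "illustrative per-event tree");

  double payload = 0.;                           // stand-in for a per-event product
  tree.Branch("payload", &payload, "payload/D");

  // Cap the in-memory basket buffers at ~50e6 bytes, mirroring the
  // treeMaxVirtualSize value in the FHiCL example above.
  tree.SetMaxVirtualSize(50000000);

  for (long i = 0; i < 1000000; ++i) {
    payload = static_cast<double>(i);
    tree.Fill();                                 // baskets are flushed once the cap is hit
  }

  tree.Write();
  f.Close();
  return 0;
}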
140 MB spike at end
The spike at the end results from writing the art::FileIndex metadata to
the output file. The FileIndex contains a vector of indices that is used for
accessing the events in a downstream process. We estimate that the number of
art::FileIndex bytes stored per event is on the order of 50. Running over
the set of files you pointed us to, that amounts to roughly 66 MB. Before
the data are compressed, ROOT copies the data, bringing the total up to
132 MB or so. Accounting for the compression map then brings things in line
with the observed 140 MB spike. The eventual compression on the
art::FileIndex, however, is a factor of 8 or so, resulting in an on-disk
art::FileIndex of roughly 8 MB (or 6 bytes/event).
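Spelling out the arithmetic (the event count here is inferred from the quoted
50 bytes/event and 66 MB figures, not measured independently):

\[
50\ \tfrac{\text{B}}{\text{event}} \times \sim\!1.3\times10^{6}\ \text{events}
\approx 66\ \text{MB},
\qquad
2 \times 66\ \text{MB} \approx 132\ \text{MB (uncompressed copy)},
\]
\[
\frac{66\ \text{MB}}{8} \approx 8\ \text{MB on disk}
\approx 6\ \tfrac{\text{B}}{\text{event}}.
\]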
Note that the number of subruns is largely immaterial. This spike at the end is primarily a linear function of the number of events that are stored in the final output.
The reason for the spike is that the entire art::FileIndex is written to
disk at one time during one call to TBranch::Fill(). This is consistent with
how the other metadata are stored. However, in order to suppress such a
memory spike at the end, we are willing to create a new TTree object
corresponding to the art::FileIndex, whose branch is of the type of entry
stored in the art::FileIndex. We can then fill the branch one entry at a
time, and flush/deallocate cached buffers whenever the size exceeds the
expected basket size: on the order of 16 KB. The spike should then be
immaterial compared to the other memory requirements of the job.
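A rough standalone sketch of that approach follows (the entry layout and
names are placeholders, not the actual art::FileIndex::Element definition):
write the index as its own tree, one Fill() per entry, with a ~16 KB basket
and periodic flushing, so that only a bounded part of the index stays in
memory at any time.

// Standalone ROOT sketch: fill a file-index tree one entry at a time.
// IndexEntry is a hypothetical stand-in for art::FileIndex::Element.
#include "TFile.h"
#include "TTree.h"

struct IndexEntry {
  int run;
  int subRun;
  long long event;
  long long entry;   // position of the record in the data tree
};

int main() {
  TFile f("index_sketch.root", "RECREATE");
  TTree tree("FileIndex", "one record per index entry");

  IndexEntry e{};
  // Leaf-list branch with a ~16 KB basket size.
  tree.Branch("element", &e, "run/I:subRun/I:event/L:entry/L", 16 * 1024);
  tree.SetAutoFlush(1000);   // flush baskets to disk every 1000 entries

  for (long long i = 0; i < 1000000; ++i) {
    e = IndexEntry{1, static_cast<int>(i / 1000 + 1), i, i};
    tree.Fill();             // one small entry at a time, not one big object
  }

  tree.Write();
  f.Close();
  return 0;
}

In the real change a dictionary-backed branch of the element type would be
used rather than a leaf list; the leaf list just keeps the sketch
self-contained.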
The 40-hour estimate corresponds only to addressing the memory spike at the end. Addressing the gradual memory growth will take more effort.
#8 Updated by Kyle Knoepfel over 5 years ago
- % Done changed from 20 to 70
The memory spike at the end of the process has been removed by placing the
art::FileIndex inside a TTree object with a TBranch of type
art::FileIndex::Element. A new fileFormatVersion was necessary for this
change, bumping the version from 6 to 7. art will correctly load the
FileIndex in old files -- i.e. this change is backwards compatible. Note,
however, that files produced with file format version 7 are not readable by
older versions of art.
The total virtual memory usage for this job decreased from 462 MB to 329 MB.
We are continuing to explore the reason behind the gradual memory growth, which is partially understood at the moment.
Implemented with art:3998a8ec3ce0a78ebb3548fb9a3de4ada4444a36.
#9 Updated by Kyle Knoepfel over 5 years ago
- Status changed from Accepted to Resolved
- % Done changed from 70 to 100
As discussed at the stakeholders meeting yesterday, the removal of the memory spike is sufficient to consider this issue resolved. An additional issue will be opened that corresponds specifically to the gradual memory growth, part of which has been difficult to assess. The analysis of that behavior up to this point is discussed in issue #8426.
#10 Updated by Christopher Green over 5 years ago
- Related to Necessary Maintenance #8427: Remove unnecessary data member from art::EventID added
#11 Updated by Kyle Knoepfel over 5 years ago
- Status changed from Resolved to Closed
#12 Updated by Kyle Knoepfel over 5 years ago
- Related to Feature #8426: Understand gradual memory growth and determine if it can be mitigated added
Associated revision: Implement issue #4946 - address memory spike at end of process
- Move FileIndex from metadata branch to its own tree
- Change file format version to 7