Feature #5849

Investigate memory behaviour of toy attenuation calibration jobs

Added by Christopher Backhouse over 6 years ago. Updated about 6 years ago.

Status:
Closed
Priority:
High
Category:
I/O
Target version:
Start date:
04/08/2014
Due date:
% Done:

100%

Estimated time:
32.00 h
Spent time:
Scope:
Internal
Experiment:
-
SSI Package:
art
Duration:

Description

Hi,

I've made a greatly simplified version of some of the NOvA calibration code at /nova/app/users/bckhouse/dev_mem/Calibration/ to demonstrate some issues I've been encountering.

To set up within the novasoft framework:

source /grid/fermiapp/nova/novaart/novasvn/srt/srt.sh
export EXTERNALS=/nusoft/app/externals
source $SRT_DIST/setup/setup_novasoft.sh
cd /nova/app/users/bckhouse/dev_mem/Calibration/
srt_setup -a

The mock data product is in AttenProfiles.h, simply a ~70MB array. This is what the calibration data for a single FD block comes to.

CosmicCalib_module.cc creates 28 of these (one for each block) and puts them in an output file. That comes to 2GB, large but manageable. Of course in reality this module runs over input files, but we can run it with just:

nova -c cosmiccalibjob.fcl -n1

The output file is indeed 2GB (I turned compression off). Watching the job with "top", I see endSubRun run at a total usage of 2GB, but afterwards the usage rises to 4GB before the job ends.

Presumably the product data is copied at some point (not when I std::move it to hand it to ART; presumably when it's handed off to the ROOT output code?). I'd prefer it were copied independently in those 28 chunks rather than all in one go, but this is comprehensible.

The next step is combining the outputs from many different subruns. I'd intended to create many instances of the attenprof output file and sum them up, but actually I can demonstrate strange behaviour with only one:

nova -c sumsubrunscalibjob.fcl attenprofs_r00000001_s01.root

Memory usage rises to 4GB before endSubRun is even called. endSubRun itself raises the usage to 6GB, which makes sense since the contents of the 2GB input file are now being stored in fChannelMapProf. Then, before endRun is called, usage rises to 8GB and I get a segfault.

Why is this job so different to the earlier cosmiccalibjob? I would expect the same behaviour of 2GB usage until the file is written out, rising to 4GB for some copy that occurs there.

Memory usage of the real equivalent of these jobs is causing problems right now for calibration of approximately 1/4 of the full detector volume. In an ideal world the usage of either job shouldn't exceed about 2GB.

All the values quoted above are Resident memory. I noticed that the Virtual memory value was often much higher. Normally Resident would be the correct thing to measure, but the memory limits in place for grid jobs appear to key off the Virtual figure, and kill jobs exceeding 4GB.

copy-20140417-02.cfg (1.67 KB), Gianluca Petrillo, 04/17/2014 08:00 PM

Related issues

Related to fhicl-cpp - Bug #6052: FHiCL type traits is_int, is_uint and is_numeric misbehave under Mac OS X. Closed. 04/28/2014

Related to art - Bug #6643: large memory cost for simple art job in nova offline. Closed. 07/18/2014

History

#1 Updated by Christopher Green about 6 years ago

  • Category set to I/O
  • Status changed from New to Accepted
  • Estimated time set to 4.00 h
  • SSI Package art added
  • SSI Package deleted ()

We will look at this as soon as we can. The time estimate relates only to the initial analysis.

#2 Updated by Christopher Green about 6 years ago

  • Status changed from Accepted to Assigned
  • Assignee set to Christopher Green
  • Target version set to 1.10.00
  • % Done changed from 0 to 60
  • Estimated time changed from 4.00 h to 32.00 h
After investigation with valgrind's "massif" tool, I can report the following:
  1. In the first job, the writing of the SubRun product, the total memory usage is 6GiB. This can be reduced to 4GiB if you use a data product which is small when empty (e.g. by using std::vector<float> instead of float[]).
  2. In the second job, the reading of SubRun products and the writing of the Run product, the total memory usage is 10GiB. Again, this can be reduced to 8GiB.
The usage is as follows (where applicable):
  1. 2GiB for the set of to-be written (not yet put) data products.
  2. 2GiB for the incoming ROOT data buffers.
  3. 2GiB for the user-code SubRun products.
  4. 2GiB for the user-code accumulated Run products.
  5. 2GiB for the outgoing ROOT data buffers.

(1) can be ameliorated by designing a product with a small empty footprint. In addition, I have made a change to the art::Wrapper code in art which won't save any memory, but does replace a default construction of a data product followed by an assignment with a single copy construction, by moving an error check from the body of the art::Wrapper constructor into its initializer list.
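
Schematically, the constructor change might look like the following sketch (hypothetical code, not the actual art::Wrapper source; the throwIfNull helper is invented):

#include <memory>
#include <stdexcept>

template <typename T>
class Wrapper {
public:
  // Before: obj_ was default-constructed, the null check sat in the
  // constructor body, and the product was then assigned -- two full-size
  // constructions of a potentially huge object.
  //
  //   explicit Wrapper(std::unique_ptr<T> ptr) : obj_() {
  //     if (!ptr) throw std::runtime_error("null product");
  //     obj_ = *ptr;
  //   }

  // After: the check runs inside the initializer list, so obj_ is
  // copy-constructed directly from the pointee -- one construction only.
  explicit Wrapper(std::unique_ptr<T> ptr) : obj_(*throwIfNull(ptr)) {}

private:
  static std::unique_ptr<T> const& throwIfNull(std::unique_ptr<T> const& p) {
    if (!p) throw std::runtime_error("attempt to wrap a null product");
    return p;
  }

  T obj_;
};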

(2) can be ameliorated without change to ROOT by providing an option to drop buffers after each branch has been read. This will increase the allocation and deletion of heap memory.

(3) and (4) are irreducible.

(5) cannot be ameliorated in the current version of ROOT, but a fresh commit has been made to the patches branch for a forthcoming release which will make it possible to drop branches after writing.

With respect to the measures possible for (2) and (5):

Dropping baskets after every branch read (2) will increase memory allocations and therefore slow things down somewhat. Flushing and dropping baskets after every branch written (5), however, will additionally create lots of small baskets on the file if the product is small, which will drastically increase ROOT's metadata needs and therefore memory use. It is therefore important in any implementation of measures for reducing (2) and (5) that the measure is only applied to "large" objects i.e. those which take up an entire (likely expanded) basket.

So, to summarize:

1. You can reduce the amount of memory you use right now by:
  • changing your product data structure to be small when empty; and
  • ensuring that your product has a meaningful move constructor (see the sketch after this list). One will be generated by default unless you define a non-default copy constructor, assignment operator or destructor (rule of five), but it may not do anything different from the copy constructor if your data product's members are not themselves movable (e.g. a C-style array is not; std::vector is).

2. I can implement a non-default output file option to drop large buffers after read / write in the next release of art. The read feature will work with the current release of ROOT; the write feature will require an as-yet unreleased version of ROOT.
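
To make point 1 concrete, here is a minimal sketch of a product that is small when empty and cheap to move (AttenProfile and the sizes are illustrative, not the NOvA class):

#include <utility>
#include <vector>

struct AttenProfile {
  // float data[17500000];   // C array: ~70MB even when "empty", and a
  //                         // "move" still copies every element
  std::vector<float> data;   // empty: a few pointers' worth; move is O(1)
};

int main() {
  AttenProfile p;
  p.data.resize(17500000);        // fill with ~70MB of calibration data
  AttenProfile q = std::move(p);  // steals the buffer: no 70MB copy
  (void)q;
}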

One final note: your example as referenced above does not drop the SubRun products from the Run-level output file; you should do this to reduce the total size on disk of your Run-level output file.

#3 Updated by Christopher Backhouse about 6 years ago

Thanks for looking into this so quickly, and the detailed writeup.

I'm not sure I quite understand what all 5 memory users are. It might help to talk in the presence of a whiteboard. I'm at the lab this week and next, although I am on shift.

The data product used in practice is something like a std::map<int,float[1500]>.
Presumably that's small when empty, though I'm not sure that helps.
Will it have a reasonable default move constructor? Is there an easy way to tell?

The ability to have the buffers dropped sounds good. So, with the block-by-block division in my example it would effectively cut the usage of each of these to 1/28 (3.85G off the total)?

Don't I read the SubRun data diblock-by-diblock and immediately sum it into the Run data? Obviously 4) is irreducible, but shouldn't I in principle only have 1/28 of 3) in memory at a time? Would the read buffer dropping feature (plus new ROOT) achieve that?

We are dropping the subrun products; I must have simplified it out of my example.

In my example as written (and I believe in the real code), I realize that I accumulate into fChannelMapProf, but when it comes time to write it out I make a unique_ptr copy (cmp) only to immediately move() that to the Run. Presumably it would be better for fChannelMapProf to consist of unique_ptrs in the first place, and to hand those directly to the Run? But again, this is only done in block-sized chunks, so 27/28 of the potential wastage shouldn't be realized.

#4 Updated by Gianluca Petrillo about 6 years ago

[edit] I realized that in the case described by Christopher the new data is 2 GB, so the case I reported here and in my next comment is not in the same league. Sorry for the noise...

I have done some homework too.
A LArSoft job which runs basically only the RootOutput module manages to use 1.5 GB of memory (the input is a detsim file, which still has to go through reconstruction).
That is only to copy branches, which it should not even bother to unpack. This is related to categories 2 and 5.
Ideally, copying the original branches should take as much time and space as a cp.

Is this something that the set of patches you mention can fix?

One of the topics we discussed is that the unpacking could be needed for upgrading the version of the objects stored in the tree.
That is a feature that is "rarely" needed, and we know times when it is not needed for sure (for example, all the job sequences starting from an EmptyEvent, e.g. in MC production).
If it's not possible to establish automatically whether an upgrade would occur (which is the case, according to what I understand), then the option of forcibly disabling the upgrade, to gain the cp-like speed and size, would be priceless.

#5 Updated by Christopher Backhouse about 6 years ago

You can control whether or not ART tries to unpack/upgrade/repack products with fastCloning: true or fastCloning: false. False is the default, so you should be getting the speed improvements of not having to inspect the data, but apparently it's still all travelling through large memory buffers for you.

#6 Updated by Gianluca Petrillo about 6 years ago

I attach for reference the dump of the FCL file I used.
The input file is accessible from the MicroBooNE virtual machines (uboonegpvm0[1-6]), and I can provide a copy if needed.
I have a memory profile (massif) at /uboone/data/users/petrillo/LArSoft/develop/e4_prof/logs/copy/20140417-02/copy-20140417-02-massif.out (67MB... I overdid a bit with massif settings).

Christopher: I remember Chris talking about that option. A quick glance at art/Framework/IO/Root/RootOutput_module.cc suggests to me that fastCloning is actually enabled by default... I am not sure how to take it.

#7 Updated by Christopher Backhouse about 6 years ago

Sorry, yes, "true".

#8 Updated by Christopher Green about 6 years ago

I'm not sure I quite understand what all 5 memory users are. It might help to talk in the presence of a whiteboard. I'm at the lab this week and next, although I am on shift.

1. 2GiB for the set of to-be written (not yet put) data products.

This is the art::Wrapper<Obj> which will receive the object that you put().

2. 2GiB for the incoming ROOT data buffers.

Data come off disk into this buffer, for possible decompression.

3. 2GiB for the user-code SubRun products.

The object that you create.

4. 2GiB for the user-code accumulated Run products.

The object that you create.

5. 2GiB for the outgoing ROOT data buffers.

This is the staging point for the data in your put() object, prior to possible compression.

The ROOT buffers are for blocking operations: a given buffer (TBasket) may hold multiple (small) objects for a given branch over multiple entries, and only be written when full.

The data product used in practice is something like a std::map<int,float[1500]>.
Presumably that's small when empty, though I'm not sure that helps.
Will it have a reasonable default move constructor? Is there an easy way to tell?

Actually, a std::map is one of the most inefficient things (for both space and speed) that you can store in a ROOT file: almost anything is better, although std::map (all the STL containers, in fact) does have a useful move constructor (and move assignment). However, I see from your example that you're using the std::map as temporary storage only, so the point is moot except for future reference. I incorporated a changed example as an art test: there, the SubRun- and Run-level products are identical and written out block by block, i.e. there are 28 Run-level products. The accumulation structure in my test is in fact a std::vector<std::unique_ptr<block_data>>. The std::map's move constructor doesn't help much in your example, however, in the face of the non-movable data in your object itself.
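
The movability point is worth a sketch (hypothetical payload types; a raw array can't be a mapped type directly, so it is wrapped in a struct here):

#include <map>
#include <utility>
#include <vector>

struct ArrayPayload {
  float vals[1500];         // a "move" of this still copies all 1500 floats
};

struct VectorPayload {
  std::vector<float> vals;  // a move steals the heap buffer in O(1)
};

int main() {
  // Moving the map itself is always cheap: the tree's root pointer is
  // transferred regardless of the mapped type.
  std::map<int, ArrayPayload> m1;
  m1[0] = ArrayPayload{};
  std::map<int, ArrayPayload> m2 = std::move(m1);

  // But moving an individual payload is only cheap when its members are
  // movable: VectorPayload moves in O(1); ArrayPayload degenerates to a copy.
  VectorPayload v1;
  v1.vals.resize(1500);
  VectorPayload v2 = std::move(v1);
  (void)m2; (void)v2;
}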

The ability to have the buffers dropped sounds good. So, with the block-by-block division in my example it would effectively cut the usage of each of these to 1/28 (3.85G off the total)?

That's the plan.

Don't I read the SubRun data diblock-by-diblock and immediately sum it into the Run data? Obviously 4) is irreducible, but shouldn't I in principle only have 1/28 of 3) in memory at a time? Would the read buffer dropping feature (plus new ROOT) achieve that?

Once data have been read into the art::Event (or art::SubRun, or ...), they hang around for the entire <unit>: they might be used by a different module. It's conceivable that we could provide (e.g.) a Handle::reset() to force the memory to be cleared and re-read from disk if required, though like the buffer deletion it must be used judiciously.

We are dropping the subrun products; I must have simplified it out of my example.

Fair enough. I added the drop command for the test.

In my example as written (and I believe in the real code), I realize that I accumulate into fChannelMapProf, but when it comes time to write it out I make a unique_ptr copy (cmp) only to immediately move() that to the Run. Presumably it would be better for fChannelMapProf to consist of unique_ptrs in the first place, and to hand those directly to the Run? But again, this is only done in block-sized chunks, so 27/28 of the potential wastage shouldn't be realized.

That's a reasonable optimization, and what I did in the test.
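
The pattern might look like this sketch (BlockProfile and Accumulator are invented names, and the Run is abstracted behind a template to avoid guessing the exact art signature):

#include <cstddef>
#include <memory>
#include <utility>
#include <vector>

struct BlockProfile {
  std::vector<float> channels;
};

class Accumulator {
public:
  Accumulator() : fBlocks(28) {}  // one slot per FD block

  // Sum one subrun's data for block i directly into the owned product.
  void addSubRunBlock(std::size_t i, BlockProfile const& sr) {
    if (!fBlocks[i]) fBlocks[i] = std::make_unique<BlockProfile>(sr);
    else { /* ... element-wise sum of sr into *fBlocks[i] ... */ }
  }

  // At endRun, hand each block straight to the Run: no intermediate
  // unique_ptr copy of the payload is ever made.
  template <typename Run>
  void writeRun(Run& run) {
    for (auto& b : fBlocks) run.put(std::move(b));
  }

private:
  std::vector<std::unique_ptr<BlockProfile>> fBlocks;  // owned from the start
};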

#9 Updated by Christopher Green about 6 years ago

  • Tracker changed from Bug to Feature
  • % Done changed from 60 to 80

I have implemented art::{Event,SubRun,Run}::removeCachedProduct(art::Handle<T> & h) to drop a cached product from memory. If a downstream module requires the data it will be re-read from the file, obviously at a time and memory allocation cost. This has been tested and the memory saving verified with massif.
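
Usage might look like the following sketch (the "cosmiccalib" label and the summation step are invented; AttenProfiles is the toy product from this issue's example, and only Handle, getByLabel and removeCachedProduct are the art names discussed here):

#include "art/Framework/Principal/Handle.h"
#include "art/Framework/Principal/SubRun.h"
#include "AttenProfiles.h"  // the toy product from this issue's example

void sumAndRelease(art::SubRun& sr, AttenProfiles& runTotal) {
  art::Handle<AttenProfiles> h;
  sr.getByLabel("cosmiccalib", h);

  // ... element-wise sum of *h into runTotal ...

  // Done with the cached copy: release its ~2GB now instead of holding it
  // for the rest of the subrun. On success the handle is cleared; a later
  // module needing the product triggers a re-read from the file.
  if (!sr.removeCachedProduct(h)) {
    // false: the product was produced in memory this job, not read from file
  }
}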

The write baskets may be optionally dropped based on an object size threshold (this feature will automatically start to work when the fixed ROOT release comes out).

It still remains to implement an analogous feature for the ROOT read baskets.

#10 Updated by Christopher Green about 6 years ago

  • Status changed from Assigned to Resolved
  • % Done changed from 80 to 100

This issue is resolved with 260d69b.

Expect a Wiki article on this, but the TL;DR is:

  • Design products that can be moved efficiently.
  • Avoid large data products if possible, as dealing with them will always be a trade-off between memory use and performance.
  • Products read from disk may be removed from cache after use with bool art::{Event,SubRun,Run}::removeCachedProduct(art::Handle<T> & h). The handle will be cleared if the product is removed from cache, and left untouched otherwise. If the product is used in a subsequent module it will be re-read from the file. The function returns true if the product has been removed from cache; false if the product was produced in this job rather than read from file, and therefore not removed from cache.
  • Add the following FHiCL parameter to either the RootInput or relevant RootOutput configuration parameter sets:
    saveMemoryObjectThreshold: <nbytes>
  • Make sure to drop unwanted products from the output.

When all of these features are activated / advice followed, the SubRun-level writing job will take 2 + ε GiB instead of 6 GiB, and the accumulation job will take 2 + 3ε GiB instead of 10 GiB. I will reiterate, however, that the dropping and reallocation of memory will certainly have a performance cost, and the dropping of write baskets in particular has implications for the efficiency of storage in the file and the reading, writing and compression performance.

#11 Updated by Marc Paterno about 6 years ago

  • Status changed from Resolved to Closed

#12 Updated by Christopher Green about 6 years ago

  • Target version changed from 1.10.00 to 1.09.03

#13 Updated by Christopher Backhouse about 6 years ago

I implemented these points in my toy at /nova/app/users/bckhouse/dev_mem/ and both jobs now max out at 2.1G. Thanks!


