Bug #6643

large memory cost for simple art job in nova offline

Added by Alexander Radovic almost 6 years ago. Updated almost 6 years ago.

Status:
Closed
Priority:
High
Category:
I/O
Target version:
1.11.00
Start date:
07/18/2014
Due date:
% Done:

100%

Estimated time:
16.00 h
Spent time:
Occurs In:
Scope:
Internal
Experiment:
NOvA
SSI Package:
art
Duration:

Description

Hey there Artists,

This is particularly directed at Marc Paterno and Chris Green, but I was asked to send this email to the entire mailing list.

We have a simple art job in the calibration working group in NOvA, sumsubrunscalibjob.fcl in the Calibration package of the NOvA offline software. It is designed to sum together AttenProfiles objects:

http://nusoft.fnal.gov/nova/novasoft/doxygen/html/AttenProfiles_8h_source.html

where there are three arrays of 100 floats for each profile, five profiles for each cell in the NOvA far detector, 12288 cells per block of the detector, and 28 blocks, such that one would naively expect a maximum memory cost of:

32*12*32*28                                  # cells in the detector = 344,064
(32*12*32*28)*(100*5*3*4)                    # each cell has 5 profiles of 100 bins; each bin is 3 floats of 4 bytes
(32*12*32*28)*(100*5*3*4)/(1024*1024*1024)   # convert to GB = 1.92GB

Instead, the job will regularly use 6.4GB of memory; example valgrind massif output is attached.

We have discussed this problem with the FNAL art group before and were instructed to use saveMemoryObjectThreshold to make sure that ROOT purged its buffers more regularly. It was already set to 100MB; if I set this threshold to a very aggressive 10MB we see some small improvement (valgrind output attached), but the job still has a peak cost of 4.7GB.
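For reference, here is a minimal sketch of how that threshold is set in the job fcl; the module label, file names, and the assumption that the value is given in bytes are illustrative rather than copied from sumsubrunscalibjob.fcl:

source: {
  module_type: RootInput
  fileNames: [ "exampleAttenPC.root" ]   # placeholder input file
  saveMemoryObjectThreshold: 10000000    # ~10MB, assuming the value is in bytes
}

outputs: {
  out1: {
    module_type: RootOutput
    fileName: "summedAtten.root"         # placeholder output file
    saveMemoryObjectThreshold: 10000000  # have ROOT release the buffer for any product larger than this
  }
}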

If an artist could find the time to take a look at our code for any other memory-intensive operations, I would be very grateful.

The software I have been working with is built from the latest NOvA development release with an optimized compiler. The job file is sumsubrunscalibjob.fcl in the Calibration package of the NOvA offline software. Example files to run over can be found in:

/nova/ana/users/radovic/exampleAttenPC/*atten*root

Please let me know if you have any questions. I will attempt to find time to run the art SimpleMemoryChecker on an example job and will reply to this email with any new information.
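For anyone reproducing that check, a minimal sketch of enabling the memory report in the job fcl; the exact service name is an assumption on my part and may differ between art versions:

services: {
  SimpleMemoryCheck: {}   # assumed service name; reports peak vsize/rss and per-event growth
}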

best,

-Alexander

stansumjobval.txt (548 KB) - Alexander Radovic, 07/18/2014 12:02 PM
lowmemthreshjobval.txt (155 KB) - Alexander Radovic, 07/18/2014 12:02 PM

Related issues

Related to art - Feature #5849: Investigate memory behaviour of toy attenuation calibration jobs (Closed, 04/08/2014)

History

#1 Updated by Alexander Radovic almost 6 years ago

Output from SimpleMemoryChecker in the 10MB threshold case:

"
MemoryReport> Peak virtual size 3689.22 Mbytes
Key events increasing vsize:
[0] run: INVALID subRun: INVALID event: INVALID vsize = 0 deltaVsize = 0 rss = 0 delta 0
[0] run: INVALID subRun: INVALID event: INVALID vsize = 0 deltaVsize = 0 rss = 0 delta 0
[1] run: 14702 subRun: 2 event: 10972 vsize = 3689.22 deltaVsize = 0 rss = 3016.57 delta 0
[0] run: INVALID subRun: INVALID event: INVALID vsize = 0 deltaVsize = 0 rss = 0 delta 0
[0] run: INVALID subRun: INVALID event: INVALID vsize = 0 deltaVsize = 0 rss = 0 delta 0
[0] run: INVALID subRun: INVALID event: INVALID vsize = 0 deltaVsize = 0 rss = 0 delta 0
[0] run: INVALID subRun: INVALID event: INVALID vsize = 0 deltaVsize = 0 rss = 0 delta 0
[1] run: 14702 subRun: 2 event: 10972 vsize = 3689.22 deltaVsize = 0 rss = 3016.57 delta 0
Art has completed and will exit with status 0.
"

It did not seem terribly enlightening to me, but I thought I should attach it in case it means more to someone else. I'm not sure why it reports such a low virtual size compared to valgrind or top.

#2 Updated by Christopher Green almost 6 years ago

  • Category set to User Code
  • Status changed from New to Assigned
  • Assignee set to Christopher Green
  • Target version set to 1.11.00
  • Estimated time set to 16.00 h

We will reproduce and characterize your problem using the information you have provided, and contact you with information or questions as soon as we have something for you.

#3 Updated by Christopher Green almost 6 years ago

  • Category changed from User Code to I/O
  • SSI Package art added
  • SSI Package deleted ()

Hi

After running some tests using your suggested example on novagpvm02:

  1. First, using massif and a 10MiB saveMemoryObjectThreshold on both writing and reading (each product, as you say, is 90MiB, so this is the operative number), I achieved 3.0GiB peak memory use with the example you provided.
  2. The bulk of the remaining memory usage comes from the fact that the products in the subrun records are all read into memory together when those records are first seen, rather than individually as they are used.

Unlike event products, which are only read when they are requested by e.g. getByLabel() calls, subrun and run products are generally read as soon as the record is encountered. This is because the data must be available for reading at endSubRun() time, which may be after the input file is closed.

If 3.0GiB is still too large for your needs, there is a way to solve the problem, but it is not an elegant or robust solution, as it would involve:
  1. art providing an option to read subrun products on demand.
  2. modules reading subrun products in beginSubRun() rather than endSubRun(). This would require that they be read from a file rather than optionally being produced in the job, leading to the possibility of a mis-configured job.
  3. art taking special measures to be robust in the case of asking for a subrun product that was available, but is no longer due to the file being dropped.
  4. the requirement that all file-based subrun products be dropped by all output modules (see the sketch at the end of this comment).

The longer term solution is to re-work the concept of runs and subruns in the state machine to ensure that subruns and runs are ended before the file is closed.

Please let us know if 3.0GiB is sufficient for your needs, or if we should plan to provide the unsafe short-term solution for you to use until such time as the long-term solution is implemented.
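As a concrete illustration of point 4, each output module would carry a drop command along these lines; this is a hypothetical sketch using RootOutput's standard outputCommands syntax, with a placeholder module label rather than the real NOvA one:

outputs: {
  out1: {
    module_type: RootOutput
    fileName: "summedAtten.root"
    # Keep everything except the file-based subrun products;
    # "attencalib" is a placeholder module label.
    outputCommands: [ "keep *", "drop *_attencalib_*_*" ]
  }
}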

#4 Updated by Christopher Backhouse almost 6 years ago

We certainly sometimes see usages that exceed 4GB, which is why this is a problem for us. Perhaps the specific files Alex provided are towards the small side.

I think 1-4 sounds like the way to go, though it would definitely be nice to rework things so that run/subrun objects are read on demand by default in both beginSubRun and endSubRun.

I didn't really understand the necessity for point 4.

#5 Updated by Alexander Radovic almost 6 years ago

The files in:

/nova/ana/users/radovic/exampleAttenPC/*atten*root

are the same ones that Chris Backhouse gave me, and that we both got >4GB memory usage from, locally, even with a low memory threshold. I'm a little confused as to why they would behave for Chris Green and not for us.

Chris Green, just to check: you did use the development release of NOvA and an optimized compiler?

Anyway, I agree with Chris Backhouse: we should pull the trigger on 1-4.

cheers,
-Alexander

#6 Updated by Christopher Green almost 6 years ago

Point 4 is necessary because otherwise the subrun products would still have to be copied to the output file, which may occur after the input file had been closed. If the products had not already been read by then, that is an error.

I did 5 events from one file, and I am basing my memory-use measurement on the massif output. I can certainly do a test if you wish. I also used the default setup_nova, which I would guess is the debug version? I will rerun the tests this morning with several files and let you know. It will take a while, though, since programs are extremely slow when run under massif.

#7 Updated by Christopher Backhouse almost 6 years ago

Optimized build would be "setup_nova -b maxopt"

#8 Updated by Christopher Backhouse almost 6 years ago

So presumably right now subrun.removeCachedProduct() does nothing, because once the product is no longer cached there would be no way to get it back.
But with these changes that would start working as expected, right?

#9 Updated by Christopher Green almost 6 years ago

removeCachedProduct() does what it's supposed to, but it doesn't help because you've already read in all the subrun products (thus creating the memory spike) and now all you're doing is removing them from the cache one by one.

I will implement the art side of the four-point solution already described and test it against a private build of NOvA.

In other news, running over multiple files shows the massif-reported usage peaking at 4.7GiB, with 40% of the usage coming from the product-reading process. The ps-reported usage was somewhat higher (~6GiB), however; I believe this is due to static memory use: code text, dictionaries, other libraries, and static data structures. It appears from a product dump that at least some of the files you pointed to have only half of the 28 subrun blocks present. There is also apparently a run-level product in the input, which I drop on input to remove its effect on the footprint (see the sketch below).
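Dropping a product on input uses RootInput's standard inputCommands; a sketch of what I mean, with a placeholder module label since the real one is not recorded here:

source: {
  module_type: RootInput
  fileNames: [ "exampleAttenPC.root" ]                # placeholder
  inputCommands: [ "keep *", "drop *_runcalib_*_*" ]  # drop the run-level product by (placeholder) module label
}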

#10 Updated by Christopher Green almost 6 years ago

  • Status changed from Assigned to Resolved
  • % Done changed from 0 to 100

After more investigation, I have been able to ascertain that the art state machine attribute preventing safe delayed read of subrun and run products, namely the order of closing of input and output files, is an arbitrary historic decision, and there are no consequences to reversing that order.

That being true, I have committed code to the art repository (commit 1bc3d5d) which reverses this order and provides two new boolean FHiCL parameters to the RootInput source: delayedReadSubRunProducts and delayedReadRunProducts, both currently defaulting to false. Setting the former to true should be enough to bring the accumulation-related memory use in the job at issue down to of order one copy of the data (the copy being accumulated for the outgoing run product).

No changes will be required in NOvA code; merely the use of the delayedReadSubRunProducts option.
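A sketch of the resulting source configuration; the two boolean parameters are as committed, while the file name is a placeholder:

source: {
  module_type: RootInput
  fileNames: [ "exampleAttenPC.root" ]  # placeholder
  delayedReadSubRunProducts: true       # read subrun products on demand instead of when the subrun is first seen
  # delayedReadRunProducts: true        # the analogous option for run products; both default to false
}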

#11 Updated by Kanika Sachdev almost 6 years ago

Could we please have an art patch with this fix to test against?

Thanks,
Kanika

#12 Updated by Christopher Green almost 6 years ago

  • Status changed from Resolved to Closed

