Support #4492

Products written to SubRun are leaked

Added by Christopher Backhouse about 7 years ago. Updated about 7 years ago.

Status: Closed
Priority: High
Category: I/O
Target version: -
Start date: 08/02/2013
Due date:
% Done: 100%
Estimated time:
Spent time:
Scope: Internal
Experiment: NOvA
SSI Package: art
Duration:
Description

The attached module writes a 1GB data product in endSubRun.

Run it over a collection of files with many subruns (I'm using /nova/data/calibration/FarDetectorData/cosmics/run*/pchit_fardet_r*). With every subrun, the job's memory usage increases by 1GB and never falls.
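
For reference, a job configuration along roughly these lines reproduces the pattern (this is a hypothetical sketch, not the attached leakjob.fcl; the module label, path names, and file names are illustrative):

# Hypothetical reproducer job; the actual leakjob.fcl is attached below.
process_name: LeakTest

source:
{
  module_type: RootInput
  fileNames:   [ "pchit_fardet_r00012345_s01.root" ]   # illustrative input file
}

physics:
{
  producers:
  {
    leak: { module_type: Leak }   # the attached producer
  }
  p1: [ leak ]
  e1: [ out ]
  trigger_paths: [ p1 ]
  end_paths:     [ e1 ]
}

outputs:
{
  out:
  {
    module_type: RootOutput
    fileName:    "leaktest.root"
  }
}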

If you move the code into produce() so that the product is written to every Event, then there is no such leak.

This bug is important: it makes it impossible to run the NOvA calibration code over our Far Detector data or MC.

Leak_module.cc (1.31 KB) Christopher Backhouse, 08/02/2013 09:06 PM
Leak.fcl (57 Bytes) Christopher Backhouse, 08/02/2013 09:06 PM
leakjob.fcl (372 Bytes) Christopher Backhouse, 08/02/2013 09:06 PM

History

#1 Updated by Christopher Green about 7 years ago

  • Tracker changed from Bug to Support
  • Category set to I/O
  • Status changed from New to Feedback
  • Assignee set to Christopher Green
  • % Done changed from 0 to 90
  • Experiment NOvA added
  • Experiment deleted (-)

This is actually the intended behavior for art's default running mode. It is not a leak per se (where access to the used resources is lost): the subrun objects are cached until the output file is closed, in order to avoid fragmentation in the case where events from the same subrun are seen later.

If you can be sure that no subrun appears in more than one input file, you can set the parameter services.scheduler.fileMode: MERGE, which should prevent subruns and their associated objects from being cached for the duration of the output file. However, should a subrun be seen a second time, a second subrun record will be written to the output file.
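
For concreteness, the setting goes in the scheduler block of the services configuration; a minimal sketch, showing only the relevant parameter:

services:
{
  scheduler:
  {
    fileMode: MERGE
  }
}

Appending the single override line services.scheduler.fileMode: MERGE at the end of the job configuration should have the same effect.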

Please let me know if this works for you.

#2 Updated by Christopher Backhouse about 7 years ago

I added that line to my leaktestjob.fcl, and now it hangs at the end of the first subrun (for more than several minutes, apparently indefinitely) with 100% CPU usage.

I reduced my payload to 64MB and it still happens.

In any case, shouldn't the size of this cache be limited in some way? I imagine fragmentation is not a big problem if it's in, say, 100MB blocks.

#3 Updated by Christopher Backhouse about 7 years ago

With a very small payload there is no such hang. The turning point seems to be around 10MB.

#4 Updated by Christopher Green about 7 years ago

  • % Done changed from 90 to 0
  • SSI Package art added
  • SSI Package deleted ()

At this point I believe it is appropriate to attempt to reproduce the problem under a debugger. We need someone from NOvA based at Fermilab to sit with us and investigate.

#5 Updated by Christopher Green about 7 years ago

Can we arrange to do this early next week, Monday the 19th or Tuesday the 20th of August? If Chris B. is not available, Gavin has offered to demonstrate the problem. Please let us know what times are available.

#6 Updated by Christopher Green about 7 years ago

  • Status changed from Feedback to Resolved
  • % Done changed from 0 to 100

During a visit from Chris B. and Gavin, we were able to ascertain that the time is being taken by ROOT compressing the data. Compression may be turned off per output stream with the compressionLevel parameter of the output module, or per data product by setting the compression attribute in classes_def.xml, viz.:

<class name="art::Wrapper<arttest::CompressedIntProduct>" compression="9"/>
Note that the compression attribute is set on the entry for the wrapper, not for the wrapped data product. One further note: to avoid affecting more than intended, you may need to wrap simple objects -- e.g. instead of vector<int>, use vector<MyStuff>, where the struct MyStuff contains an int.
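
For the per-stream route, a sketch of the corresponding output-module configuration (the output label and file name here are illustrative):

outputs:
{
  out:
  {
    module_type:      RootOutput
    fileName:         "calib_out.root"   # illustrative
    compressionLevel: 0                  # 0 asks ROOT for no compression on this stream
  }
}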

#7 Updated by Christopher Green about 7 years ago

  • Status changed from Resolved to Closed

