Products written to SubRun are leaked
The attached module writes a 1GB data product in endSubRun.
Run it over a collection of files with many subruns (I'm using /nova/data/calibration/FarDetectorData/cosmics/run*/pchit_fardet_r*). With every subrun, the job's memory usage increases by 1GB, and it never falls.
If you move the code into produce() so that the product is written to every Event, then there is no such leak.
This bug is important: it makes it impossible to run the nova calibration code over our Far Detector data or MC.
#1 Updated by Christopher Green about 7 years ago
- Tracker changed from Bug to Support
- Category set to I/O
- Status changed from New to Feedback
- Assignee set to Christopher Green
- % Done changed from 0 to 90
- Experiment NOvA added
- Experiment deleted (
This is actually the intended behavior for the default running mode of art. It is not a leak per se (in the sense that access to the used resources is lost); rather, the subrun objects are cached until the output file is closed, in order to avoid fragmentation in the case where events from the same subrun are seen later.
If you can be sure that no subrun is mentioned in more than one file, then you can set the parameter services.scheduler.fileMode: MERGE, which should cause the subruns and their associated objects not to be cached for the duration of the output file. However, should a subrun be seen a second time, a second subrun record will be written to the output file.
Please let me know if this works for you.
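To make the trade-off above concrete, here is a minimal toy model in C++. It is only a sketch of the behavior as described in this comment, not art's implementation: ToyOutputFile, FileMode::Cached and FileMode::Merge are hypothetical names invented for illustration, and product sizes are tracked as plain byte counts.

```cpp
#include <cstddef>
#include <map>

// Toy model only: these names are hypothetical and are not art classes.
enum class FileMode { Cached, Merge };

class ToyOutputFile {
public:
  explicit ToyOutputFile(FileMode mode) : mode_(mode) {}

  // Called each time a subrun ends, with the size of its product in bytes.
  void endSubRun(int subrun, std::size_t productBytes) {
    if (mode_ == FileMode::Cached) {
      // Hold the product until close(), so that a subrun which reappears
      // later is still combined into a single record.
      cache_[subrun] += productBytes;
    } else {
      // Write immediately and keep nothing in memory; a repeated subrun
      // therefore produces a second record.
      ++recordsWritten_;
    }
  }

  // Closing the file writes one record per cached subrun and frees the cache.
  void close() {
    recordsWritten_ += cache_.size();
    cache_.clear();
  }

  // Bytes currently held in memory on behalf of subrun products.
  std::size_t heldBytes() const {
    std::size_t total = 0;
    for (auto const& entry : cache_) total += entry.second;
    return total;
  }

  std::size_t recordsWritten() const { return recordsWritten_; }

private:
  FileMode mode_;
  std::map<int, std::size_t> cache_;
  std::size_t recordsWritten_ = 0;
};
```

In this model, feeding N subruns to a Cached file leaves heldBytes() at N times the product size until close() is called, which is the growth reported above. A Merge file holds nothing, but writes one extra record for each subrun it sees again.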
#2 Updated by Christopher Backhouse about 7 years ago
I added that line to my leaktestjob.fcl, and now it hangs indefinitely (> several minutes) at the end of the first subrun, with 100% CPU usage.
I reduced my payload to 64MB and it still happens.
In any case, shouldn't the size of this cache be limited in some way? I imagine fragmentation is not a big problem if it's in, say, 100MB blocks.
#4 Updated by Christopher Green about 7 years ago
- % Done changed from 90 to 0
- SSI Package art added
- SSI Package deleted (
At this point I believe it is appropriate to attempt to reproduce this problem with a debugger. We need someone from NOvA based at Fermilab to sit with us and investigate this problem.
#6 Updated by Christopher Green about 7 years ago
- Status changed from Feedback to Resolved
- % Done changed from 0 to 100
During a visit from Chris B. and Gavin, we were able to ascertain that the time is being taken by ROOT to compress the data. This may be turned off per output stream with the compressionLevel parameter of the output module, or per data product by setting the compression in
<class name="art::Wrapper<arttest::CompressedIntProduct>" compression="9"/>
Note that the compression level is to be set on the entry for the wrapper, not the wrapped data product. One further note: in order to avoid affecting too much, you may need to wrap simple objects -- e.g. instead of vector<int>, you should have vector<MyStuff>, with the struct