Reco1 memory leak
While doing some testing with the latest version of LArSoft, a memory leak seems to be appearing in the reco1 stage.
Tested with v06_38_00 (and also v06_37_00 yesterday). Testing area can be found here: /uboone/app/users/lorena/v06_38_00_clean
This is a clean LArSoft v06_38_00 installation, which was used to create the input file as follows:
lar -c prodgenie_bnb_nu_uboone.fcl -n 2 (output file is: prodgenie_bnb_nu_uboone_20170606T133725_gen.root)
lar -c standard_g4_uboone.fcl -n 2 prodgenie_bnb_nu_uboone_20170606T133725_gen.root
lar -c standard_detsim_uboone.fcl -n 2 prodgenie_bnb_nu_uboone_20170606T133725_gen_20170606T133816_g4.root
The next stage,
lar -c reco_uboone_mcc7_driver_stage1.fcl -n 2 prodgenie_bnb_nu_uboone_20170606T133725_gen_20170606T133816_g4_20170606T134253_detsim.root
shows a problem and jobs are killed when running over ~10-20 events due to the memory they are using (more than 15GB).
To get further information I ran it under valgrind for those 2 events, and the output can be found at /uboone/app/users/lorena/v06_38_00_clean/valgrind_reco_uboone_mcc7_driver_stage1_fnal.txt and also at http://www.hep.phy.cam.ac.uk/~escudero/valgrind_reco_uboone_mcc7_driver_stage1_fnal.txt
The valgrind leak summary at the end shows a significant memory leak:
28497 LEAK SUMMARY:
28497 definitely lost: 821,585 bytes in 12,612 blocks
28497 indirectly lost: 787,094,708 bytes in 597,637 blocks
28497 possibly lost: 286,529,744 bytes in 30,981 blocks
28497 still reachable: 196,354,311 bytes in 240,301 blocks
28497 suppressed: 0 bytes in 0 blocks
Together with lots of:
Conditional jump or move depends on uninitialised value(s)
I understand that some of these are related to ROOT internals, but this seems like a large leak after only 2 events. It causes problems when running on local machines, and it is making the jobs crash after ~10 events.
#2 Updated by Marc Paterno over 3 years ago
You may find the instructions at https://cdcvs.fnal.gov/redmine/projects/art/wiki/Getting_started_with_valgrind helpful when using Valgrind. In particular, a huge number of spurious complaints coming from ROOT code can be suppressed by passing the supplied "suppressions" file to valgrind with its --suppressions= flag.
#4 Updated by Lorena Escudero sanchez over 3 years ago
I did try the ROOT suppression option, but valgrind doesn't seem to accept the suppression file it points to:
valgrind --suppressions=/grid/fermiapp/products/larsoft/root/v6_08_06d/Linux64bit+2.6-2.12-e14-nu-prof/etc/valgrind-root.supp --leak-check=full --leak-resolution=high --num-callers=40 lar -c reco_uboone_mcc7_driver_stage1.fcl -n 2 prodgenie_bnb_nu_uboone_20170606T123056_gen_20170606T123207_g4_20170606T123900_detsim.root > valgrind_supp.txt 2>&1
20315 Memcheck, a memory error detector
20315 Copyright (C) 2002-2012, and GNU GPL'd, by Julian Seward et al.
20315 Using Valgrind-3.8.1 and LibVEX; rerun with -h for copyright info
20315 Command: lar -c reco_uboone_mcc7_driver_stage1.fcl -n 2 prodgenie_bnb_nu_uboone_20170606T123056_gen_20170606T123207_g4_201\
location should be "...", or should start with "fun:" or "obj:"
20315 FATAL: in suppressions file "/grid/fermiapp/products/larsoft/root/v6_08_06d/Linux64bit+2.6-2.12-e14-nu-prof/etc/valgrind-root.supp" near line 25:
20315 location should be "...", or should start with "fun:" or "obj:"
20315 exiting now.
#6 Updated by Gianluca Petrillo over 3 years ago
- Status changed from Assigned to Resolved
- % Done changed from 0 to 100
Thank you for the precise report.
Although I took extra steps to confirm it (running valgrind's massif tool), the memcheck log you posted already contained the culprit.
It turns out that some calls to TVirtualFFT::FFT() used the option "K" (not sure about the reason). Without that option, ROOT creates a new FFT object (if needed), stores it as the global one and manages it. With that option, instead, the new FFT object is handed to the user, who is responsible for its management, and the global FFT is not touched.
So, in those cases, the FFT object needed to be deleted after use. This is in fact documented in ROOT.
This was in uboonecode:source:uboone/CalData/NoiseFilterAlgs/RawDigitFFTAlg.cxx, called by MicroBooNE module uboonecode:source:uboone/CalData/RawDigitFilterUBooNE_module.cc; the fix is now in commit uboonecode:3598a8288fa0be300869c722a32136a1e9a2b26b.
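The ownership rule described above can be illustrated with a small self-contained sketch. It does not use ROOT: MockFFT and makeCallerOwnedFFT() are hypothetical stand-ins that only imitate the "caller owns the returned object" contract of the "K" option, and the std::unique_ptr variant is one possible way to honour that contract, not necessarily what the commit does.

```cpp
#include <cassert>
#include <memory>

// Hypothetical stand-in for an FFT engine; this is NOT ROOT's TVirtualFFT.
// It only counts live instances so that a leak becomes visible.
struct MockFFT {
    static int live;                       // instances currently alive
    MockFFT()  { ++live; }
    ~MockFFT() { --live; }
    MockFFT(const MockFFT&) = delete;
    MockFFT& operator=(const MockFFT&) = delete;
};
int MockFFT::live = 0;

// Hypothetical factory imitating the "K"-style contract: every call
// returns a new object that the caller is responsible for deleting.
MockFFT* makeCallerOwnedFFT() { return new MockFFT; }

// Leaky pattern: one object is abandoned for every processed raw digit.
void processDigitsLeaky(int nDigits) {
    for (int i = 0; i < nDigits; ++i) {
        MockFFT* fft = makeCallerOwnedFFT();
        (void)fft;                         // the transform would run here
        // missing: delete fft;
    }
}

// Fixed pattern: std::unique_ptr deletes the object at the end of
// each loop iteration, so nothing accumulates.
void processDigitsFixed(int nDigits) {
    for (int i = 0; i < nDigits; ++i) {
        std::unique_ptr<MockFFT> fft(makeCallerOwnedFFT());
        assert(fft != nullptr);            // the transform would run here
    }
}
```

After processDigitsFixed(n) returns, MockFFT::live is back at its previous value, whereas processDigitsLeaky(n) raises it by n. That growth per raw digit is the kind of accumulation the "indirectly lost" figure in the valgrind summary points to.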
#7 Updated by Gianluca Petrillo over 3 years ago
Note that the same issue appears in uboonecode:source:uboone/DetSim/SimWireMicroBooNE_module.cc. There the FFT object is never deleted (bad), but it is also not recreated at every raw digit, which prevents a serious memory leak.
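A minimal sketch of the "create once, reuse, and still clean up" pattern follows. Again this is not ROOT code and not the actual module: CountingFFT, makeOwnedFFT() and DigitProcessor are hypothetical names used only for illustration. Holding the object in a std::unique_ptr member keeps the single-creation behaviour and also removes the missing delete.

```cpp
#include <memory>

// Hypothetical stand-in for a caller-owned FFT object (not ROOT's class).
// It counts how many instances have ever been created.
struct CountingFFT {
    static int created;
    CountingFFT() { ++created; }
};
int CountingFFT::created = 0;

// Hypothetical factory: returns a new object owned by the caller.
CountingFFT* makeOwnedFFT() { return new CountingFFT; }

// Hypothetical processor that builds its FFT object lazily, exactly once,
// and reuses it for every raw digit.
class DigitProcessor {
public:
    void processDigit() {
        if (!fFFT) fFFT.reset(makeOwnedFFT());  // first call only
        // the transform using *fFFT would run here
    }
private:
    std::unique_ptr<CountingFFT> fFFT;  // deleted when the processor is destroyed
};
```

Calling processDigit() many times on one DigitProcessor creates a single CountingFFT, and that object is released together with the processor.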