Memory deallocation error running MicroBooNE reconstruction
Reported via e-mail by Herbert Greenlee:
In testing mcc8, I am observing a crash in about 1% of events in the destructor of the sparse_vector owned by recob::Wire. The crash is due to a bad deallocate.
Steps to reproduce.
I am running off of base release larsoft v06_26_01_02. I have checked out uboonecode branch v06_26_01_br and larreco branch v06_26_01_01_branch, each with some local updates, which may or may not be affecting the crash.
I do not have lardataobj checked out.
My source area is here:
A tarball containing compiles binaries is here:
Here are two crashing events:
#2 Updated by Herbert Greenlee about 2 years ago
Latest fcl file is reco_uboone_data_mcc8_driver_stage1and2_test4.fcl
in directory /uboone/app/users/greenlee/mcc8rel/srcs/uboonecode/fcl/reco
Crashing files from with this fcl file are as follows.
#3 Updated by Gianluca Petrillo about 2 years ago
- Category set to Reconstruction
- Status changed from New to Assigned
- Assignee set to Gianluca Petrillo
So far, I failed to reproduce the crash with a
debug version using the public
v06_26_01_br branch and the FHiCL file specified, as copied from its location, running on the first input file in the first report and on the first file of note 2.
I still have to try with a profiling version, and then with the full working area specified in the report, before claiming irreproducibility.
#4 Updated by Gianluca Petrillo about 2 years ago
- Status changed from Assigned to Feedback
I could not reproduce the problem using either:
as input files and a working area with only
v06_26_01_br), running the FHiCL file copied from
/uboone/app/users/greenlee/mcc8rel/srcs/uboonecode/fcl/reco/reco_uboone_data_mcc8_driver_stage1and2_test4.fclon a SCD machine (
Please publish feature branches with the complete code reproducing the issue. That will also give me a place where to push fixes.
#5 Updated by Herbert Greenlee about 2 years ago
I will also post this message to issue 16778.
I think the reason you could not reproduce this crash is because you
didn't have all of the local patches and upates from my test release. As
you requested, I have now pushed these, along with all job fcl files, to
the following branches in the following git repositories.
If you check out these two branches and build against base release larsoft
v06_26_01_02, you should have the identical code as is crashing for me.
I have a set of crashing input files from my latest set of tests using job
fcl file reco_uboone_data_mcc8_driver_stage1and2_test6.fcl
I verified that the first file crashes for me reliably interactively for
both prof and debug build. Here is the error message.
================================================================================================ ==== MemoryTracker General SUMMARY (all numbers in units of Mbytes) Peak virtual memory usage (VmPeak) : 2258.38 Mbytes Peak resident set size usage (VmHWM): 1476.57 Mbytes ProcessStep Module ID Vsize RSS ================================================================================================ ====== Module Construction out1:RootOutput 0 0.121 Module Construction rns:RandomNumberSaver 0 0.090 . . . Module Construction pmtrackT0RecoLoose:T0RecoAnodeCathodePiercing 0 0 Aborted
#7 Updated by Gianluca Petrillo about 2 years ago
So far, I have collected the following evidence:
- the crash is caused within
- the crash that happens when using as input the raw digits from noise removal does not happen when using the original raw digits from
- no crash happens when using
- valgrind/memcheck does not expose any foul play
#8 Updated by Gianluca Petrillo about 2 years ago
Bisection revealed that the crash does not happen in uboonecode:39607de940f656226c8dad07fbc90f78ab5c0794, does happen in uboonecode:3cc0b0a511e697a8a5878883c49302553f54c381 .
Unfortunately the three commits in between are not compilable.
Working on making them compilable.
#9 Updated by Gianluca Petrillo about 2 years ago
- Experiment MicroBooNE added
- Experiment deleted (
The crash happens only when using
BaselineMostProbAve baseline algorithm.
An unrelated small bug prevented
BaselineStandard algorithm from being selected; that has been fixed in commit:ce9854a9e65d4045f65dce9adc0b3f8abc16641d .
I don't know yet if
BaselineMostProbAve is bad by itself or if it just triggers something else.
#11 Updated by Gianluca Petrillo about 2 years ago
- Assignee changed from Gianluca Petrillo to Tracy Usher
I have pushed an updated version of
greenlee_deallocate branch with some recommended changes (the bug fix mentioned above, and added constantness to input arguments of the tools).
This will not solve the issue. See the next note...
#12 Updated by Gianluca Petrillo about 2 years ago
- % Done changed from 10 to 50
The issue is triggered by
uboonecode/uboone/CalData/DeconTools/BaselineMostProbAve_tool.cc line 61 and following.
This can be observed replacing
roiHistVec[std::floor(2. * (holder[binIdx] - min))]++; with
roiHistVec.at(std::floor(2. * (holder[binIdx] - min)))++; (change that is not recommended unless debugging:
holder, of size
nbin, can be not large enough to host all the entries. This turns into the code trying to write beyond the limits of that array ("overrun").
The mode of failure for a buffer overrun can be anything. In the reported case, the overrun area hosted heap pointers that are used when deallocating memory, and that memory was assigned to a region of a
sparse_vector object, therefore destructing that object generated an invalid access. It is possible that the overrun has no consequence, if instead the overrun memory happens to be not allocated; the vector will still be shorter than due, so the physics results will be incorrect.
It is also not obvious to me that this overrun happens every time, nevertheless the most conservative assumption is that even if a job has succeeded, its results might be wrong.
Reassigning to the author.
#13 Updated by Gianluca Petrillo about 2 years ago
- Status changed from Work in progress to Resolved
Picking it back after Tracy request for convenience.
I just extended the too-short vector by one unit.
It turns out the failure comes when the range of the ADC counts in the region of interest is exactly an integer (probably less than just an accident). In that case, the histogram, implemented as a vector, is not large enough to accommodate the maximum value, and overrun occurs.
Pushed on the branch
greenlee_deallocate as uboonecode:8e8c7b51c3ed406f906fb47f844e7e8720b8be92 (together with a slight modernisation of the code).
I have tested it with a single failing file among the ones specified by Herbert, also with a modified code which would throw an exception instead of silently overrunning the array boundary.
Ready for extensive test.
The code in
develop also needs to be fixed, which I did not.
#15 Updated by Tracy Usher about 2 years ago
Thank you Gianluca for finding and recommending fix!
I have patched the version on the feature branch off develop (using art tools). I didn't include exception checking but hopefully this is no longer necessary.
Ah, I'm reminded to make the input vector const... will do that straight away.