Bug #24660

Core dump in Event::getMany using an art::Selector

Added by Rob Kutschke 3 months ago. Updated 3 months ago.

Status: Closed
Priority: High
Assignee:
Category: Infrastructure
Target version:
Start date: 07/23/2020
Due date:
% Done: 100%
Estimated time: 4.00 h
Spent time:
Occurs In:
Scope: Internal
Experiment: Mu2e
SSI Package: art
Duration:

Description

In Mu2e Offline there is a core dump in a call to getMany with an art::Selector that is formed from the AND of two other selectors. This occurs with art v3_06_01.

I have an example in the repo

git clone https://github.com/kutschke/Offline.git

on the branch

art_v3_06_00_hack

To reproduce the bug, build the branch of Mu2e Offline in the normal way and run it with:

mu2e -c Mu2eG4/fcl/g4test_03.fcl  >& g4test_03.log

The core dump occurs in the call to getMany at the line:

https://github.com/kutschke/Offline/blob/3f6a689f8296180c6077d08bce7d97808a426ea7/TrackerMC/src/StrawDigisFromStrawGasSteps_module.cc#L512

I have instrumented the code before and after with flushed printout. The last line you should see is:

The crash comes inside here: ...

The traceback from gdb indicates infinite recursion that eventually leads to resource exhaustion:

(gdb) where
#0 0x00007fffd86e50a4 in art::SelectorBase::match (p=..., this=0x47e33a8) at /cvmfs/mu2e.opensciencegrid.org/artexternals/art/v3_06_01/include/art/Framework/Principal/SelectorBase.h:43
#1 art::AndHelper<art::Selector&, art::ModuleLabelSelector>::doMatch (p=..., this=0x47e3a68) at /cvmfs/mu2e.opensciencegrid.org/artexternals/art/v3_06_01/include/art/Framework/Principal/Selector.h:186
#2 art::SelectorBase::match (p=..., this=0x47e3a68) at /cvmfs/mu2e.opensciencegrid.org/artexternals/art/v3_06_01/include/art/Framework/Principal/SelectorBase.h:43
#3 art::ComposedSelectorWrapper<art::AndHelper<art::Selector&, art::ModuleLabelSelector> >::doMatch (this=0x47e3a60, p=...)
at /cvmfs/mu2e.opensciencegrid.org/artexternals/art/v3_06_01/include/art/Framework/Principal/Selector.h:296
#4 0x00007fffd86e50a7 in art::SelectorBase::match (p=..., this=<optimized out>) at /cvmfs/mu2e.opensciencegrid.org/artexternals/art/v3_06_01/include/art/Framework/Principal/SelectorBase.h:43
#5 art::AndHelper<art::Selector&, art::ModuleLabelSelector>::doMatch (p=..., this=0x47e3a68) at /cvmfs/mu2e.opensciencegrid.org/artexternals/art/v3_06_01/include/art/Framework/Principal/Selector.h:186
#6 art::SelectorBase::match (p=..., this=0x47e3a68) at /cvmfs/mu2e.opensciencegrid.org/artexternals/art/v3_06_01/include/art/Framework/Principal/SelectorBase.h:43
#7 art::ComposedSelectorWrapper<art::AndHelper<art::Selector&, art::ModuleLabelSelector> >::doMatch (this=0x47e3a60, p=...)
at /cvmfs/mu2e.opensciencegrid.org/artexternals/art/v3_06_01/include/art/Framework/Principal/Selector.h:296

This goes on for a few hundred thousand lines; I never reached the end of it. You can see a script capture of the session at:

/home/kutschke/Mu2e/Offline/kutschke/Offline/debug.log

Note that art::AndHelper appears in every repeat of the traceback.

To make this example work I had to put in temporary hacks to work around two other bugs. The affected files are:

TrackerMC/src/MakeStrawGasSteps_module.cc
Mu2eG4/src/Mu2eWorld.cc

History

#1 Updated by Kyle Knoepfel 3 months ago

  • Estimated time set to 4.00 h
  • Assignee set to Kyle Knoepfel
  • Status changed from New to Assigned
  • Category set to Infrastructure

[Sigh.] Well, this is good and bad. Bad that this further delays a potential bug fix release, and good that it exposes where more tests need to be made in the framework.

This will receive priority tomorrow, along with issue #24659.

#2 Updated by Kyle Knoepfel 3 months ago

  • % Done changed from 0 to 100
  • Status changed from Assigned to Resolved

The problem arose because of a failed attempt at using "perfect forwarding" in commit art:060e6569. Due to complications regarding template deductions of references, the selector design has been reverted to a clearer ownership model. The perfect-forwarding approach can be tried again once it becomes apparent that its efficiency benefits outweigh the requisite code complexity. My apologies.

Implemented with commit art:b16c2ed3. I have verified that this commit solves the issue above, as indicated by the printout:

Begin processing the 1st record. run: 1 subRun: 0 event: 1 at 24-Jul-2020 14:44:32 CDT
The crash comes inside here: ...

... and we never see this  # [KJK: Now we do]
Begin processing the 2nd record. run: 1 subRun: 0 event: 2 at 24-Jul-2020 14:44:37 CDT
...
Begin processing the 3rd record. run: 1 subRun: 0 event: 3 at 24-Jul-2020 14:44:38 CDT
...

art 3.06.02 forthcoming.

#3 Updated by Kyle Knoepfel 3 months ago

  • Target version set to 3.06.02

#4 Updated by Kyle Knoepfel 3 months ago

  • Status changed from Resolved to Closed

