Project

General

Profile

Bug #12648

Segmentation fault reading run products

Added by Gianluca Petrillo over 4 years ago. Updated about 4 years ago.

Status:
Closed
Priority:
Normal
Assignee:
Category:
I/O
Target version:
-
Start date:
05/16/2016
Due date:
% Done:

100%

Estimated time:
2.00 h
Spent time:
Occurs In:
Scope:
Internal
Experiment:
LArSoft
SSI Package:
art
Duration:

Description

This was reported by MicroBooNE (Joseph Zennamo) to LArSoft.
A job produces a segmentation fault after attempting to read a run data product.
The issue is not triggered by the content of a specific input file, but it can be systematically reproduced.

Instructions on how to reproduce the issue will follow shortly.

filter_fhicl_file_Length25_simplified.fcl (540 Bytes) filter_fhicl_file_Length25_simplified.fcl Exercises only LArSoft Geometry service loading Gianluca Petrillo, 05/16/2016 12:59 PM
good_25cm_numuInc.list (242 KB) good_25cm_numuInc.list MicroBooNE data input list Gianluca Petrillo, 05/16/2016 01:00 PM

History

#1 Updated by Gianluca Petrillo over 4 years ago

I have simplified the problem so that the following job uses only:

  • LArSoft geo::Geometry service
  • art
  • MicroBooNE configuration and input data

Steps to reproduce (requires access to MicroBooNE dCache area; woof.fnal.gov has it):

source /products/setup
setup uboonecode v05_11_01 -q e9:prof
mkdir input
wget -O input/good_25cm_numuInc.list https://cdcvs.fnal.gov/redmine/attachments/download/34565/good_25cm_numuInc.list
mkdir job
wget -O job/filter_fhicl_file_Length25_simplified.fcl https://cdcvs.fnal.gov/redmine/attachments/download/34564/filter_fhicl_file_Length25_simplified.fcl 
lar -c job/filter_fhicl_file_Length25_simplified.fcl -S input/good_25cm_numuInc.list

The end of the output message I get is:

%MSG
Begin processing the 6th record. run: 6216 subRun: 12 event: 641 at 16-May-2016 13:07:58 CDT
TimeEvent> run: 6216 subRun: 12 event: 641 1.09673e-05
%MSG-w LoadNewGeometry:  PostSourceRun 16-May-2016 13:07:58 CDT run: 6218
cannot find sumdata::RunData object to grab detector name
this is expected if generating MC files
using default geometry from configuration file

%MSG
16-May-2016 13:07:58 CDT  Closed file /pnfs/uboone/scratch/users/jaz8600/numuInc_DATA_25cm_out/v05_11_00/10059810_998/PhysicsRun-2016_5_4_0_23_17-0006163-00048_20160504T072251_bnb_20160504T082406_merged_20160504T121310_reco1_20160504T132251_reco2_20160515T040145_wflash.root
16-May-2016 13:07:58 CDT  Initiating request to open file /pnfs/uboone/scratch/users/jaz8600/numuInc_DATA_25cm_out/v05_11_00/10059811_999/PhysicsRun-2016_4_24_19_17_11-0006058-00144_20160425T073253_bnb_20160501T134116_merged_20160501T171250_reco1_20160501T233543_reco2_20160515T010459_wflash.root
16-May-2016 13:07:58 CDT  Successfully opened file /pnfs/uboone/scratch/users/jaz8600/numuInc_DATA_25cm_out/v05_11_00/10059811_999/PhysicsRun-2016_4_24_19_17_11-0006058-00144_20160425T073253_bnb_20160501T134116_merged_20160501T171250_reco1_20160501T233543_reco2_20160515T010459_wflash.root
%MSG-w LoadNewGeometry:  PostSourceRun 16-May-2016 13:07:58 CDT run: 5985
cannot find sumdata::RunData object to grab detector name
this is expected if generating MC files
using default geometry from configuration file

%MSG
Begin processing the 7th record. run: 5985 subRun: 105 event: 5263 at 16-May-2016 13:07:58 CDT
TimeEvent> run: 5985 subRun: 105 event: 5263 1.00136e-05
%MSG-w LoadNewGeometry:  PostSourceRun 16-May-2016 13:07:58 CDT run: 6058
cannot find sumdata::RunData object to grab detector name
this is expected if generating MC files
using default geometry from configuration file

%MSG
%MSG-w LoadNewGeometry:  PostSourceRun 16-May-2016 13:07:58 CDT run: 6070
cannot find sumdata::RunData object to grab detector name
this is expected if generating MC files
using default geometry from configuration file

%MSG
%MSG-w LoadNewGeometry:  PostSourceRun 16-May-2016 13:07:58 CDT run: 6079
cannot find sumdata::RunData object to grab detector name
this is expected if generating MC files
using default geometry from configuration file

%MSG
%MSG-w LoadNewGeometry:  PostSourceRun 16-May-2016 13:07:58 CDT run: 6143
cannot find sumdata::RunData object to grab detector name
this is expected if generating MC files
using default geometry from configuration file

%MSG
Segmentation fault

The Geometry service loads the geometry specified in the configuration file.
Then on each new run it checks if the geometry has changed by reading a RunData object from the run.
If the object is not present, a warning id produced and no operation is supposed to take place.
The crash appears to be on a Run::getManyByType() call at larcore/larcore/Geometry/Geometry_service.cc line 70 (geo::Geometry::preBeginRun()).

LArSoft could use some help in solving this issue...

#2 Updated by Kyle Knoepfel over 4 years ago

  • Status changed from New to Accepted
  • Estimated time set to 2.00 h

We are investigating.

#3 Updated by Kyle Knoepfel over 4 years ago

  • Status changed from Accepted to Resolved
  • Assignee set to Kyle Knoepfel
  • % Done changed from 0 to 100

The failure is understood. It was sufficient to reproduce the error using two files:

  • PhysicsRun-2016_4_24_19_17_11-0006058-00144_20160425T073253_bnb_20160501T134116_merged_20160501T171250_reco1_20160501T233543_reco2_20160515T010459_wflash.root
  • PhysicsRun-2016_5_4_0_23_17-0006163-00048_20160504T072251_bnb_20160504T082406_merged_20160504T121310_reco1_20160504T132251_reco2_20160515T040145_wflash.root

What follows is a printout that shows the contents of each of the files:

==============================
File: PhysicsRun-2016_4_24_19_17_11-0006058-00144_20160425T073253_bnb_20160501T134116_merged_20160501T171250_reco1_20160501T233543_reco2_20160515T010459_wflash.root

Printing FileIndex contents.  This includes a list of all Runs, SubRuns
and Events stored in the root file.

            Run         SubRun          Event    TTree Entry
           5985                                            0  (Run)
           5985            104                             0  (SubRun)
           5985            105                             1  (SubRun)
           5985            105           5263              0
           6058                                            1  (Run)
           6058            144                             2  (SubRun)
           6058            145                             3  (SubRun)
           6070                                            2  (Run)
           6070             97                             4  (SubRun)
           6070             98                             5  (SubRun)
           6079                                            3  (Run)
           6079              8                             6  (SubRun)
           6079              9                             7  (SubRun)
           6143                                            4  (Run)
           6143             12                             8  (SubRun)
           6163                                            5  (Run)
6163 12 9 (SubRun)
6163 13 10 (SubRun)
6177 6 (Run)
6177 50 11 (SubRun)
6177 70 12 (SubRun)
6182 7 (Run)
6182 43 13 (SubRun)
6182 45 14 (SubRun)
6216 8 (Run)
6216 66 15 (SubRun)
6216 67 16 (SubRun)
6219 9 (Run)
6219 7 17 (SubRun)
6219 8 18 (SubRun)

and

==============================
File: PhysicsRun-2016_5_4_0_23_17-0006163-00048_20160504T072251_bnb_20160504T082406_merged_20160504T121310_reco1_20160504T132251_reco2_20160515T040145_wflash.root

Printing FileIndex contents.  This includes a list of all Runs, SubRuns
and Events stored in the root file.

            Run         SubRun          Event    TTree Entry
           5891                                            0  (Run)
           5891             95                             0  (SubRun)
           5891             96                             1  (SubRun)
           5936                                            1  (Run)
           5936              8                             2  (SubRun)
           5936              9                             3  (SubRun)
           5977                                            2  (Run)
           5977            105                             4  (SubRun)
           5977            105           5285              0
           5977            106                             5  (SubRun)
           5977            106           5302              1
           5977            106           5319              2
           5979                                            3  (Run)
           5979            156                             6  (SubRun)
           6043                                            4  (Run)
           6043             84                             7  (SubRun)
           6043             85                             8  (SubRun)
           6043             85           4258              3
           6139                                            5  (Run)
           6139             72                             9  (SubRun)
           6139             83                            10  (SubRun)
           6139             83           4167              5
           6162                                            6  (Run)
           6162             64                            11  (SubRun)
           6162             65                            12  (SubRun)
           6163                                            7  (Run)
6163 48 13 (SubRun)
6216 8 (Run)
6216 12 14 (SubRun)
6216 12 641 4
6216 13 15 (SubRun)
6218 9 (Run)
6218 96 16 (SubRun)
6218 97 17 (SubRun)

The sparseness of the file is not a problem. What is a problem, however, is that the runs 6163 and 6216 (highlighted in blue) are present in both files. The job is running in a mode where the "first run instance wins." In other words, the default processing mode is that all information associated with a given run corresponds to the first file in which it was seen.

In this particular case, whenever the second file is read and run 6163 is queried for the sumdata::POTSummary object, since that run appeared in an earlier file, the lookup corresponds to that earlier file. Since no product was present in the first file, an attempt is made to check if any secondary files were configured in which the product might be residing. Unfortunately, in order to ask that question, the framework must ask the first file for its list of configured secondary files. The first file has already been closed and no longer exists, eventually leading to the segmentation violation.

This type of problem has been brought to our attention several times in the last several weeks/months. With the soon-to-be released art:version:"Arcturus" feature release, this problem has been solved, and users will not encounter it. However, until MicroBooNE moves to that release, there are a few options:

  1. Make sure there are no Run and SubRun duplicates in your art/ROOT files, or
  2. Do not attempt to retrieve Run and SubRun products that are known to be absent in a given set of files, or
  3. If neither (1) or (2) are possible, then one can operate in the NOMERGE processing mode. To do so, the following should be set in your FHiCL file
    services.scheduler.fileMode: NOMERGE
    In this mode, the "first run instance wins" facility is disabled. I have tested that this mode enables a successful job execution (i.e. no segmentation violation). However, be advised that the NOMERGE mode means that a set of output files (the number of which is equal to the number of configured output modules) will be produced for each processed input file, leading to a multiplicity of output files that may not be desired. (If the file-renaming specifiers are not used -- as explained here -- then each time an input file is closed, an output file is also closed, possibly overwriting a previously closed output file.) For some, this is not a problem since retrieving the product is used only to create one ntuple for the entire job.

Please let us know if you any particular concerns or further questions about your specific case.

#4 Updated by Kyle Knoepfel about 4 years ago

  • Status changed from Resolved to Closed

Also available in: Atom PDF