Segmentation fault reading run products
This was reported by MicroBooNE (Joseph Zennamo) to LArSoft.
A job produces a segmentation fault after attempting to read a run data product.
The issue is not triggered by the content of a specific input file, but it can be systematically reproduced.
Instructions on how to reproduce the issue will follow shortly.
#1 Updated by Gianluca Petrillo almost 5 years ago
I have simplified the problem so that the following job uses only:
- MicroBooNE configuration and input data
Steps to reproduce (requires access to MicroBooNE dCache area;
woof.fnal.gov has it):
source /products/setup setup uboonecode v05_11_01 -q e9:prof mkdir input wget -O input/good_25cm_numuInc.list https://cdcvs.fnal.gov/redmine/attachments/download/34565/good_25cm_numuInc.list mkdir job wget -O job/filter_fhicl_file_Length25_simplified.fcl https://cdcvs.fnal.gov/redmine/attachments/download/34564/filter_fhicl_file_Length25_simplified.fcl lar -c job/filter_fhicl_file_Length25_simplified.fcl -S input/good_25cm_numuInc.list
The end of the output message I get is:
%MSG Begin processing the 6th record. run: 6216 subRun: 12 event: 641 at 16-May-2016 13:07:58 CDT TimeEvent> run: 6216 subRun: 12 event: 641 1.09673e-05 %MSG-w LoadNewGeometry: PostSourceRun 16-May-2016 13:07:58 CDT run: 6218 cannot find sumdata::RunData object to grab detector name this is expected if generating MC files using default geometry from configuration file %MSG 16-May-2016 13:07:58 CDT Closed file /pnfs/uboone/scratch/users/jaz8600/numuInc_DATA_25cm_out/v05_11_00/10059810_998/PhysicsRun-2016_5_4_0_23_17-0006163-00048_20160504T072251_bnb_20160504T082406_merged_20160504T121310_reco1_20160504T132251_reco2_20160515T040145_wflash.root 16-May-2016 13:07:58 CDT Initiating request to open file /pnfs/uboone/scratch/users/jaz8600/numuInc_DATA_25cm_out/v05_11_00/10059811_999/PhysicsRun-2016_4_24_19_17_11-0006058-00144_20160425T073253_bnb_20160501T134116_merged_20160501T171250_reco1_20160501T233543_reco2_20160515T010459_wflash.root 16-May-2016 13:07:58 CDT Successfully opened file /pnfs/uboone/scratch/users/jaz8600/numuInc_DATA_25cm_out/v05_11_00/10059811_999/PhysicsRun-2016_4_24_19_17_11-0006058-00144_20160425T073253_bnb_20160501T134116_merged_20160501T171250_reco1_20160501T233543_reco2_20160515T010459_wflash.root %MSG-w LoadNewGeometry: PostSourceRun 16-May-2016 13:07:58 CDT run: 5985 cannot find sumdata::RunData object to grab detector name this is expected if generating MC files using default geometry from configuration file %MSG Begin processing the 7th record. run: 5985 subRun: 105 event: 5263 at 16-May-2016 13:07:58 CDT TimeEvent> run: 5985 subRun: 105 event: 5263 1.00136e-05 %MSG-w LoadNewGeometry: PostSourceRun 16-May-2016 13:07:58 CDT run: 6058 cannot find sumdata::RunData object to grab detector name this is expected if generating MC files using default geometry from configuration file %MSG %MSG-w LoadNewGeometry: PostSourceRun 16-May-2016 13:07:58 CDT run: 6070 cannot find sumdata::RunData object to grab detector name this is expected if generating MC files using default geometry from configuration file %MSG %MSG-w LoadNewGeometry: PostSourceRun 16-May-2016 13:07:58 CDT run: 6079 cannot find sumdata::RunData object to grab detector name this is expected if generating MC files using default geometry from configuration file %MSG %MSG-w LoadNewGeometry: PostSourceRun 16-May-2016 13:07:58 CDT run: 6143 cannot find sumdata::RunData object to grab detector name this is expected if generating MC files using default geometry from configuration file %MSG Segmentation fault
The Geometry service loads the geometry specified in the configuration file.
Then on each new run it checks if the geometry has changed by reading a
RunData object from the run.
If the object is not present, a warning id produced and no operation is supposed to take place.
The crash appears to be on a
Run::getManyByType() call at
larcore/larcore/Geometry/Geometry_service.cc line 70 (
LArSoft could use some help in solving this issue...
#3 Updated by Kyle Knoepfel almost 5 years ago
- Status changed from Accepted to Resolved
- Assignee set to Kyle Knoepfel
- % Done changed from 0 to 100
The failure is understood. It was sufficient to reproduce the error using two files:
What follows is a printout that shows the contents of each of the files:
============================== File: PhysicsRun-2016_4_24_19_17_11-0006058-00144_20160425T073253_bnb_20160501T134116_merged_20160501T171250_reco1_20160501T233543_reco2_20160515T010459_wflash.root Printing FileIndex contents. This includes a list of all Runs, SubRuns and Events stored in the root file. Run SubRun Event TTree Entry 5985 0 (Run) 5985 104 0 (SubRun) 5985 105 1 (SubRun) 5985 105 5263 0 6058 1 (Run) 6058 144 2 (SubRun) 6058 145 3 (SubRun) 6070 2 (Run) 6070 97 4 (SubRun) 6070 98 5 (SubRun) 6079 3 (Run) 6079 8 6 (SubRun) 6079 9 7 (SubRun) 6143 4 (Run) 6143 12 8 (SubRun)
616312 9 (SubRun)
616313 10 (SubRun)
6177 6 (Run)
6177 50 11 (SubRun)
6177 70 12 (SubRun)
6182 7 (Run)
6182 43 13 (SubRun)
6182 45 14 (SubRun)
621666 15 (SubRun)
621667 16 (SubRun)
6219 9 (Run)
6219 7 17 (SubRun)
6219 8 18 (SubRun)
============================== File: PhysicsRun-2016_5_4_0_23_17-0006163-00048_20160504T072251_bnb_20160504T082406_merged_20160504T121310_reco1_20160504T132251_reco2_20160515T040145_wflash.root Printing FileIndex contents. This includes a list of all Runs, SubRuns and Events stored in the root file. Run SubRun Event TTree Entry 5891 0 (Run) 5891 95 0 (SubRun) 5891 96 1 (SubRun) 5936 1 (Run) 5936 8 2 (SubRun) 5936 9 3 (SubRun) 5977 2 (Run) 5977 105 4 (SubRun) 5977 105 5285 0 5977 106 5 (SubRun) 5977 106 5302 1 5977 106 5319 2 5979 3 (Run) 5979 156 6 (SubRun) 6043 4 (Run) 6043 84 7 (SubRun) 6043 85 8 (SubRun) 6043 85 4258 3 6139 5 (Run) 6139 72 9 (SubRun) 6139 83 10 (SubRun) 6139 83 4167 5 6162 6 (Run) 6162 64 11 (SubRun) 6162 65 12 (SubRun)
616348 13 (SubRun)
621612 14 (SubRun)
621612 641 4
621613 15 (SubRun)
6218 9 (Run)
6218 96 16 (SubRun)
6218 97 17 (SubRun)
The sparseness of the file is not a problem. What is a problem, however, is that the runs 6163 and 6216 (highlighted in blue) are present in both files. The job is running in a mode where the "first run instance wins." In other words, the default processing mode is that all information associated with a given run corresponds to the first file in which it was seen.
In this particular case, whenever the second file is read and run
6163 is queried for the
sumdata::POTSummary object, since that run appeared in an earlier file, the lookup corresponds to that earlier file. Since no product was present in the first file, an attempt is made to check if any secondary files were configured in which the product might be residing. Unfortunately, in order to ask that question, the framework must ask the first file for its list of configured secondary files. The first file has already been closed and no longer exists, eventually leading to the segmentation violation.
This type of problem has been brought to our attention several times in the last several weeks/months. With the soon-to-be released art:version:"Arcturus" feature release, this problem has been solved, and users will not encounter it. However, until MicroBooNE moves to that release, there are a few options:
- Make sure there are no
SubRunduplicates in your
- Do not attempt to retrieve
SubRunproducts that are known to be absent in a given set of files, or
- If neither (1) or (2) are possible, then one can operate in the
NOMERGEprocessing mode. To do so, the following should be set in your FHiCL file
services.scheduler.fileMode: NOMERGEIn this mode, the "first run instance wins" facility is disabled. I have tested that this mode enables a successful job execution (i.e. no segmentation violation). However, be advised that the
NOMERGEmode means that a set of output files (the number of which is equal to the number of configured output modules) will be produced for each processed input file, leading to a multiplicity of output files that may not be desired. (If the file-renaming specifiers are not used -- as explained here -- then each time an input file is closed, an output file is also closed, possibly overwriting a previously closed output file.) For some, this is not a problem since retrieving the product is used only to create one
ntuplefor the entire job.
Please let us know if you any particular concerns or further questions about your specific case.