Support #12545

Segfault when input files are out-of-(Sub)Run-order

Added by Kyle Knoepfel over 4 years ago. Updated over 4 years ago.

Status:
Closed
Priority:
Normal
Assignee:
Category:
Infrastructure
Target version:
-
Start date:
04/29/2016
Due date:
05/02/2016
% Done:

100%

Estimated time:
Spent time:
Scope:
Internal
Experiment:
NOvA
SSI Package:
art
Duration: 4

Description

Martin Frank reported:

I ran into an issue when running over several files whose runs and subruns are not sorted: gdb traced the segfault to art::FileIndex::findRunPosition(). The attached gdb session log file shows the details (go all the way to the end).

Each file corresponds to a subrun in NOvA, so when I run over the following files in the stated order, there is no problem:

File 1: Run 19735, Subrun 60
File 2: Run 19735, Subrun 00

When I insert a file from another run between them, as follows, I get the segmentation fault shown in the attached log file:

File 1: Run 19735, Subrun 60
File 2: Run 19728, Subrun 27
File 3: Run 19735, Subrun 00

I am using art v1_17_06 -q e9:s28:prof.

gdb.log (240 KB) Kyle Knoepfel, 05/06/2016 01:19 PM

History

#1 Updated by Kyle Knoepfel over 4 years ago

  • Category set to Infrastructure
  • Status changed from New to Closed
  • Assignee set to Kyle Knoepfel
  • % Done changed from 0 to 100
  • SSI Package art added
  • SSI Package deleted ()

My response to Martin was:

[snipped...]

There are two simple ways to proceed. One is to make sure that you’re presenting runs in a contiguous order. The other is to change the fileMode parameter with which you’re running. Specify somewhere in your FHiCL file:

services.scheduler.fileMode: NOMERGE

This has the benefit of re-reading the RunData product from every input file when it’s requested (i.e. not trying to keep and use the product from the first file that had one). The downside is that it can wreak havoc with your output modules if you’re not careful. If you’d like to pursue this solution, stop by my office and we’ll work through the details.
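For the first option (presenting runs in a contiguous order), a small pre-check on the file list can help before submitting a job. The following is only a sketch and is not part of art: the function names are hypothetical, and it assumes you already know each file's (run, subrun) pair, for example from your file catalog. It checks only that each run forms one unbroken block, since that is the condition described above; it does not sort subruns within a run.

```python
def runs_are_contiguous(files):
    """Return True if every run number occupies one unbroken block of the list.

    `files` is a list of (run, subrun) tuples in the order the files
    would be handed to the job (an assumed representation).
    """
    seen = set()
    previous = None
    for run, _subrun in files:
        if run != previous:
            if run in seen:
                # This run already appeared earlier and was interrupted
                # by a different run.
                return False
            seen.add(run)
            previous = run
    return True


def group_by_run(files):
    """Reorder so that each run is contiguous.

    Runs keep their first-seen order, and files within a run keep
    their original relative order (dicts preserve insertion order).
    """
    grouped = {}
    for run, subrun in files:
        grouped.setdefault(run, []).append((run, subrun))
    return [entry for entries in grouped.values() for entry in entries]


# The file order from the report that triggered the segfault.
reported = [(19735, 60), (19728, 27), (19735, 0)]
print(runs_are_contiguous(reported))
print(group_by_run(reported))
```

Grouping rather than fully sorting keeps the file order as close as possible to what was originally requested, which matches the observation in the report that subrun 60 followed by subrun 00 of the same run caused no problem.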

In the next version of art:

We have run into this kind of problem multiple times. In principle, art can handle files that are introduced in an order where run/subrun numbers are non-contiguous. However, doing so with current art is a bit cumbersome, as you’ve already encountered. In the next feature release of art, Run and SubRun products will not be cached across multiple input files, so you will never encounter this kind of behavior. The feature-release version (which is based on ROOT6) is likely to be released in the next few weeks.

It turns out that the snipped portion of the email I wrote was slightly incorrect; however, the conclusions of the email were accurate. To confirm, Paul and I then spent some time tracing through the code, and Martin stopped by while Paul navigated through the program using gdb. This confirmed that the caching of the Run and SubRun principals in the default file mode was causing the problem.

As noted, this will not be an issue in the next feature release of art, and Martin said he was able to proceed in the meantime using the NOMERGE file mode.

#2 Updated by Kyle Knoepfel over 4 years ago

  • Due date set to 05/02/2016
  • Start date changed from 05/06/2016 to 04/29/2016

