Feature #11355

Support for a very different sort of secondary input ...

Added by Rob Kutschke almost 4 years ago. Updated almost 3 years ago.

Status: Feedback
Priority: Normal
Category: I/O
Target version: -
Start date: 01/05/2016
Due date: -
% Done: 0%
Estimated time: -
Scope: Internal
Experiment: Mu2e
SSI Package: art
Duration: -
Description

Mu2e has a multi-stage MC event-processing workflow in which some stages resample the output of the previous stage many times. I think that a workflow with just two stages is rich enough to describe the feature request.

Stage 1:
  1. Start with an empty event.
  2. Run an event generator.
  3. Stop processing when certain conditions occur.
  4. Select interesting events - most events will not be interesting.
  5. Write an art event-data file for all interesting events.
  6. Write an ntuple with minimal kinematic information about the interesting particles in interesting events; the ntuple also includes the art::EventID of the event and the G4 track number of each interesting particle. (A sketch of one ntuple record follows this list.)
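
To make step 6 concrete, here is a minimal sketch of what one ntuple record might contain. All names and the particular choice of kinematic variables are hypothetical, not Mu2e's actual schema:

  // Hypothetical layout of one stage 1 ntuple record; one entry per
  // interesting particle in an interesting event.
  struct Stage1Record {
    unsigned int run;     // (run, subRun, event) together encode the
    unsigned int subRun;  //   art::EventID of the stage 1 event
    unsigned int event;
    int trackNumber;      // G4 track number of the interesting particle
    double px, py, pz;    // minimal kinematics: momentum at the point of
    double x, y, z, t;    //   interest, plus position and time
  };
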
Stage 2:
  1. Start with an empty event.
  2. Run an event generator that randomly chooses a single particle from the ntuple produced in stage 1. Apply some randomization to this particle. Record the event ID and track number that were read from the ntuple, so that the stage 1 origin of the particle is known. (A sketch of such a generator follows this list.)
  3. Process the event through G4 to completion. Particles propagate through dense material, so additional randomization occurs.
  4. Select interesting events - most events will not be interesting.
  5. Write an art event-data file for all interesting events.
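
A minimal sketch of the stage 2 generator described in step 2, assuming the hypothetical Stage1Record above; the module name, product type, and seeding are illustrative only, not Mu2e's actual code:

  #include "art/Framework/Core/EDProducer.h"
  #include "art/Framework/Core/ModuleMacros.h"
  #include "art/Framework/Principal/Event.h"
  #include "fhiclcpp/ParameterSet.h"

  #include <memory>
  #include <random>
  #include <vector>

  struct Stage1Record {  // as sketched above
    unsigned int run, subRun, event;
    int trackNumber;
    double px, py, pz, x, y, z, t;
  };

  struct Stage1Origin {               // recorded so a later merge step can
    unsigned int run, subRun, event;  //   locate the parent stage 1 event
    int trackNumber;
  };

  class ResamplingGun : public art::EDProducer {
  public:
    explicit ResamplingGun(fhicl::ParameterSet const& pset)
      : EDProducer{pset}
    {
      produces<Stage1Origin>();
      // A real module would open the stage 1 ntuple here and fill
      // records_ from it; that bookkeeping is omitted.
    }

    void produce(art::Event& e) override
    {
      // Choose one stage 1 particle uniformly at random.
      std::uniform_int_distribution<std::size_t> pick{0, records_.size() - 1};
      auto const& rec = records_[pick(engine_)];
      // Apply some randomization to its kinematics and hand it to G4
      // (both omitted), then record where it came from.
      e.put(std::make_unique<Stage1Origin>(
          Stage1Origin{rec.run, rec.subRun, rec.event, rec.trackNumber}));
    }

  private:
    std::vector<Stage1Record> records_;  // loaded from the stage 1 ntuple
    std::mt19937 engine_{12345};         // a real module would use art's
                                         //   random number service
  };

  DEFINE_ART_MODULE(ResamplingGun)
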
Some comments:
  1. Stage 2 resamples the output of stage 1 many times - in the use case in question the reuse factor is 1,000,000. (There is not enough information here to explain why this level of reuse is statistically meaningful, but that is not important to the question at hand.)
  2. All event IDs produced in stage 1 are unique.
  3. All event IDs produced in stage 2 are unique.
  4. The set of event IDs from stage 1 and the set from stage 2 are disjoint.
  5. The effective statistics of the full workflow, after resampling, are O(10^20) protons on target.
  6. The final number of events written out will be O(10,000); this workflow is used to study a rare but dangerous background.

The request is the following. At the end of processing we want to assemble an art event-data file that contains, for each stage 2 event, all of the data products from stage 2 plus all of the data products from the relevant stage 1 event. For this use case the usual secondary input file technique will not work. The reasons are:

  1. By construction, the event from stage 1 and the event from stage 2 have different event IDs. This is needed to ensure that all events in stage 2 can be distinguished from each other.
  2. Both stages start with the EmptyEvent source, so there will be clashes of art::ProductIDs.
More comments:
  1. It is possible that one event from stage 1 will end up in several events from stage 2.
  2. For a given stage 2 event, we will always read exactly one event from stage 1.
  3. We do not require that the merging process preserve the provenance information of the stage 1 data products; we just want to pull in the stage 1 data products and make them accessible to analyzers. It is our responsibility to preserve enough information to navigate back to the provenance information in the stage 1 file should that become necessary. This is almost identical to the corresponding requirement for event mixing.
  4. It is our responsibility to reseat any Ptr or Assns objects read from the stage 1 file; we would appreciate having helper functions similar to those available for event mixing (see the sketch after this list).
  5. We do not require art to do any file handling. We will arrange that all necessary input files are visible on local disk when we run the jobs.
  6. If it will be helpful, we can provide information such as: these stage 1 events live in file A and these other stage 1 events live in file B. Just specify how it should be presented.
  7. The process of forming the merged file does not need to be heavily optimized for speed - we will run millions of CPU hours to make the MC sample and we will only do the merging once for O(10,000) events; we will read the merged file many times during the analysis phase.
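
For comparison, here is roughly what the analogous operation looks like with art's existing event-mixing interface; the product type and stage 1 module label are hypothetical, and the point is only that we would like Ptr-reseating help of this general shape for the feature requested above:

  #include "art/Framework/Core/ModuleMacros.h"
  #include "art/Framework/IO/ProductMix/MixHelper.h"
  #include "art/Framework/Modules/MixFilter.h"
  #include "art/Persistency/Common/CollectionUtilities.h"
  #include "art/Persistency/Common/PtrRemapper.h"
  #include "canvas/Utilities/InputTag.h"
  #include "fhiclcpp/ParameterSet.h"

  #include <vector>

  struct MyParticle { /* hypothetical stage 1 product */ };

  class Stage1MergeDetail {
  public:
    Stage1MergeDetail(fhicl::ParameterSet const& pset, art::MixHelper& helper)
    {
      // Register a mix operation: read the particle collection made by
      // the (hypothetical) stage 1 module "g4run" from the secondary
      // file and merge it into the current event.
      helper.declareMixOp(art::InputTag{"g4run"},
                          &Stage1MergeDetail::mixParticles, *this);
    }

    // Called once per primary event with the selected secondary products.
    bool mixParticles(std::vector<std::vector<MyParticle> const*> const& in,
                      std::vector<MyParticle>& out,
                      art::PtrRemapper const& remap)
    {
      // flattenCollections concatenates the inputs; the PtrRemapper is
      // what one would use to reseat art::Ptrs into the merged output.
      art::flattenCollections(in, out);
      return true;  // put the merged product into the event
    }
  };

  using Stage1Merger = art::MixFilter<Stage1MergeDetail>;
  DEFINE_ART_MODULE(Stage1Merger)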

In case you are wondering why we do not just run stage 2 starting from the stage 1 art event-data file: there are two reasons. First, when we resample, that solution would reuse event IDs, which would complicate downstream processing. Second, the actual workflow is many stages long; if we resample an early stage 1,000,000 times, then 1,000,000 duplicate copies of the stage 1 information are carried through the middle stages. It is true that after the final stage only O(10,000) events remain, but that happens only after many middle stages, so this solution has a storage media cost that is too high.

History

#1 Updated by Kyle Knoepfel almost 4 years ago

  • Category set to I/O
  • Status changed from New to Assigned
  • Assignee set to Christopher Green
  • SSI Package art added
  • SSI Package deleted ()

#2 Updated by Kyle Knoepfel almost 3 years ago

  • Status changed from Assigned to Feedback

Can Mu2e determine whether the recent enhancements to product mixing in fact satisfy the use case listed here?


