Feature #6557

multi-file product reading for ROOT input module

Added by Jim Kowalkowski over 5 years ago. Updated over 5 years ago.

Status:
Rejected
Priority:
Normal
Assignee:
-
Category:
I/O
Target version:
-
Start date:
07/01/2014
Due date:
08/31/2014
% Done:
0%
Estimated time:
0.00 h
Scope:
Internal
Experiment:
MicroBooNE
SSI Package:
art
Duration: 62

Description

We want to support reading of multiple ROOT input files where the data products for a single event ID are spread across many files. In other words, a new processing mode will be introduced that permits one event instance to be built up from data products contained in many input files. The current system requires that all products that might be accessed for one event be present within the active input file. This new feature permits access to earlier-created products without carrying them forward into the newly written output files that contain derived products.

Motivation: Standard operating procedures require that input files not be modified. It is common to carry products forward from input to output so that they are readily available for further processing stages. If the schema of the data products is acceptable, the “fast cloning” procedure in ROOT can be used to duplicate data from input to output without the substantial costs of decompression and deserialization. If it is not acceptable, the CPU and memory costs of replicating data from input to output can be very high. In addition, the current practice can require large amounts of temporary disk storage on worker nodes to handle the duplicated data. We have found that it is common for an experiment to carry forward large chunks of data from input to output. We have also found that fast cloning cannot be done in many cases, causing all of the resource problems outlined above.

The art ROOT I/O modules and product management data structures already carry the history and provenance necessary to consider moving to a multiple synchronized input file processing mode. Unique product ID assignment for new data products is very important. Rules will be established about the ancestry relationships across a set of input files so that this feature operates properly. This feature also removes the need for many of the current merge operations necessary for combining events across a run and for combining files.

Here are a few processing chains that illustrate valid and invalid configurations. Capital letters indicate file names, lowercase letters indicate products within the files, and “art” with a number indicates a processing stage.

Current valid processing chain configuration:
A(x) -> art1 -> B(x,y) -> art2 -> C(x,y,z) -> art4 -> D(z,u)
A,B,C,D -> art-merge -> E

Current invalid processing chain configuration (no common ancestry chain for all inputs):
A(x) -> art1 -> B(x,y) -> art3 -> C(x,y,z)
C -> art4 -> D(z,u)
C -> art5 -> E(y,w)
C,D,E -> art-merge -> F

Issue #6071 of art has examples of valid and invalid configurations for merging.

Future valid multiple synchronized input file configurations:
A(x) -> art1 -> B(y) [ y is derived from x ]
A,B -> art2 -> C(z,x) [ z is derived from x and y and references both ]
A,C -> art3 -> D(u) [ production of u uses z and requires access to x and y ]

Future invalid configuration:
A(x) -> art1 -> B(y) [ y is derived from x ]
A(x) -> art2 -> C(z) [ z is derived from x in a separate job ]
A,B,C -> art3 -> D(u,y,z) [ want to use z and y to make a u ]

Rules and constraints (initial listing or starting point):
0) there must be a primary driver file; it must be the latest file in the history chain.
1) all files that form an input file set used to reconstitute products must be available when the main driving file is opened.
2) any file in a common ancestry chain can be used to draw additional products from a full event.
3) all products will be pulled and reconstituted on demand, i.e. when they are asked for in the event interface (see the sketch after this list).
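
As a sketch of what rule 3 means at the call site (illustrative only: MyAnalyzer and the "makerX" label are hypothetical, header locations vary between art releases, and this is not a complete module), product access goes through the usual art::Event interface, so any secondary-file read would be triggered at the point of the request rather than when the primary file is opened:

#include "art/Framework/Principal/Event.h"
#include "art/Framework/Principal/Handle.h"
#include "art/Utilities/InputTag.h" // canvas/Utilities/InputTag.h in later releases
#include <vector>

// Hypothetical analyzer fragment: nothing at the call site says which
// file of the active input file set actually holds the product.
class MyAnalyzer {
public:
  void analyze(art::Event const& e);
};

void MyAnalyzer::analyze(art::Event const& e)
{
  art::Handle<std::vector<double>> h;
  if (e.getByLabel(art::InputTag{"makerX"}, h)) {
    // On-demand read: under this feature, any access to a secondary
    // file in the set would happen here, transparently to the module.
  }
}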

Notes:
- a DB in the ROOT file will hold metadata about the event shape, event and run number ranges, and other history items to facilitate this processing. Some of the information will be redundant with the standard file metadata until this facility matures.
- if many files must be held open to form a full set to pull products from, we may need to use an LRU algorithm to close out files and limit resource consumption (mostly memory); see the sketch after these notes.
- the current multiple sequential file feature will still operate as it does now (events from runs and subruns spread out amongst many separate input files). With this new multiple file feature, the sequence of files must turn into a sequence of file sets.
- when multi-schedule art becomes real, several file sets will need to be active at boundary conditions when using the sequence-of-file-sets option.
- there will be some design work needed for cleanly specifying the set of files to pull events and products from when events are located across many files. See the configuration possibilities later in this description.
- other whiteboard note - uniqueness of event IDs within a file and within a chain is very important.
- other whiteboard note - when concatenating, be sure to sort events into the separate files, e.g. going from 100 files to 10, make sure each of the 10 outputs has event IDs in increasing order.
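
A minimal sketch of the LRU idea mentioned in the notes above (illustrative only: SecondaryFileLRU and FileHandle are hypothetical names, not part of art, with FileHandle standing in for an open ROOT file plus its index):

#include <cstddef>
#include <list>
#include <memory>
#include <string>
#include <unordered_map>
#include <utility>

struct FileHandle {            // stand-in for an open TFile + its event index
  explicit FileHandle(std::string n) : name(std::move(n)) {}
  std::string name;
};

class SecondaryFileLRU {
public:
  explicit SecondaryFileLRU(std::size_t maxOpen) : maxOpen_(maxOpen) {}

  // Return an open handle for 'name', opening the file (and evicting the
  // least-recently-used one if the limit is reached) as needed.
  FileHandle& open(std::string const& name) {
    auto it = index_.find(name);
    if (it != index_.end()) {
      // Already open: move to the front (most recently used).
      lru_.splice(lru_.begin(), lru_, it->second);
      return *lru_.front().second;
    }
    if (!lru_.empty() && lru_.size() >= maxOpen_) {
      // Evict the least-recently-used file; the unique_ptr closes it.
      index_.erase(lru_.back().first);
      lru_.pop_back();
    }
    lru_.emplace_front(name, std::make_unique<FileHandle>(name));
    index_[name] = lru_.begin();
    return *lru_.front().second;
  }

private:
  using Entry = std::pair<std::string, std::unique_ptr<FileHandle>>;
  std::size_t maxOpen_;
  std::list<Entry> lru_;        // most recently used at the front
  std::unordered_map<std::string, std::list<Entry>::iterator> index_;
};

The list keeps the most-recently-used entries at the front and the map gives constant-time lookup from file name to list position, so eviction always closes whichever file has gone unused longest.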

Configuration:
Given a standard GRID or cluster processing scenario for a production run that looks like this -
a1 -> a2 -> a3 -> a4 -> a5
b1 -> b2 -> b3 -> b4 -> b5
c1 -> c2 -> c3 -> c4 -> c5
One scenario is to merge the events from { a5, b5, c5 } into one file while doing a further reduction, and call the output d5. This works fine because the configuration of physics algorithms is consistent and therefore the product ID use will be consistent. There are also no overlapping events in a, b, and c. With this new feature, it will be possible to configure a job that reads d5 as follows:
fileNames: [ d5, [a5,b5,c5], [a4,b4,c4], [a3,b3,c3], [a2,b2,c2], [a1,b1,c1] ]
If the products from the 3 series (a3, b3, c3) were not needed to resolve anything, the configuration could be:
fileNames: [ d5, [a5,b5,c5], [a4,b4,c4], [a2,b2,c2], [a1,b1,c1] ]
There is also no need to have files of similar granularity at each stage. If there are more files in the 1 series, for example, the configuration might look like this:
fileNames: [ d5, [a5,b5,c5], [a4,b4,c4], [a2,b2,c2], [a1a,a1b,b1a,b1b,c1a,c1b] ]


Related issues

Related to art - Bug #8897: Need restoration of erroneously-trimmed interface in art::OutputModule and art::ProductRegistryHelper (Closed, 05/22/2015)

Is duplicate of art - Feature #6071: Merging input files (Closed, 04/29/2014)

Associated revisions

Revision cb4ca811 (diff)
Added by Christopher Green over 4 years ago

Implementation of secondary file reading (feature #6071 and other requirements as specified in issue #6557).

History

#1 Updated by Christopher Green over 5 years ago

  • Status changed from New to Rejected
  • Estimated time changed from 100.00 h to 0.00 h

This duplicates issue #6071, but contains new information. Time estimate has been reduced to zero. Information will be accessible by following the link from the original issue.

#2 Updated by Christopher Green over 4 years ago

  • Related to Bug #8897: Need restoration of erroneously-trimmed interface in art::OutputModule and art::ProductRegistryHelper added

