Feature #2815

Merging/mixing multiple files event-by-event

Added by Christopher Backhouse about 7 years ago. Updated about 7 years ago.

Status:
Rejected
Priority:
Normal
Assignee:
-
Category:
Navigation
Target version:
-
Start date:
07/06/2012
Due date:
% Done:

0%

Estimated time:
Scope:
Internal
Experiment:
-
SSI Package:
Duration:

Description

Hi,

I would like to be able to take a bunch of "overlay" secondary art files (and optionally a "primary") and have access to all the data products from them for a specific event.

This is similar to what MixFilter currently allows, except that MixFilter is designed for essentially random overlaying, which is ideal for cosmic/rock overlays and potentially for simulating different intensities.

What I'm actually doing is this: I have a PID process that takes a large amount of memory. It needs to be split into multiple sub-jobs, each writing its own output, which can then be recombined to get the final PID result.

Marc Paterno and I took a look at the code currently in MixFilter/MixHelper, and it looks like it's engineered to always provide the overlay events from one file at a time, which isn't what I need.
On the other hand, I have no need for the random access features, and probably not the most generic merging support.

Interface:
I need to be able to specify multiple input files in the job fcl, ideally with glob syntax.

For each event in the primary file, I need to have access to the products of the corresponding event in each secondary file. I'm able to guarantee that each file has exactly the same event numbers in the same sequence, and it can be a fatal error if there's a mismatch. More generally, perhaps if an event doesn't exist in one of the files it should simply return no products.

A MixFilter-like interface would work OK. It might even be possible to rearchitect MixFilter so it can do this.

For my specific use-case the ideal interface would be to have something like a producer with function:
void produce(art::Event& evt, std::vector<const art::Event*> const& secondaries);
where each secondary corresponds to one of the secondary files, and is set to the same event number as the primary.
If it simplifies implementation, I (again personally) would be OK with the restriction that I couldn't write Ptrs to the secondaries into the output stream.

Other uses I have in mind within NOvA (eg combining muon-removed and vanilla files for comparison) might want some default merge semantics, where products in the secondaries automatically end up in the primary stream. I'm not sure how they'd be labelled...
This is OK for me too. The PID objects from the secondary can all end up in the stream, and I can merge them at a later step.

Let me know if you need any clarification for what I have in mind. I'm pretty sure Marc had a clear picture when I spoke to him though.
He also mentioned that having an art file open for reading costs significant memory. Can someone quantify that?

As far as I know I'm the only person asking for this right now. But I'm told it shouldn't be too difficult to do, and I do currently have an immediate need for it.

Thanks - Chris

History

#1 Updated by Christopher Backhouse about 7 years ago

Oh, I forgot. I propose the nomenclature "mixing" for the uncorrelated-files mode currently implemented by MixFilter, and "merging" for the files-contain-versions-of-the-same-events feature I'm requesting here.

#2 Updated by Christopher Green about 7 years ago

  • Category set to Navigation
  • Status changed from New to Accepted

I've just read over your request again with all its details, and it seems that in the first instance you're looking at a particular product: a PID object. Your description suggests that you are running a memory-hungry algorithm that you need to split into smaller chunks, each providing a different piece of the answer. We would certainly be willing to take a look at the algorithm to see whether its memory requirements can be mitigated at all, perhaps using the C++11 features that will be available with the next art suite / art externals release. It also seems a little strange to me that serializing across different executables would mitigate memory usage while serializing within a module would not. I'm sure this could be made clear to me with a little more detailed discussion of the problem (and perhaps some code snippets), but I'd like to be sure that we're not contemplating an overly general solution to a problem that is solvable more easily some other way.

However, treating your request on its face seems the most efficient way to progress for now. Providing an interface for merging multiple representations of the same event is certainly doable, although my analysis has yet to tell me whether the better solution is to extend MixFilter functionality or to have something similar but different for this purpose specifically.

One question: would you be happy with the entity to be put into the composite being a collection of objects rather than a single one? This would solve the problem of declaring n products for the merge, having to do something programmatic with instance labels and then read them on the analysis side with a GetMany call and sort through them. A variation on this would be to have the system merge only collections in this way.

Regardless of your answer to the above question, I expect it to be relatively straightforward once we have the general solution in place to provide a detail object that would handle the automatic merging of all products.

There are a couple of wrinkles to the general solution to this problem that may not be obvious, so I'll spell them out:

  1. You will need to think about provenance very carefully: what information might you need about these merged products?
  2. Ensuring that the event information in the primary matches that in the secondaries needs to be handled outside the merging system because of the separation in functionality between the input source and the producers / filters. You would need to have the source configuration set to read one of the fragment files and remove all of its products with a DropOnInput directive.

I think that's it for now. Please update the ticket when you have had chance to consider the questions asked here, and we will move on this as soon as we can given NOvA's other priorities.

#3 Updated by Christopher Backhouse about 7 years ago

Christopher Green wrote:

It also seems a little strange to me that serializing across different executables would mitigate memory usage while serializing within a module would not. I'm sure this could be made clear to me with a little more detailed discussion of the problem (and perhaps some code snippets), but I'd like to be sure that we're not contemplating an overly general solution to a problem that is solvable more easily some other way.

The memory problem here is a general one that I don't believe can be fixed another way. The PID operates by comparing the input event to a large library of Monte Carlo events. The representation of those is already fairly optimized (a TTree of channel IDs and charges). Right now I use 6 GB of memory, which is getting difficult to run on clusters, and we may well increase the library size by over an order of magnitude. Matching is not so time-consuming that re-reading the library file sequentially for each new trial event would make sense.
Splitting into multiple sequential jobs, each matching against a small part of the library, will work fine though. A similar PID was used on MINOS, and they had to use a similar technique.
Merging the various match lists is then an easy task.

One question: would you be happy with the entity to be put into the composite being a collection of objects rather than a single one? This would solve the problem of declaring n products for the merge, having to do something programmatic with instance labels and then read them on the analysis side with a GetMany call and sort through them. A variation on this would be to have the system merge only collections in this way.

Sure. So long as the information makes it into the stream in some format, I can have a downstream module massage it; I already need to do a merging operation specific to my PID. Right now I think I have a vector<PID> in the stream, so a vector<vector<PID> > that I can flatten later is fine, as is a pre-flattened vector<PID>.
I'd probably use a "drop" command in the output so that all my independent runs produce only PID objects, and then merge those onto the original file, which has had no PID run on it yet.

  1. You will need to think about provenance very carefully: what information might you need about these merged products?

I'm willing to totally forgo provenance information, but that's probably not within the art philosophy...
For this PID application it's probably fine to include the details of just one of the merged files. For some of the other applications I had hazily in mind that's probably not enough.

  2. Having the event information in the primary match that in the secondaries needs to be handled outside the merging system because of the separation in functionality between the input source and the producers / filters. You would need to have the source configuration set to read one of the fragment files and remove all of its products with a DropOnInput directive.

I'm not sure I totally understand this. But if it's just syntax needed in the fcl file then that's fine, probably I will understand it when I see it spelled out.

Feel free to contact me, in any medium, for more details.

#4 Updated by Christopher Backhouse about 7 years ago

Here's another attempt to explain the physics behind my problem, and a sketch of what the minimal feature that would make me happy could look like.

This PID (which, as I said, has a history from MINOS, so isn't completely crazy) tries to find the best matches between the event to be ID'd and a large library of examples from Monte Carlo. The larger the better. The match is in terms of essentially a picture of all the hits in each event. Whether or not my storage can be optimized by another factor of two or so, eventually the MC library will be so large that it can't all fit in any reasonable amount of memory (batch nodes tend to have 1-2 GB per core; I have an 8 GB library, and it is still growing).

To construct the PID over many events, imagine filling in a matrix. Rows are input events that need ID'ing. Columns are events in the library. In the normal flow, I get an event from art (so I start in the first row) and work across the columns computing the match quality for each library entry. Then I perform some reduce operation across the row to get the final PID value.
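The row-wise flow above (work across the columns, then reduce to the best matches) can be sketched in plain C++. Everything here is illustrative, not art or NOvA code: Match, the squared-difference metric, and bestMatches are stand-ins for whatever the real matcher uses.

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Illustrative stand-in: one cell of the matrix, i.e. the match quality
// between the current trial event and one library entry.
struct Match { int libIndex; double score; };

// For one trial event (one row), compute a score against every library
// entry (every column), then apply the per-row "reduce": keep only the
// nKeep best (lowest-score) matches.
std::vector<Match> bestMatches(const std::vector<double>& trialHits,
                               const std::vector<std::vector<double>>& library,
                               std::size_t nKeep)
{
  std::vector<Match> matches;
  matches.reserve(library.size());
  for (std::size_t i = 0; i < library.size(); ++i) {
    // Toy match metric: summed squared charge difference per channel.
    double score = 0;
    for (std::size_t c = 0; c < trialHits.size(); ++c)
      score += std::pow(trialHits[c] - library[i][c], 2);
    matches.push_back({static_cast<int>(i), score});
  }
  const std::size_t n = std::min(nKeep, matches.size());
  std::partial_sort(matches.begin(), matches.begin() + n, matches.end(),
                    [](const Match& a, const Match& b) { return a.score < b.score; });
  matches.resize(n);
  return matches;
}
```

Restricting the `library` argument to a 10% slice turns this directly into the per-job "vertical slice" computation described below.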

The solution on MINOS to not being able to hold the whole row's worth of MC library in memory at once was actually to scan the matrix in the other order. Hold all the input events at once and load library from disk one at a time and compare to all the input. That is, go down the columns of the matrix instead of across the rows. This is kind of a hack, and really doesn't scale well, or have nice properties for combination with any other processing and so on.

The solution proposed here (I discussed this at length with Ryan, who was heavily involved in the MINOS implementation) is to have multiple jobs, each computing one vertical slice of the matrix. So each job loads in, say, 10% of the library and matches all the input it's given, one event at a time. The output is a condensed representation of the matches (the "reduce" operation is basically just to keep the best N). Then a separate job takes all these outputs and merges them together to give the same result as if we'd run it all at once. This solution is basically optimal in terms of resources. Even if the full library won't fit in memory at all, no matter: just subdivide it further and run, say, 10 jobs of 5% of the library each, followed by the other 10. Each library event gets read from disk only once, and each comparison is made only once. Trial "data" events are read as many times as there are jobs, but there are far fewer of these than there are library events, and matching is relatively time-consuming (O(seconds) right now; MINOS got up to O(minutes)), so that doesn't really matter.

OK, so that's the physics problem, and solution. Here's a possible implementation:

We take a raw data file and run one art job on it; for NOvA this chain is something like:
daq2rawdigit, calhit, slicer, tracker, presel
giving us a file, say presel.root

Now we run, say, 10 separate jobs on presel.root, each of which runs just one module:
pid_matcher
And each is told in its fcl configuration to read in a different 10% of the library. Each is configured to write to a different output file, say pid_matches_[0-9].root. The output is in the form vector<PIDDetails>, one for each preselected interaction in the art event. Each PIDDetails is essentially the list of the N best matches. Probably I would use a "drop" command in the output to keep only the PIDDetails, so as to save space.
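The "drop" command mentioned here would live in the output module configuration; a rough sketch is below. The module label pidmatcher and file name are assumptions (art module labels cannot contain underscores, hence not pid_matcher), and the keep/drop patterns follow art's class_label_instance_process branch naming.

```fcl
outputs: {
  out1: {
    module_type: RootOutput
    fileName: "pid_matches_0.root"
    # Drop everything, then keep only the products made by the matcher module.
    outputCommands: [ "drop *", "keep *_pidmatcher_*_*" ]
  }
}
```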

So far, all of this is easily possible within art. But now we need to merge all the match lists together to find what the best matches would have been if we hadn't had to split the library.

Somehow I have a new module, PIDMerger. The input it's given on the command line is the presel.root from above, with the full event details. In the fcl I specify "pid_matches_*.root". The effect of this module is simply to pass through all the Events from presel.root but, for each one, collect the vector<PIDDetails> from the pid_matches files and put a vector<vector<PIDDetails> > in the stream. We know that event indices and so on will line up, because everything ran over the same original file and there's no filtering etc. If anything unexpected happens, we can have the whole thing abort.

Finally (likely in the same job) I would have another module (PIDFinalizer?) that actually massaged the vector<vector<PIDDetails> > into the final PID value. Fundamentally this consists of keeping the very best matches from any of the inner vector<PIDDetails> and forming simple combinations of their properties.
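The PIDFinalizer step described above could look like the following sketch: flatten the per-slice lists and re-apply the best-N cut. PIDDetails here is a stand-in struct, not the real NOvA class, and finalizeMatches is a hypothetical name.

```cpp
#include <algorithm>
#include <vector>

// Stand-in for the real NOvA product; only the match quality matters here.
struct PIDDetails { int libIndex; double score; };

// Merge the per-slice best-N lists so the result equals what a single job
// run against the full library would produce: flatten, sort by match
// quality, and cut back to the N best overall.
std::vector<PIDDetails> finalizeMatches(
    const std::vector<std::vector<PIDDetails>>& slices, std::size_t nKeep)
{
  std::vector<PIDDetails> all;
  for (const auto& s : slices) all.insert(all.end(), s.begin(), s.end());
  std::sort(all.begin(), all.end(),
            [](const PIDDetails& a, const PIDDetails& b) { return a.score < b.score; });
  if (all.size() > nKeep) all.resize(nKeep);
  return all;
}
```

This gives exactly the unsplit result because each slice's best-N list is guaranteed to contain every entry of that slice that belongs in the global best N.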

The PIDMerger module is the only novel thing here, and I've deliberately kept its operation very simple. One could attempt to fold PIDFinalizer into it, but I'm not sure how good an idea that is. The way this is currently constructed, one could easily get a quick taste of a file by running a single pid_matcher in the same job as PIDMerger etc. This would give you a less sensitive PID based on a smaller library, but all in a single job, without having to keep track of the split/gather operation.

Let me know if any of this is unclear. I can talk more by phone/email etc.

Thanks - Chris

#5 Updated by Christopher Green about 7 years ago

  • Status changed from Accepted to Rejected

NOvA has decided on an alternative approach to satisfy the need that precipitated this request.
