Problems with Run and SubRun objects

Note for "artists": in this write-up, I use Run in place of RunPrincipal, SubRun in place of SubRunPrincipal, and Event instead of EventPrincipal, because that is the terminology used by all our users.

Run and SubRun objects are currently similar to Event objects in that they are containers of almost-arbitrary experiment-defined types. But the Run and SubRun differ from the Event in that the Event has no "begin" and "end" time; it merely has a time at which it was collected.

The fact that Run and SubRun objects represent a span of time (or equivalently, a span of Events) causes trouble because when a DAQ starts a run (or subrun) it knows the starting time, but not the ending time. There is also other information regarding the run (or subrun) that is known only when it is over -- such as the number of events it contains, or the integrated luminosity or number of protons on target it represents. Experimenters may want to put an object into a Run or SubRun representing a summary of some quality of the events in that interval, and the summary can't be complete until the interval is complete.

Run and SubRun objects also differ from Event objects in that an Event object cannot be spread across multiple files, while a Run or SubRun can. It is thus possible to have already written a Run object to an output file, and closed that output file, before the "end time" of the run is known. More generally, both Run objects and SubRun objects may be written to output files in an incomplete state, because the information needed to "complete" them is not available until after the file in question has been written and closed.

A possible solution to this problem is to replace the Run and SubRun by a RunFragment and a SubRunFragment, where each of these represents the portion of a run (or subrun) represented in the file that contains the RunFragment (or SubRunFragment).
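As a rough illustration of the idea, a fragment could record only what is known about the run within one file. The sketch below is hypothetical: the name RunFragment comes from the proposal above, but every field (run_number, file_name, n_events, first_event, last_event) and the add_event method are assumptions made for illustration, not an existing framework interface.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RunFragment:
    """Portion of one run that is stored in one file (hypothetical layout)."""
    run_number: int
    file_name: str
    n_events: int = 0
    first_event: Optional[int] = None  # lowest event number seen in this file
    last_event: Optional[int] = None   # highest event number seen in this file

    def add_event(self, event_number: int) -> None:
        """Update the fragment as each event is written to the file."""
        self.n_events += 1
        if self.first_event is None or event_number < self.first_event:
            self.first_event = event_number
        if self.last_event is None or event_number > self.last_event:
            self.last_event = event_number


# The fragment only describes this file, so it is complete once the file
# is closed, even if the run itself continues in other files.
frag = RunFragment(run_number=1001, file_name="run1001_part0.root")
for ev in (1, 2, 3):
    frag.add_event(ev)
```

The point of the design is that nothing in the fragment depends on knowledge from outside its own file.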

This change alone is not a complete solution, because we still have to figure out what to do about cases in which we want to merge M input files into 1 output file, or split 1 input file into N output files. The general case of turning M input files into N output files appears infeasible to handle.

Comments from Mark Messier

I thought about these issues in FMWK and my "solution" was to punt completely on these concepts and instruct experimenters to put metadata into their database and store it separately from the event files. The issues are annoyingly complex and to my mind don't suggest an obvious behavior, which is often a signal to me that there is something wrong with the concept. I think whatever one decides for these, the behavior will end up surprising someone.

One note on the discussion on the wiki: the discussion there assumes that a file will contain data from only one run. I think this assumption is incorrect. Imagine the final nue sample at our far detector. If we enforced this on that data sample we would have ~100 files, all with ~1 event per file. It is much more efficient to put all these events into a single file.

My recommendation is to keep these run and subrun blocks as simple as possible and limit the framework information to only that which can be known definitely at creation time: run number, subrun number, start run time, start subrun time. I would leave it to the experimenters to add any additional summary information they might want "by hand" at whatever stage of processing the experimenters deem reasonable.
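A minimal sketch of such a record, assuming Python dataclasses purely for illustration: the class name SubRunRecord, the field names, and the use of floating-point seconds for times are all assumptions, not something the framework defines.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SubRunRecord:
    """Framework-level information that is known when the subrun starts."""
    run: int
    subrun: int
    run_start_time: float     # seconds since the epoch (assumed representation)
    subrun_start_time: float  # seconds since the epoch (assumed representation)


# Two copies of the same subrun, e.g. as read back from two different files.
rec = SubRunRecord(run=1001, subrun=3,
                   run_start_time=1.30e9, subrun_start_time=1.30e9 + 600.0)
same = SubRunRecord(run=1001, subrun=3,
                    run_start_time=1.30e9, subrun_start_time=1.30e9 + 600.0)
```

Because the record is frozen and holds nothing that changes after creation, two copies of the same subrun compare equal wherever they were written.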

As for file splitting / merging: again, this problem is not as simple as it might seem at first. What is the expected behavior when two files containing events from the same subrun are merged? What is the expected behavior when you create a file containing events from multiple runs and subruns? What if that file only contains events from a few select subruns from the entire run? Suppose one has a final event sample for some analysis where a few hundred events are selected from a few thousand runs. Is it reasonable to expect to be able to (e.g.) compute the full exposure for the sample from that final data file? If yes, then one needs to be able to loop over all the run information, even for those runs which did not happen to produce signal events. If the answer is "no", then how much good does this metadata concept do you anyhow?
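To make the exposure question concrete, here is a small sketch with made-up numbers. The dictionary layout and the protons-on-target values are invented for illustration only.

```python
# Hypothetical exposure (protons on target) for every subrun that was processed.
pot_by_subrun = {
    (1001, 1): 2.0e17,
    (1001, 2): 3.0e17,
    (1002, 1): 1.0e17,
}

# Subruns that contributed at least one selected event to the final sample.
subruns_with_events = {(1001, 2)}

# Exposure summed over every subrun that was looked at.
full_exposure = sum(pot_by_subrun.values())

# Exposure summed only over subruns that still have events in the file.
naive_exposure = sum(
    pot for key, pot in pot_by_subrun.items() if key in subruns_with_events
)
```

With these numbers the naive sum counts only one of the three subruns, so it understates the exposure unless the records for the event-less subruns are also carried in the file.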

As for the questions about merging / splitting:

o I think that output files should contain complete run/subrun records from all files that were seen on input no matter if events from those runs/subruns actually end up in the output stream or not. Methods should be provided so that one could loop over all the run/subrun records in an "endJob" if the user wants to.

o In the case of merging, I think one should always do complete merges. If users have identical data products stored in the run/subrun tree in different files, the output file should merge these into one long vector of products on the run / subrun trees. If the framework information is kept only to what is known at creation time, there will not have to be any special handling: the information would be identical for two identical subruns. If one allows this information to become more complex, then many decisions have to be made. E.g., what if one run record has a different end time than another? Should I take the latest in the merge? Should I flag this as an error?

o Normally I'm a "let's get a prototype working and see where it goes" kind of guy, but this is one case where I think the issues need to be explored in advance. Once data is on disk, this stuff is locked in and any fixes will have to be evolutionary, resulting in tail bones and other needless complexity.
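The merge behavior described in the bullets above can be sketched as follows. This is only an illustration under assumed data layouts: merge_subrun_trees is a hypothetical function, each file is modeled as a plain dictionary, and raising an error on a start-time mismatch is one possible policy, not a decision taken in the text.

```python
from collections import defaultdict


def merge_subrun_trees(files):
    """Merge per-file subrun records into one record per (run, subrun).

    Each element of `files` is a dict mapping (run, subrun) to a pair
    (start_time, products), where products is a list of user data products.
    """
    start_times = {}
    products = defaultdict(list)
    for records in files:
        for key, (start_time, prods) in records.items():
            # Creation-time information must agree between copies of a subrun.
            if start_times.setdefault(key, start_time) != start_time:
                raise ValueError(f"conflicting start time for run/subrun {key}")
            # User products are concatenated into one long list, never dropped.
            products[key].extend(prods)
    return {key: (start_times[key], products[key]) for key in start_times}


# Subrun (1001, 1) appears in both files; (1001, 2) carries no products at all.
file_a = {(1001, 1): (100.0, ["pot=5"]), (1001, 2): (200.0, [])}
file_b = {(1001, 1): (100.0, ["pot=7"])}
merged = merge_subrun_trees([file_a, file_b])
```

Every subrun seen on input survives in the output, including the one with no products, and the two product lists for the shared subrun end up as one list.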

Comments from Andrew Norman

When we structured the raw data we discussed this same issue extensively. What we came up with was the following structure that we think addressed the major issues.

First off we embedded in each “event” enough information about its parent run that the event could live on its own and reconstruct any required metadata or conditions data via a database lookup based on run/subrun. We also allowed for the concept of a “universal event number”, whereby each event has an absolutely unique event number which could again be used to determine which set of conditions the event was taken under.

The reason we did this was to allow for events to be skimmed out of their original raw files (which do impose a run/subrun hierarchy) and joined together in a merged file without loss of information.

The advantage we saw in embedding the run/subrun information in the data files was that, by wrapping the events in a “run/subrun block”, we were giving an analysis program a cue that all the events in this block would use the same (or nearly the same) conditions data. This would then allow the analysis to cache conditions data more efficiently.

In the extreme case where all the events come from the same run/subrun, only a single database query needs to be made near the start of the job. This is obviously a good thing, and potentially a large performance factor.

In the other extreme, where a skim has only a single event per run, the information that was embedded in the event is completely redundant with the run/subrun information. A new query for conditions data needs to be made for every event anyway, and any caching that the analysis system does is wasted.

In an ideal world, the framework would understand data files both with and without run/subrun information. If the run information was present, then it would trigger some sort of caching mechanism. If the run information were missing, it would default to making queries from information embedded in the events.
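A minimal sketch of that behavior, under stated assumptions: the function process, the (block_id, events) stream layout, and the query callable standing in for a conditions database lookup are all hypothetical names introduced here, not part of any existing framework.

```python
def process(blocks, query):
    """Attach conditions data to each event in a stream of blocks.

    `blocks` is a list of (block_id, events) pairs. block_id is a
    (run, subrun) tuple when the file has run/subrun blocks, or None when
    it does not. Each event is a dict carrying its own 'run' and 'subrun'.
    `query(run, subrun)` stands in for a conditions database lookup.
    Returns the per-event results and the number of lookups made.
    """
    results = []
    n_queries = 0
    for block_id, events in blocks:
        if block_id is not None:
            # Block present: one lookup is shared by every event in the block.
            conditions = query(*block_id)
            n_queries += 1
            for event in events:
                results.append((event["id"], conditions))
        else:
            # No block: fall back to the information embedded in each event.
            for event in events:
                conditions = query(event["run"], event["subrun"])
                n_queries += 1
                results.append((event["id"], conditions))
    return results, n_queries


def fake_query(run, subrun):
    """Stand-in for a database call; returns a label instead of real data."""
    return f"cond-{run}-{subrun}"


raw = [((1001, 1), [{"id": i, "run": 1001, "subrun": 1} for i in range(3)])]
skim = [(None, [{"id": 7, "run": 1001, "subrun": 1},
                {"id": 9, "run": 1002, "subrun": 4}])]

raw_results, raw_queries = process(raw, fake_query)
skim_results, skim_queries = process(skim, fake_query)
```

Here the raw-style file needs one lookup for its three events, while the skim needs one lookup per event, which matches the two extremes described above.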

So that is what we were thinking when we initially designed this. Let me know if you want more clarifications or have questions.