Relational model and data formats

In a discussion Marc had with Adam Lyon (of g-2), Adam expressed a wish for easier access, in an interactive session using the R language and environment, to event data written by art.

The art data model was originally designed to support writing "ntuple"-type data directly as event data products. This has not been very successful in CMS -- there is little, if any, direct use of event data files from Root. Instead, CMS has defined an alternative framework, usable from the Root/CINT prompt, that provides some of the facilities of the offline framework; in particular, it provides the ability to navigate the structures that relate data products. At least two problems have prevented the intended direct use of the Root/EDM data files:

  1. Many of the event data objects defined by CMS are complicated, involving inheritance hierarchies, containers that perform type erasure (called by CMS "polymorphic containers"), and embedded navigational elements. This violates the intended usage, in which event data objects were to be essentially "dumb data"; behavior was to be added by wrapping the dumb data in a class that references it and provides the necessary interface.
  2. The event data model's persistency, as implemented using Root, writes one entry per event. Some ntuple usage requires writing an entry per track, or jet, or cluster, etc. No single level of aggregation is correct for all uses; analysis requirements are too varied.
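
The two points above can be illustrated with a minimal sketch. It is written in Python purely for illustration and is not art or CMS code: the names Track, TrackView, Event and per_track_rows are hypothetical. It shows a "dumb data" record, a wrapper class that adds behavior without changing what is stored, and the flattening needed to turn one-entry-per-event storage into one row per track.

```python
import math
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Track:
    """Hypothetical "dumb data" record: values only, no behavior."""
    pt: float   # transverse momentum
    eta: float  # pseudorapidity


class TrackView:
    """Hypothetical wrapper: references a Track and adds behavior to it."""

    def __init__(self, track: Track) -> None:
        self._track = track

    def pz(self) -> float:
        # Longitudinal momentum derived from the stored values:
        # pz = pt * sinh(eta).
        return self._track.pt * math.sinh(self._track.eta)


@dataclass(frozen=True)
class Event:
    """Hypothetical per-event record, matching one stored entry per event."""
    event_id: int
    tracks: List[Track]


def per_track_rows(events: List[Event]) -> List[Tuple[int, int, float, float]]:
    """Flatten one-entry-per-event records into one row per track."""
    return [
        (ev.event_id, index, trk.pt, trk.eta)
        for ev in events
        for index, trk in enumerate(ev.tracks)
    ]


events = [
    Event(1, [Track(10.5, 0.1), Track(3.2, -1.4)]),
    Event(2, []),  # an event with no tracks contributes no rows
    Event(3, [Track(7.0, 2.0)]),
]
rows = per_track_rows(events)
```

Flattening to one row per jet or per cluster would follow the same pattern with a different inner collection, which is why no single stored granularity suits every analysis.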

The result seems to be that the goal of providing a single storage format that is suitable for all uses has not been achieved.

The ability to access event data directly from R, or in general from tools other than framework-aware C++ code, would be enhanced by sticking to the original design plans: event data objects should be essentially "dumb data", and associations between the data products should be made externally to the objects, in a manner like the relational model. As of art release 1.0, we have introduced the Assns class template and the "smart query" templates FindOne and FindMany to help users perform relational "join" operations. This moves our data storage closer to the relational model.
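
The idea of keeping associations outside the objects can be sketched as follows. This is a Python analogy only, not the art API: the C++ interfaces of Assns, FindOne and FindMany are not reproduced here, and the names track_hit_assns, find_many and find_one, along with their semantics, are assumptions made for illustration. The two collections hold plain records that know nothing about each other, and a separate table of index pairs plays the role of the association.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple, TypeVar

T = TypeVar("T")


@dataclass(frozen=True)
class Track:
    pt: float


@dataclass(frozen=True)
class Hit:
    charge: float


# Two independent collections of plain records; neither refers to the other.
tracks = [Track(10.5), Track(3.2), Track(7.0)]
hits = [Hit(1.0), Hit(2.5), Hit(0.7)]

# The association lives in its own table of (track index, hit index) pairs.
track_hit_assns: List[Tuple[int, int]] = [(0, 0), (0, 1), (1, 2)]


def find_many(assns: Sequence[Tuple[int, int]], left_index: int,
              right: Sequence[T]) -> List[T]:
    """Return every right-hand record associated with left_index (a join)."""
    return [right[right_index]
            for left, right_index in assns
            if left == left_index]


def find_one(assns: Sequence[Tuple[int, int]], left_index: int,
             right: Sequence[T]) -> Optional[T]:
    """Return the first associated record, or None if there is none."""
    partners = find_many(assns, left_index, right)
    return partners[0] if partners else None


hits_of_first_track = find_many(track_hit_assns, 0, hits)
hit_of_second_track = find_one(track_hit_assns, 1, hits)
hit_of_third_track = find_one(track_hit_assns, 2, hits)  # no association
```

Because the association table holds only indices, a tool such as R could read the two collections and the table as three ordinary tables and perform the join itself.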

A tool that allows specification of data products in a manner consistent with the goal of "dumb data", and which provides the serialization and deserialization that we currently obtain through Root, would seem to be of great value. One possible tool is Google Protocol Buffers.
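
What such a tool would need to provide can be sketched with a small round trip. The sketch uses Python's standard struct module as a stand-in; it is not Protocol Buffers and does not use its API or wire format. The Cluster record, its fields and the byte layout are hypothetical. The point is only that a record with plain values and a declared layout can be written and read back without any framework code.

```python
import struct
from dataclasses import astuple, dataclass


@dataclass(frozen=True)
class Cluster:
    """Hypothetical "dumb data" product: plain values, no behavior."""
    cluster_id: int
    energy: float
    n_cells: int


# Assumed fixed layout: little-endian int32, float64, int32, with no padding.
_LAYOUT = struct.Struct("<idi")


def serialize(cluster: Cluster) -> bytes:
    """Pack the record's fields into bytes in declaration order."""
    return _LAYOUT.pack(*astuple(cluster))


def deserialize(payload: bytes) -> Cluster:
    """Rebuild a record from bytes produced by serialize()."""
    return Cluster(*_LAYOUT.unpack(payload))


original = Cluster(cluster_id=7, energy=12.25, n_cells=9)
payload = serialize(original)
restored = deserialize(payload)
```

A schema-driven tool would generate the equivalent of the layout and the two functions from a single description of the data product, for each language that needs to read the data.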