Organization of metadata

General considerations.

Specific, limited proposal for a user database (issue #2397).

Discussion points on the existing proposal:

  1. Names must convey the very limited nature of (e.g.) propagation semantics and be more general. Proposal: PerFileWhiteboard, (Owner, Name, Value).
  2. The hardest part of the problem is barely mentioned here: providing correct per-file access to this DB (for input and output files), as there is currently no model for by-file access from within a module (which may be in multiple paths). This will need in-depth discussion.
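
To make the (Owner, Name, Value) proposal concrete, here is a minimal sketch of what such a whiteboard could look like. Everything in it is an assumption for illustration: the use of SQLite, the table name, the method names, and the example entries are not part of the proposal, and the per-file access problem from point 2 is not addressed.

```python
import sqlite3


class PerFileWhiteboard:
    """Hypothetical per-file store of (owner, name, value) entries."""

    def __init__(self, path=":memory:"):
        # One database per file; ":memory:" stands in for the in-file DB.
        self._db = sqlite3.connect(path)
        self._db.execute(
            "CREATE TABLE IF NOT EXISTS whiteboard ("
            " owner TEXT NOT NULL,"
            " name  TEXT NOT NULL,"
            " value TEXT,"
            " PRIMARY KEY (owner, name))"
        )

    def put(self, owner, name, value):
        # Last write wins for a given (owner, name) pair; values are kept as text.
        self._db.execute(
            "INSERT OR REPLACE INTO whiteboard VALUES (?, ?, ?)",
            (owner, name, str(value)),
        )
        self._db.commit()

    def get(self, owner, name):
        row = self._db.execute(
            "SELECT value FROM whiteboard WHERE owner = ? AND name = ?",
            (owner, name),
        ).fetchone()
        return None if row is None else row[0]

    def entries(self):
        return self._db.execute(
            "SELECT owner, name, value FROM whiteboard ORDER BY owner, name"
        ).fetchall()


board = PerFileWhiteboard()
board.put("data_handling", "example_key", "example_value")
board.put("user", "example_count", 42)
```

There is deliberately no propagation logic here: the whiteboard only stores and returns entries, which matches the "very limited nature" the name is meant to convey.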


Looks like we are headed towards three different DB uses in an application:

  1. art metadata using the global in-memory DB concept
  2. SAM or data handling specific, which wants one DB per file
  3. user free-for-all DB, perhaps one per file, but without the additional structure or access patterns of the previous ones

Looks like they are tackling two problems at once in their proposal:

  1. file specific for use in SAM
  2. user summary data for things that span runs

These appear to be separate problems. Summary data for datasets stored in each file (not automatic) sounds like it needs to be better understood. We discussed having a meeting next week to talk through summary data scenarios (use and generation), to help us determine a design for this.

Rob indicated that the current need for category/key/value and the proposal for experiment-specific data look to be an implementation of a particular workflow from D0/CDF or other experiments.

The interface and use of the DB that they propose forces the experiment code to correctly propagate summary data from input to multiple output files.

Everyone is interested to know how CMS handles this dataset-level summary object storage.

What will the command-line query tool look like? It needs to be very simple, such as CSV for all data.
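
As a rough idea of how simple such a tool could be, the sketch below dumps one table as CSV. It assumes the in-file DB can be opened as an SQLite database, which has not been decided; the function name, table name, and sample row are hypothetical.

```python
import csv
import io
import sqlite3


def dump_table_as_csv(db, table, out):
    """Write every row of `table` to `out` as CSV, header line first."""
    # Table names cannot be bound as SQL parameters, so check against the schema.
    known = {row[0] for row in db.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'")}
    if table not in known:
        raise ValueError(f"no such table: {table}")
    cursor = db.execute(f'SELECT * FROM "{table}"')
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow([column[0] for column in cursor.description])
    writer.writerows(cursor.fetchall())


# Stand-in for a file's database, with one hypothetical summary table.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE summary (owner TEXT, name TEXT, value TEXT)")
db.execute("INSERT INTO summary VALUES ('user', 'n_events', '42')")

buffer = io.StringIO()
dump_table_as_csv(db, "summary", buffer)
csv_text = buffer.getvalue()
```

A real tool would write to standard output and take the file and table names as arguments; the `StringIO` buffer is used here only to keep the sketch self-contained.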

The actual API proposed in the document assumes that it is a general facility, and it is not.

A general policy and API are needed for accessing DBs and for handling multiple DBs from input and output.

One of the big issues to work out is the need for callbacks to notify user code that an input file DB is available, or that an output DB is available (new or closing). One possibility is to put this functionality into an analyzer, utilizing the event selection mechanisms that are available to the output module. Some of the main things they want to do are to transfer specific numbers from the input to the output stream (set of trigger bits), and to write data that depends on the output stream (set of trigger paths or bits). The analyzer interface would allow for this. I'm not sure what the meaning of the event is in this case; perhaps per-event data is kept until a file open/close happens, when the DB is "committed" to the file.
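
The callback idea can be sketched as follows. This is not the art analyzer interface: the class, the three hook names, and the use of plain dicts as stand-ins for the file DBs are all assumptions made for illustration.

```python
class TriggerBitCarrier:
    """Hypothetical summary producer: carries trigger bits from input to output."""

    def __init__(self):
        self.pending_bits = set()
        self.events_seen = 0

    def on_input_file_opened(self, input_db):
        # Read what the input file's DB recorded and hold it for the output.
        self.pending_bits |= set(input_db.get("trigger_bits", ()))

    def on_event(self, event):
        # Per-event data is accumulated in memory until a file boundary.
        self.events_seen += 1

    def on_output_file_closing(self, output_db):
        # "Commit" the accumulated state to the output file's DB, then reset.
        output_db["trigger_bits"] = sorted(self.pending_bits)
        output_db["events_seen"] = self.events_seen
        self.pending_bits.clear()
        self.events_seen = 0


carrier = TriggerBitCarrier()
carrier.on_input_file_opened({"trigger_bits": [1, 4]})
carrier.on_input_file_opened({"trigger_bits": [4, 7]})
for event in range(3):
    carrier.on_event(event)

output_db = {}
carrier.on_output_file_closing(output_db)
```

In this sketch two input files feed one output file, and the output DB receives the union of their trigger bits. Whether a union is the right merge rule is exactly the kind of propagation question the experiment code would have to answer.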

Organization within art discussion

Here is an initial list of requirements that we should impose on the system.

(1) Each file has a database associated with it.
The database is available through the FileBlock object.

(2) Products that are scalars, or tabular in structure (arrays of numbers or histogram-like) should be handled automatically.
They should be easily read and written from/to the database using an API that we provide.
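
One possible shape for such an API, as a sketch only: it assumes an SQLite-backed database, stores every column as REAL, and treats a scalar as a one-row, one-column table. The function names and the example products are hypothetical.

```python
import sqlite3


def write_table(db, name, columns, rows):
    """Store a tabular product as its own table (all columns REAL in this sketch)."""
    # Names are interpolated into SQL, so accept plain identifiers only.
    if not name.isidentifier() or not all(c.isidentifier() for c in columns):
        raise ValueError("names must be plain identifiers")
    column_defs = ", ".join(f"{c} REAL" for c in columns)
    placeholders = ", ".join("?" for _ in columns)
    db.execute(f"CREATE TABLE {name} ({column_defs})")
    db.executemany(f"INSERT INTO {name} VALUES ({placeholders})", rows)


def read_table(db, name):
    """Return all rows of a stored product as a list of tuples."""
    if not name.isidentifier():
        raise ValueError("names must be plain identifiers")
    return db.execute(f"SELECT * FROM {name}").fetchall()


db = sqlite3.connect(":memory:")

# A histogram-like product: bin edges and contents.
write_table(db, "energy_hist", ["bin_low", "bin_high", "content"],
            [(0.0, 1.0, 12.0), (1.0, 2.0, 7.0)])

# A scalar product, stored as a one-row table.
write_table(db, "total_exposure", ["value"], [(3.5,)])
```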

(3) Each object added to the database will need some metadata associated with it.
This is likely to be at the dataset level, i.e. what bigger physics problem is this file part of? The standard attribute set has not yet been determined. Examples might be production pass, process name, user information, pset id, etc.

(4) Code that collects summary data should be tied to an output module.
The main reason for this is that an output module's configuration is very relevant to the summary object-producing code. The summary objects produced must be matched to the output stream and represent the data that is in the stream (or set of files). The summary object code must also know when the output module is performing actions on its files (open/close, for example).

(5) Summary object producer code must see all events, and be aware of the path information.
The path results (from the trigger results object) explain if a path rejected or accepted an event, or if the event failed to be processed. This information can be used to build summary objects that depend on whether or not events passed named upstream paths.
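
A small sketch of how path results could feed a summary object. The three status labels and the dict-per-event representation are assumptions; the real trigger results product encodes this differently.

```python
from collections import Counter

# Hypothetical status labels for a path's outcome on one event.
ACCEPTED, REJECTED, FAILED = "accepted", "rejected", "failed"


def tally_paths(events):
    """Count (path, status) pairs over all events, whatever their outcome."""
    counts = Counter()
    for path_results in events:
        for path, status in path_results.items():
            counts[(path, status)] += 1
    return counts


def count_passing(events, path):
    """Number of events accepted by the named upstream path."""
    return sum(1 for results in events if results.get(path) == ACCEPTED)


# Each event is represented only by its per-path results.
events = [
    {"trig_a": ACCEPTED, "trig_b": REJECTED},
    {"trig_a": ACCEPTED, "trig_b": FAILED},
    {"trig_a": REJECTED, "trig_b": ACCEPTED},
]
counts = tally_paths(events)
```

The tally only works if the producer sees every event, including rejected and failed ones, which is the point of this requirement.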

(6) The event and product selection information that is used by the summary object producer code must be the same as that used by the output module that it is associated with.

(7) The trigger results product should be made available through the event interface directly.

(8) A summary object producer needs to be told when an associated file is being closed.
At this time, the accumulated information can be written to the in-file database.

(9) All summary object producers in a job need to be told when an input file is opened.
They can then read out relevant incoming database information that will need to be propagated to their associated output stream.

(10) Summary object producing code will only exist in an end_path.

(11) One must be able to query summary objects within a file database at the shell command line.

(12) One table per summary object will be generated?

Types of summary data needed by NOvA