New SAM metadata schema for IF experiments¶
The current SAM metadata schema has some disadvantages, some of which have become apparent with use in Run II, and some come from trying to cram Neutrino physics concepts into a database designed for collider data. Additionally, the dimensions language is flexible for querying the metadata, but has some problems of its own.
- The current schema has no place for certain concepts, most notably subruns. Different ways have been used to get around this - encoding the run and subrun into the existing run field for Minerva and using the partition field for NOvA. The former complicates queries even more; the latter doesn't allow you to associate multiple subruns with a single file.
- The general parameters are flexible, but are limited to string values only - it's not possible to query values as numbers (so no numeric ranges, for example).
- The generated SQL queries can be inefficient in some cases. This is likely to be an issue with any complicated generalized query generator. On the other hand, everything seems like a small problem compared to D0, so this may not be a problem in practice.
- The dimension implementation is poorly specified, has some surprising precedence rules (minus tends to trip people up), and it's possible with a complicated enough query to push the parser code into using 100% CPU for tens of minutes.
- Adding new features to the dimensions is nearly impossible (nobody understands the dimension and query generator code, or wants to).
We can replace (or augment) the existing metadata with new tables which match the requirements of the new experiments better. Rather than change the existing SAM database schema it would be better to put the new tables into a separate schema. Some of the existing SAM tables can be continued in use (the physical file data, the application data, parentage, etc). Keeping the existing data_files table will mean that no changes to the sam station code will be needed. Changing the metadata like this will require replacing the metadata declaration, retrieval, and querying all at once. Maintaining backwards compatibility would be difficult, so this would likely have to be a complete replacement (some backwards compatibility might be possible - for example, automatically translating metadata in the old declaration format into the new tables).
There should be no changes to the behaviour for the Run II experiments (and for Minerva if they prefer to stick with what they are currently using).
An important question is what facilities are needed to query the new metadata schema. One solution would be to restrict queries to "fill in the blanks" pre-written queries. However, although this is likely to be sufficient for most purposes, it's unlikely to be sufficiently flexible for all. Advanced users could be allowed to write SQL queries directly, which provides maximum power, but this requires knowledge of both SQL and the schema.
An alternative would be to implement a new version of the dimensions language. Instead of a poorly defined handwritten parser we could use a proper grammar with a parser generator. And the sql generator could be replaced by something more flexible and maintainable (perhaps using a library such as Sqlalchemy to abstract away at least some of the sql generation). Since exact compatibility wouldn't be retained, It would be possible to fix some of the deficiencies of the current implementation.
- Do we (meaning the experiments) need a new metadata schema at all?
- Should any replacement try to cover all possible experiments or should we customize it specifically for individual experiments?
- What querying facilities are required? Do we need the flexibility of the dimensions language or its equivalent?