Using Data Sets

Often we want to analyze not a single file, but a group of files that share some set of characteristics. This could be the set of files that are from a certain run, fall in a given time range, or share some other characteristic, such as all containing events from a specific trigger. In general terms we refer to this collection of files as a "data set", and it is convenient to give data sets symbolic (and descriptive) names which can then be used by others when performing similar analyses.


As an example, suppose I want to analyze the set of all raw files which meet the following criteria:

  1. Are from the time range 2011-01-28 to 2011-10-12
  2. Contain NuMI trigger events
  3. Are not marked as "bad"
  4. Are not from short runs (runs with only a single subrun)
  5. Are not from runs with fewer than 10,000 events

As of today (2011-10-12) there are 6824 files which match this selection. They have a total file size of 429.29 GB and a total event count of 45,228,486 events.
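
The selection above amounts to a filter over per-file metadata. As a minimal sketch, assuming each file is described by a record with hypothetical field names (`trigger`, `is_bad`, `subruns_in_run`, `events_in_run` and so on are invented for illustration and are not the actual SAM metadata fields), the five criteria and the summary totals could be written in Python as:

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class FileRecord:
    # All field names here are hypothetical, chosen for this illustration only.
    name: str
    start_date: date
    trigger: str
    is_bad: bool
    subruns_in_run: int
    events_in_run: int
    size_bytes: int
    event_count: int


def in_selection(f: FileRecord) -> bool:
    """Apply the five selection criteria from the list above to one file."""
    return (
        date(2011, 1, 28) <= f.start_date <= date(2011, 10, 12)  # 1. time range
        and f.trigger == "NuMI"                                   # 2. NuMI trigger
        and not f.is_bad                                          # 3. not marked "bad"
        and f.subruns_in_run > 1                                  # 4. not a short run
        and f.events_in_run >= 10_000                             # 5. run has >= 10,000 events
    )


def summarize(files):
    """Count the selected files and total their sizes and event counts."""
    selected = [f for f in files if in_selection(f)]
    return {
        "n_files": len(selected),
        "total_bytes": sum(f.size_bytes for f in selected),
        "total_events": sum(f.event_count for f in selected),
    }
```

In practice the data handling system evaluates the selection for you; the sketch only shows that the "data set" is fully determined by the criteria, not by a hand-maintained list of file names.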

This is a nice "data set", but tracking 6824 different file names could be a problem. Instead I want to give it a symbolic name so that I don't have to remember exactly which files are in the set, and so that I can pass you a simple, easy-to-remember name instead of a list of almost seven thousand files.

In this case I called this set: "numi_triggers_ndos_28JAN11-12OCT11_norman"

It's a bit wordy, but still easier to deal with than seven thousand files.

Speaking of files...

Data File Locations

As the experiment takes data, it's not practical to store every single data file (raw and offline processed) on the Fermilab central disk services (for NDOS we could do this, but it doesn't scale). Instead files are typically stored on a combination of:

  • Local Disk (i.e. the disk of the node where you are running your job)
  • Central Disk (BlueArc)
  • Cache Disk
  • Archival Tape

The location of a file may change over time as it migrates between different locations (e.g. it is loaded from tape to a fast cache disk, or moves off of BlueArc and onto tape for long-term storage). Keeping track of this by hand would be a nightmare, and at any given time you may not know where your files really are.

Using a data handling service like SAM, this bookkeeping is handled for you.

You simply define your data set, and SAM finds your files and delivers them to you on whatever disk you need them to be on.

Defining SAM Data Sets

SAM data sets are defined as a database query (often a complicated one) that is run whenever the user requests the data set. This means that some of your data sets will appear completely static, while others can change dynamically based on the data that has been taken. For example, if you define a data set that is all files since 01Jan2011, that data set will grow, while a set based on a limited range, say 02Jan2011-05May2011, will remain static.
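
The difference between a growing and a static data set can be illustrated with a small sketch. It assumes nothing about how SAM is implemented: the "definition" here is just a stored Python predicate, the catalog is an in-memory dictionary, and the file names and dates are made up. The point is only that the definition is re-evaluated each time it is resolved.

```python
from datetime import date


def make_definition(start, end=None):
    """Return a data set definition as a predicate over file dates.

    end=None means the range is open-ended ("all files since start").
    """
    def matches(file_date):
        return file_date >= start and (end is None or file_date <= end)
    return matches


def resolve(definition, catalog):
    """Evaluate the definition against the catalog as it exists right now."""
    return sorted(name for name, d in catalog.items() if definition(d))


# Hypothetical catalog: file name -> date the file was taken.
catalog = {"f1": date(2011, 1, 15), "f2": date(2011, 3, 10)}

open_ended = make_definition(date(2011, 1, 1))               # all files since 01Jan2011
closed = make_definition(date(2011, 1, 2), date(2011, 5, 5))  # 02Jan2011-05May2011

before = (resolve(open_ended, catalog), resolve(closed, catalog))
catalog["f3"] = date(2011, 9, 1)  # new data is taken later in the year
after = (resolve(open_ended, catalog), resolve(closed, catalog))
```

After the new file is added, the open-ended definition resolves to three files while the closed range still resolves to the original two, even though neither definition was edited.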

To define a data set in NOvA, there is a simple web interface that will let you construct your dataset, test it, save it, or recall it.

To reach the DEMO version of this interface go to:

You will see a page that looks like:

When working with a data set definition, start by selecting your "data tier", which is the type of data files you are interested in. Currently we have data tiers defined for the raw data, processed (rootified) data, Monte Carlo data, pedestal calibration files and DAQ log files.

Next select either a date range or a run number that you are interested in. You can use multiple date ranges, run numbers or subrun numbers. Each time you add one just make sure to include some logical operation that specifies how you want it selected (i.e. AND or OR). As you build your data set definition it is displayed in the text box near the bottom of the page.
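
To make the idea of joining criteria with logical operations concrete, here is a small sketch of how such a definition string might be assembled. Everything in it is an assumption for illustration: the `DefinitionBuilder` class, the `any_of` helper, the keywords `data_tier` and `run_number`, and the run numbers are all hypothetical, and the resulting text is not claimed to be the syntax the web interface or SAM actually uses.

```python
def any_of(*clauses):
    """Group alternatives in parentheses so OR binds before the outer AND."""
    return "(" + " OR ".join(clauses) + ")"


class DefinitionBuilder:
    """Accumulates selection clauses joined by AND/OR (illustrative only)."""

    def __init__(self, data_tier):
        # Start from the data tier, as the interface asks you to do first.
        self._parts = [f"data_tier {data_tier}"]

    def add(self, operator, clause):
        """Append one clause, preceded by the logical operator that joins it."""
        operator = operator.upper()
        if operator not in ("AND", "OR"):
            raise ValueError("operator must be AND or OR")
        self._parts.append(f"{operator} {clause}")
        return self  # allow chained calls

    def text(self):
        """Render the definition built so far as a single string."""
        return " ".join(self._parts)


definition = (
    DefinitionBuilder("raw")
    .add("and", any_of("run_number 11000", "run_number 11001"))
    .text()
)
```

Grouping the alternatives explicitly avoids relying on operator precedence when AND and OR are mixed in one definition, which is worth checking whenever a test query returns more or fewer files than expected.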

You can additionally select various other criteria including the trigger stream, number of events in a file, the actual file size or any other metadata that is recorded for the class of files that you are interested in.

At any time you can test your data set definition by submitting your query.


One of the features of our raw and processed data files is that they embed a large amount of "metadata" that describes the files themselves, how they were generated, and other auxiliary bits of information that can be useful in understanding what is in a file or how it should be grouped with other files.