
CAFAna overview

The CAFAna framework provides a collection of classes allowing easy plotting and fitting of oscillated spectra in the NOvA experiment. It is intended to be equally applicable to the numu and nue analyses, and any other oscillation analysis that may be performed. This note provides a bottom-up description of the available capabilities.

Are you starting here? Stop, take a look at CAFAna resources, particularly the tutorial video, then come back. Everything should make much more sense.

Introduction

The CAFAna framework aims to be easy to understand and use. This note provides a walk-through of the basic functionality, building up from simple components towards performing a full oscillation analysis. It is not intended to be comprehensive documentation of every class and function; there is often in-line documentation, which can be browsed on Doxygen. There are a few analysis macros checked into SVN which you can examine, and the number will grow.

Some aspects of CAFAna are still in flux. The most prominent of these are marked in the relevant sections.

Philosophy

The design of CAFAna is based on experience with the MINOS analysis frameworks NCUtils and NtupleUtils. Lesson one: don't let multiple incompatible frameworks proliferate; it makes it very difficult to perform a joint analysis. CAFAna is intended from the start to be equally applicable to any oscillation analysis, all of which share the same core concepts.

CAFAna uses the Common Analysis Format (CAF) files as its input medium. As such, it is not dependent on the ART framework. This is a deliberate separation of concerns. Any non-trivial reconstruction happens on the ART side, and the results are stored in the CAFs. CAFAna is only concerned with plotting and analyzing (extrapolating, oscillating, etc.) these pre-existing variables.

CAFAna aims not to be monolithic and constraining. Instead, the goal is to produce a collection of useful components that can easily be "snapped together" to form an analysis. There are many different things that one may want to do, and so it's natural that some of the logic should live in the analysis macros.

Two lessons have been learned about performance. First: any algorithm that needs to go event-by-event repeatedly is going to be slow, and will get slower as more Monte Carlo statistics are added. For this reason, everything in CAFAna is histogram-based. The input files are read exactly once to produce the histograms, and everything else is done in terms of those, so performance is independent of statistics. Second: the ROOT histogram classes are convenient (and are used extensively in CAFAna), but they can be fairly inefficient for calculation. For this reason histograms are, as far as possible, encapsulated inside other classes, so that various critical operations can be optimized. If it ever becomes necessary, the implementation can be replaced with something more efficient, in principle without affecting analyses built on these components at all.

The final lesson is that systematics are one of the hardest aspects to get right. It's very easy to create non-smooth behaviour that requires extensive baby-sitting of MINUIT. It's also easy to get very bad complexity, with run-time rising rapidly as systematics are added. For these reasons, fits are routinely done without systematics until it's almost time to open the box, at which point predictable chaos ensues. I don't yet know how well CAFAna will handle these issues, but it's something for analysis groups to remain aware of.

cafe

To run a CAFAna macro, or to simply inspect a CAF file, use the cafe executable. The help text is quite useful:

$ cafe --help
** NOvA Common Analysis Format Executor **
usage: cafe [-h] [-b] [-q] [-bq] [-s N] [-o M]
            [--gdb | --totalview | --valgrind]
            script.C [args [args ...]]

positional arguments:
  script.C          the root script to run. Append a + to force recompilation
  args              optional arguments passed to script main function

optional arguments:
  -h, --help        show this help message and exit
  -b, --batch       batch mode, no graphics output
  -q, --quit        quit at end of job
  -bq               shorthand for -b -q

stride and offset:
  -s N, --stride N  use only every Nth file of datasets
  -o M, --offset M  skip first M files of dataset (for use with -s)
  -ss, --snapshot   use latest snapshot instead of requerying

debugging options:
  --gdb             run root under gdb
  --totalview       run root under totalview
  --valgrind        run root under valgrind
  --prof            run with google perftools profiler

Alternatively you may specify a single .root file to open it in ROOT with CAF
libraries loaded. Or no arguments to just start ROOT with libraries loaded.

Fundamentally, cafe just ensures that the correct shared libraries are loaded before starting ROOT with the specified macro. The --gdb and --valgrind options are useful for debugging. --stride and --offset are discussed in CAFAna on the grid.

In addition, cafe allows you to pass arguments to your macro UNIX-style on the command line. It deduces the type of each argument, and escapes and quotes them appropriately as required by ROOT.

So, for example, you can run

$ cafe mymacro.C 1 2 foo bar

which leads to the function call mymacro(1, 2, "foo", "bar").

All CAFAna macros must be compiled by ACLiC, and cannot be run in CINT mode. This is enforced by cafe. CINT has many limitations in its handling of the C++ language, which can lead to strange bugs. Also, it can't parse some of the CAFAna header files.

Spectrum

A Spectrum consists of a histogram representing the number of events per bin, together with a POT value recording the exposure that those event counts represent.

For plotting, a Spectrum can be converted into a TH1, scaled to some particular exposure. So for a data/MC comparison you would plot the data at its own exposure, and the MC scaled to the data's POT.

When constructing a spectrum you can provide an axis label and whatever binning scheme you like.
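For example, a minimal sketch of the data/MC plotting pattern just described. Here sData and sMC are assumed to be already-filled Spectrums (filling is covered under SpectrumLoader below), and it is assumed that ToTH1() takes a target exposure and POT() returns the spectrum's own:

// Data drawn at its own exposure; MC scaled to the data's POT
TH1* hData = sData.ToTH1(sData.POT());
TH1* hMC   = sMC.ToTH1(sData.POT());
hData->Draw("ep");
hMC->Draw("hist same");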

HistAxis

This trivial class collects together a Var, a binning scheme, and a label. It allows this associated information to be propagated around the code as a unit more easily.

2-dimensional histograms

There is a constructor for Spectrum that takes two variables and binnings. Internally, the data is represented as a 1D histogram using Var2D (see Var2D), but you can retrieve the 2D information with the ToTH2() function.
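A sketch of the two-variable usage (the argument order here is a guess, and binsE, binsPID, kRecoE, kPID, and kCut are placeholders; see the SpectrumLoader section for the loading pattern):

// Hypothetical argument order for the two-variable constructor
Spectrum s2d("Reco E (GeV)", binsE,   kRecoE,
             "PID score",    binsPID, kPID,
             loader, kCut);
// ... after loader.Go() ...
TH2* h2 = s2d.ToTH2(pot); // recover the 2D view, scaled to the given POT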

Ratio

The Ratio class is constructed from two Spectrums. It automatically takes care of any POT difference between the two. Ratios may be combined with each other, and with Spectrums, using the standard multiplication and division operators (multiplying or dividing two Spectrums directly doesn't make sense).
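A sketch, reusing the placeholder spectra from above (ToTH1() on a Ratio is assumed by analogy with Spectrum):

Ratio ratio(sData, sMC);          // e.g. data/MC; POT differences handled internally
TH1* hRatio = ratio.ToTH1();
hRatio->Draw("ep");

Spectrum corrected = sMC * ratio; // Ratios rescale Spectrums via operator*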

OscillatableSpectrum

An OscillatableSpectrum is similar to a Spectrum, but with a second dimension representing true energy. With the use of an OscCalculator this allows a 1D spectrum to be produced at any oscillation parameters. OscCalculatorPMNSOpt is usually the right choice, being faster than the others and equally precise.

The function OscillatableSpectrum::Oscillated() operates by weighting each bin of the histogram by the calculated oscillation probability, and then summing the totals onto the user axis, yielding a one-dimensional Spectrum.

For efficiency, the oscillation probabilities can be held in a precalculated OscCurve object.
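For instance, producing a prediction at a given set of parameters might look like this (a sketch; the calculator's setter names and the PDG flavor codes passed to Oscillated() are assumptions):

osc::OscCalculatorPMNSOpt calc;
calc.SetL(810);          // baseline in km (illustrative)
calc.SetDmsq32(2.4e-3);  // illustrative parameter values
calc.SetTh23(M_PI/4);

// Oscillate the numu -> nue component (PDG codes 14 and 12)
Spectrum pred = oscSpec.Oscillated(&calc, 14, 12);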

The true energy binning of an OscillatableSpectrum is predefined (in Binning.cxx). The true energy bins are equispaced in 1/E. This gives a finer binning at low energies, where the variation of the oscillation probability is more rapid. (The oscillation probabilities go as trig(1/E), so the variation between each bin is of the same order.)

Technically OscillatableSpectrum is a specialization of ReweightableSpectrum, the more general, but less common, class in which the second axis is customizable.

SpectrumLoader

The SpectrumLoader class is responsible for reading in StandardRecord objects from one or more CAF files, and filling spectra with their contents. The constructor takes a list of filenames, or a wildcard for the files to use.

Individual Spectrum or OscillatableSpectrum objects can be registered with the loader. Or, more usually, the loader is passed to the constructor of the spectrum, which then registers itself appropriately.

When all spectra have been registered, calling the Go() function of the loader will cause it to loop through the input file(s), extracting variable values and filling the histograms. The total POT in the input files is also accumulated and used to set the POT of the spectra.

The advantage of this scheme is that it allows multiple histograms to be filled with only one pass through the input CAF file(s). Reading files in from disk can be a major bottleneck, so it's important that it be kept efficient.

It usually only makes sense to fill a spectrum from files of the same type, so a normal analysis macro will create two or more loaders, one for nonswap and one for fluxswap MC. If ND or data files are used, those are additional loaders too, each responsible for filling a subset of the spectra.

Note: The SpectrumLoader refers to the spectra via pointers, so you mustn't relocate them between the time they are registered and the call to Go(). (The most likely way to run into this is holding them in a vector that reallocates its elements as more are added.)
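Putting the pieces together, a typical loading pattern might look like this (a sketch; the file path and the kRecoE, kPID, and kCut names are placeholders, and Binning::Simple is an assumed helper):

SpectrumLoader loader("/path/to/cafs/*.root"); // wildcard of CAF files

const Binning bins = Binning::Simple(40, 0, 10);
Spectrum sE("Reconstructed energy (GeV)", bins, loader, kRecoE, kCut); // registers itself
Spectrum sPID("PID score", Binning::Simple(50, 0, 1), loader, kPID, kCut);

loader.Go(); // one pass over the files fills both spectra and accumulates the POT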

Loaders

The number of SpectrumLoaders required for an analysis can get quite large (ND/FD, data/MC, swap/nonswap etc). The Loaders class holds all the combinations to make them easy to pass around. If you create it with a directory, it will attempt to find CAFs following standard naming conventions. You can also manually set the location of any CAF.

Cut and Var

When filling a histogram, it's necessary to know what to fill it with. Often this is a simple variable from the CAF files but, especially for selection cuts, it may be a more complex combination.

At heart, a Var is a function that is given a StandardRecord object and returns a number. For a simple case this might just be the value of one of the fields, but one can also combine variables and take decisions in any way.

A Cut is the same, but returns a boolean indicating whether or not the event should be used.

You pass a Var, and optionally a Cut, when registering a Spectrum.

In practice, in addition to this function, a Cut or Var also has a list of strings naming the fields from the CAF file that it requires. Any field that is not named will be left uninitialized in the StandardRecord passed to the function. This is done to speed up file loading: most analyses use only a fraction of the available CAF variables, so significant time can be saved by not retrieving the unnecessary ones from disk. StandardRecord objects default uninitialized fields to NaN, so failing to name a required field should lead to a floating-point exception, enabling you to find the problem.

Examples of how to declare a new Cut or Var can be found in Cuts.h and Vars.h, using each of the three styles: global function, class, and lambda function. Generally useful variables should be added there; analysis-specific ones can be defined within the analysis macro where they are used. A lambda-style sketch is shown below.
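As an illustration (the branch and field names here are hypothetical; the list of strings names the CAF branches the function reads, as described above):

// Lambda style; "slc.calE" and "sel.contained" are hypothetical branches
const Var kRecoE({"slc.calE"},
                 [](const caf::StandardRecord* sr)
                 { return sr->slc.calE; });

const Cut kContained({"sel.contained"},
                     [](const caf::StandardRecord* sr)
                     { return sr->sel.contained > 0; });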

Var2D

This is a special variable used in support of 2D histograms. It takes two Vars in its constructor, plus binning information. When evaluated, it returns a bin number for a 1D spectrum that encodes the position in 2D space. For simple plotting uses, the two-variable constructor of Spectrum should be sufficient. But to do a 2D fit, or for other more elaborate uses, knowledge of the underlying mechanism may be useful.

Representing 2D spectra as one-dimensional internally has the advantage that no other code needs to change to support all the possible operations. For example, OscillatableSpectrum still manages a 2D spectrum internally, instead of the 3D one that a direct implementation would require.

Overloaded operators

Cut objects work with the usual logical operators (&&, ||, !) as you'd expect, allowing you to make simple combinations without needing to define a whole new Cut object.

In addition, you can combine a Var with the < or > operators to form a Cut.
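For example (kIsCosmic is a placeholder; kContained and kRecoE are the sketches from the Cut and Var section):

const Cut kSignal = kContained && !kIsCosmic; // combine existing Cuts
const Cut kHighE  = kRecoE > 2.0;             // comparing a Var to a number yields a Cut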

Prediction

In order to do an oscillation analysis, you need a method to produce an oscillated prediction to compare to the data. The IPrediction interface defines the functions required by an oscillation fitter. A prediction subclass is required to be able to produce a Spectrum corresponding to the expectation at any oscillation parameters, for comparison against data; and to be able to produce any component (NC, CC, etc) of that, for use in plotting.

PredictionNoExtrap

The most basic prediction is PredictionNoExtrap. This takes two SpectrumLoaders (nonswap and fluxswap), a label, binning definition, Var, and Cut. It holds OscillatableSpectrum objects corresponding to each true CC component (nue/numu, neutrino/antineutrino, appearance/survival) and a Spectrum representing the NC component. When asked to predict the spectrum, it simply oscillates all of these components and returns the sum.

PredictionExtrap

In addition to instructions on what variable to predict, and the loader that will enable it to predict cosmic rays, this class takes a pointer to an IExtrap. This is an extrapolation method, predicting what will be seen at the Far Detector before oscillations.

Extrapolation

The TrivialExtrap class simply assumes the FD MC needs no modification. It's in fact how PredictionNoExtrap is implemented.

A more sophisticated and general extrapolation is performed by ModularExtrap. This is a big enough topic to be described in its own page. The short version is that this class can provide you with a Far Detector prediction, with the extrapolation of the different oscillation components treated however you wish, given a best-guess of the various components of the spectrum in the Near Detector. Obtaining these ND components is known as "decomposition". Technically, that step is performed by a class deriving from IDecomp.

Decomposition

A decomposition is a method of decomposing the observed Near Detector spectrum into the three components numu CC, NC, and beam nue CC. Since these three oscillate differently they must be extrapolated independently.

BasicDecomp completely ignores the Near Detector data and simply returns the MC spectra.

MRCC Decomposition

TODO maybe too much detail for main document. Let's talk about MRCC on another page?

MRCC CAF files only contain a reconstruction variable tree for MRCC events. MRCC decomposition, however, requires access to the parent events that MRCC events were constructed from, in order to make numu CC selection cuts on them. The parent.mrccpar branch in MRCC CAF files is filled with information about the parent slice, the most important of which is the slice index (sliceidx) of the parent slice that the MRCC event maps best to. The scanning of the MRCC and parent files in parallel is facilitated by MRCCLoader.

MRCCLoader

This is a parallel to SpectrumLoader. Instead of opening one set of files, it opens two: the MRCC set and the parent file set, and allows cuts to be made on both streams. The loader uses the header information (run, subrun, evt and subevt numbers) in the parent file to find the parent slice that matches each MRCC slice. MRCCLoader assumes both files are ordered consistently, so that the full parent tree does not need to be scanned for a match to every MRCC entry.

MRCCDecomp

The MRCC decomposition has two constructors: one requires only the MC input files, used to construct the NC/MRCC ratio; the other additionally requires the data files on which the decomposition is to be performed. Using MRCCLoader, numu CC selection cuts are applied to the parent events. The events so obtained are subject to a combination of cuts, all of which are inputs to the constructors. The binning and variable in which the decomposition is performed are also constructor inputs.

The class contains accessor functions that return the NC and MRCC input spectra; the decomposed NC, numu CC, and nue spectra; and the NC-to-MRCC Ratio object used in the decomposition.

MichelDecomp

This is the other decomposition currently available, based on the differing numbers of Michel electron candidates expected from the different event classes.

Experiment

The Experiment interface only requires a single function, one that takes an oscillation calculator representing some set of oscillation parameters, and returns a chisq (or log-likelihood).

SingleSampleExperiment is the basic building block of most analyses. This is constructed with a Prediction object and an observed data Spectrum. The chisq function calls Predict() on the Prediction to get an expected spectrum, and then calculates the log-likelihood with the data.

MultiExperiment simply takes a list of Experiments and sums up their chisq values. This is useful in the case of multiple runs or sub-samples of the data, to which you want to perform a joint fit.
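A sketch of such a joint setup (the exact constructor signatures are assumptions; predFHC/predRHC are IPredictions and dataFHC/dataRHC are observed Spectrums):

SingleSampleExperiment exptFHC(&predFHC, dataFHC);
SingleSampleExperiment exptRHC(&predRHC, dataRHC);

// The MultiExperiment's chisq is the sum over its children
MultiExperiment expt({&exptFHC, &exptRHC});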

Other Experiments are possible that don't use the preceding parts of the framework. For example ReactorExperiment is a very simple parabola in sin^2(2theta13) to enable the inclusion of reactor constraints into an oscillation fit.

Fitter

The Fitter object is used to drive MINUIT. The constructor is given an Experiment and a list of oscillation parameters and systematic shifts (see Section Systematics) to determine. Calling the Fit() function with an OscCalculator set to some seed values will attempt to minimize the chisq with respect to the chosen oscillation parameters, while the other parameters are held fixed. The OscCalculator is updated to reflect the best-fit values of the chosen parameters. The return value is the chisq of the best-fit point.
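A sketch of the fitting step (the FitVar names here are examples of the kind defined in FitVars.h):

// Minimize the chisq over theta23 and Dmsq32, holding everything else fixed
Fitter fit(&expt, {&kFitSinSqTheta23, &kFitDmSq32});

osc::OscCalculatorPMNSOpt calc;
// ... set seed values on calc here ...
double chisq = fit.Fit(&calc); // calc is updated to the best-fit point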

FitVar

A FitVar defines an interface to an OscCalculator, allowing parameter values to be set or retrieved. This allows fitting for complex parameters such as sin^2(theta23), where the value has to be converted (here to theta23) before passing to the calculator. A FitVar also contains the name of the parameter for labelling axes. If a parameter is set beyond its physical limit, the FitVar will set it to the boundary. The Penalty() function enables the Fitter to know this has occurred, and drives the minimization back towards the physical region.

If you need a variable that is not already included, simply add it to FitVars.h.

Surface

A Surface is used to calculate a chisq surface as a function of two parameters, and to draw confidence intervals. The constructor takes an Experiment, an OscCalculator defining the values of oscillation parameters not to be displayed, FitVars defining the axes, and binning information. You can also pass a list of oscillation and systematic parameters to be marginalized over.

The Surface performs a scan over the parameters, recording the chisq value at each point in the 2D space. It also performs a minimization, using Fitter, seeded at the lowest grid point, in order to find the true minimum. The surface is normalized as a delta chisq from this point.

The Surface comes with functions to draw the best fit point and contours. The significance level to draw is provided in the form of a surface describing the critical parameter value. There are functions to provide these trivial surfaces for the case of Gaussian statistics.

In one dimension, the Slice() function (in Fit.h) provides a similar service.
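A sketch of a typical contour plot (the FitVar names and the Gaussian critical-value helper are assumptions based on the description above):

Surface surf(&expt, &calc,
             &kFitSinSq2Theta13, 30, 0, 0.4,           // x axis: FitVar, bins, range
             &kFitDelta,         30, 0, 2*TMath::Pi()); // y axis
surf.DrawContour(Gaussian90Percent2D(surf), kSolid, kBlue); // 90% C.L. contour
surf.DrawBestFit(kBlue);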

NumuSurface

This class facilitates making and drawing numu-style fits where both hierarchies are considered simultaneously and plotted on two adjacent axes. The difference over simply having two Surfaces is that the minimum chisq point must be shared between them. Also, setup of the fit and layout of the final plots are simplified.

Plots

One goal of CAFAna is to be able to make high quality plots by default, for example suitable for inclusion in a talk. There is quite a lot of logic within the Surface class dedicated to making contour plots look good: for example, converting the contours to TGraphs so that contours drawn with dashed lines look correct, and picking suitable binning so that contours behave correctly when they intersect the boundaries of the plot.

In addition, there is a file Plots.h, containing plotting utility functions. DataMCComparison(), for example, ensures that the y-axis of such a plot extends to cover the full range of both histograms, not just whichever is drawn first.

In future, further functions and objects should be added here, to aid in the construction of a broad range of different plots. Any standard analysis plot should be easy to make attractively.

Systematics

This is a big enough topic to deserve a page of its own.

Analyses

The nue analysis

The implementation of the nue analysis is very simple. Analysis preselection cuts may be applied by a Cut object, and then either a PID or calorimetric energy used as the binning variable. The defined FitVars allow the usual contours to be made in sin^2(2theta13), delta, and sin^2(theta23).

For evaluating any sensitivity gain from two-dimensional fits, the Var2D function may be used. This simply produces a Var that reads two variables from the StandardRecord and "flattens" them into a one-dimensional spectrum. This allows a two-dimensional fit to be carried out without any other part of the code having to make any special accommodations.

To perform a joint fit to FHC and RHC data, a MultiExperiment can be used, having as its two children the SingleSampleExperiments representing the two running conditions.

The numu analysis

The numu analysis is somewhat more complex, currently having three separate spectra: contained quasi-elastic, contained non-quasi-elastic, and uncontained, each with their own selectors and energy estimators.

Cuts and Vars representing all of these exist, and the three sub-analyses are again combined in a MultiExperiment to extract the final contour. The NumuAnalysis class provides a handy way to construct all the necessary spectra for a fake-data study.

Serialization

Often one of the longest parts of an analysis is waiting for the spectra to load from the input files. The serialization mechanism allows you to do this once up-front and then tweak your analysis without having to redo the loading.

The global template functions SaveTo()/LoadFrom() and SaveToFile()/LoadFromFile() facilitate this. Once you've called Go() on your loaders, you can save the results to a file and later retrieve them immediately. Normally you just need to save your top-level Experiment object, all its components will be saved recursively.
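A sketch of the round trip (the function signatures are assumptions based on the description above):

// After the slow Go() step, write the full experiment to a file...
SaveToFile(expt, "saved_state.root", "expt");

// ...then in a later session restore it without rereading the CAFs
auto expt2 = LoadFromFile<SingleSampleExperiment>("saved_state.root", "expt");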

The storage format is relatively human-readable, which might be useful for various debugging or non-standard plotting purposes.

Miscellaneous

EventList

The function MakeEventListFile() will go through a set of input files (specified by wildcard) and make an output file containing the list of events that pass a given Cut. This list can then be passed to other tools to make a file for scanning etc.

FileReducer

This class can do two related, but different, things. First, it can create an output file from an input file keeping only those events that either pass a Cut, or are specified in a file listing the desired events. Second, it can transfer only the specified Vars to the output file, filling all other fields with NaN. This allows the output file to be much smaller (NaNs compress well), but sadly doesn't cause it to load much faster.

Tips

You will probably find that everything runs around 2x faster if you use an optimized build:

export SRT_QUAL=maxopt; srt_setup -a