The CAFAna framework provides a collection of classes allowing easy plotting and fitting of oscillated spectra in the NOvA experiment. It is intended to be equally applicable to the numu and nue analyses, and any other oscillation analysis that may be performed. This note provides a bottom-up description of the available capabilities.
Are you starting here? Stop: take a look at the CAFAna resources, particularly the tutorial video, then come back. Everything should make much more sense.
- Table of contents
- CAFAna overview
- Cut and Var
The CAFAna framework aims to be easy to understand and use. This note provides a walk-through of the basic functionality, building up from basic components towards performing a full oscillation analysis. It is not intended to be comprehensive documentation of every class and function. Often there is in-line documentation available to be browsed on doxygen. There are a few analysis macros checked into SVN which you can examine, and the number will grow.
Some aspects of CAFAna are still in flux. The most prominent of these are marked in the relevant sections.
The design of CAFAna is based on experience with the MINOS analysis frameworks NCUtils and NtupleUtils. Lesson one: don't let multiple incompatible frameworks propagate, since that makes it very difficult to perform a joint analysis. CAFAna is intended from the start to be equally applicable to any oscillation analysis, all of which share the same core concepts.
CAFAna uses the Common Analysis Format (CAF) files as its input medium. As such, it is not dependent on the ART framework. This is a deliberate separation of concerns. Any non-trivial reconstruction happens on the ART side, and the results are stored in the CAFs. CAFAna is only concerned with plotting and analyzing (extrapolating, oscillating, etc.) these pre-existing variables.
CAFAna aims not to be monolithic and constraining. Instead, the goal is to produce a collection of useful components, that can be easily "snapped together" to form an analysis. There are many different things that one may want to do, and so it's natural that some of the logic should live in the analysis macros.
Two lessons were learned about performance. First: any algorithm that needs to repeatedly go event-by-event is going to be slow, and will get slower as you add more Monte Carlo statistics. For this reason, everything in CAFAna is histogram-based. The input files are read exactly once to produce the histograms, and then everything else is done in terms of those. This means that performance is independent of statistics. Second: the ROOT histogram classes are convenient (and they are used extensively in CAFAna), but they can be fairly inefficient for calculation. For this reason histograms are, as far as possible, encapsulated inside other classes. This means that various critical operations can be optimized. If it ever becomes necessary, the implementation can be replaced with something more efficient, in principle without affecting analyses built on these components at all.
The final lesson is that systematics are one of the hardest aspects to get right. It's very easy to create non-smooth behaviour that requires extensive baby-sitting of MINUIT. It's also easy to get very bad complexity, with run-time rising rapidly as systematics are added. For these reasons, fits are often routinely done without systematics until it's almost time to open the box, at which point predictable chaos ensues. I don't yet know how well CAFAna will handle these issues, but it's something for analysis groups to remain aware of.
To run a CAFAna macro, or to simply inspect a CAF file, use the
cafe executable. The help text is quite useful:
$ cafe --help
** NOvA Common Analysis Format Executor **
usage: cafe [-h] [-b] [-q] [-bq] [-s N] [-o M] [--gdb | --totalview | --valgrind]
            script.C [args [args ...]]

positional arguments:
  script.C              the root script to run. Append a + to force recompilation
  args                  optional arguments passed to script main function

optional arguments:
  -h, --help            show this help message and exit
  -b, --batch           batch mode, no graphics output
  -q, --quit            quit at end of job
  -bq                   shorthand for -b -q

stride and offset:
  -s N, --stride N      use only every Nth file of datasets
  -o M, --offset M      skip first M files of dataset (for use with -s)
  -ss, --snapshot       use latest snapshot instead of requerying

debugging options:
  --gdb                 run root under gdb
  --totalview           run root under totalview
  --valgrind            run root under valgrind
  --prof                run with google perftools profiler

Alternatively you may specify a single .root file to open it in ROOT with CAF
libraries loaded. Or no arguments to just start ROOT with libraries loaded.
cafe just ensures that the correct shared libraries are loaded before starting ROOT with the specified macro. The --gdb, --totalview and --valgrind options are useful for debugging. The --stride and --offset options are discussed in CAFAna on the grid.
cafe allows you to pass arguments to your macro UNIX-style on the command line. It deduces the type of each argument, and escapes and quotes them appropriately as required by ROOT.
So, for example, you can run
$ cafe mymacro.C 1 2 foo bar
which leads to the function call
mymacro(1, 2, "foo", "bar").
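The kind of translation cafe performs can be sketched in plain C++. This is a hypothetical helper, not cafe's actual code: arguments that parse as numbers pass through unchanged, while anything else is quoted as a C string for ROOT.

```cpp
#include <cstdlib>
#include <string>
#include <vector>

// Hypothetical stand-in for cafe's argument handling: decide whether an
// argument looks numeric by trying to parse the whole string as a double.
bool LooksNumeric(const std::string& s)
{
  if(s.empty()) return false;
  char* end = 0;
  std::strtod(s.c_str(), &end);
  return end == s.c_str()+s.size();
}

// Assemble the ROOT function call, quoting non-numeric arguments.
std::string BuildRootCall(const std::string& macro,
                          const std::vector<std::string>& args)
{
  std::string call = macro + "(";
  for(unsigned int i = 0; i < args.size(); ++i){
    if(i) call += ", ";
    if(LooksNumeric(args[i])) call += args[i];
    else                      call += "\"" + args[i] + "\"";
  }
  return call + ")";
}
```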
All CAFAna macros must be compiled by ACLiC, and cannot be run in CINT mode. This is enforced by
cafe. CINT has many limitations in its handling of the C++ language, which can lead to strange bugs. Also, it can't parse some of the CAFAna header files.
A Spectrum consists of a histogram representing the number of events per bin, and a POT value holding the exposure that those event counts represent.
For plotting, a
Spectrum can be converted into a TH1, scaled to some particular exposure. So for a data/MC comparison you would plot the data at its own exposure, and the MC scaled to the data's POT.
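The exposure scaling involved can be sketched in plain C++ (the struct and function names here are hypothetical stand-ins, not CAFAna's real API):

```cpp
#include <vector>

// Hypothetical stand-in for Spectrum: bin contents plus the POT they represent.
struct SimpleSpectrum
{
  std::vector<double> counts;
  double pot;
};

// Scale the contents to a target exposure, as a Spectrum does when converted
// to a TH1 -- e.g. scaling MC to the POT of the data it is compared against.
std::vector<double> ToTargetPOT(const SimpleSpectrum& s, double targetPOT)
{
  std::vector<double> out = s.counts;
  for(unsigned int i = 0; i < out.size(); ++i) out[i] *= targetPOT/s.pot;
  return out;
}
```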
When constructing a spectrum you can provide an axis label and whatever binning scheme you like.
This trivial class collects together a
Var, a binning scheme, and a label. It allows this associated information to be propagated around the code as a unit more easily.
There is a constructor for
Spectrum that takes two variables and binnings. Internally, the data is represented as a 1D histogram using
Var2D (see Var2D), but you can retrieve the 2D information when needed.
The Ratio class is constructed using two
Spectrums. It automatically takes care of any POT difference between the two. Ratios may be combined with each other and with
Spectrums using the standard multiplication and division operators (multiplying/dividing two
Spectrums directly doesn't make sense).
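The way a ratio cancels the exposure difference can be shown with a minimal sketch (hypothetical function, assuming each spectrum is normalized to its own POT before dividing, as described above):

```cpp
#include <cmath>
#include <vector>

// Hypothetical sketch of Ratio's POT handling: normalize each spectrum to
// its own exposure first, so the resulting bin-by-bin ratio is
// exposure-independent.
std::vector<double> MakeRatio(const std::vector<double>& num, double numPOT,
                              const std::vector<double>& denom, double denomPOT)
{
  std::vector<double> r(num.size());
  for(unsigned int i = 0; i < num.size(); ++i)
    r[i] = (num[i]/numPOT) / (denom[i]/denomPOT);
  return r;
}
```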
OscillatableSpectrum is similar to a
Spectrum, but with a second dimension representing true energy. With the use of an
OscCalculator this allows a 1D spectrum to be produced at any oscillation parameters. OscCalculatorPMNSOpt is usually the right choice, being faster than the others and equally precise.
OscillatableSpectrum::Oscillated() operates by weighting each bin of the histogram by the calculated oscillation probability, and then summing the totals onto the user axis, yielding a one-dimensional Spectrum.
For efficiency, the oscillation probabilities can be held in a precalculated form and reused.
The true energy binning of an
OscillatableSpectrum is predefined (in
Binning.cxx). The true energy bins are equispaced in 1/E. This gives a finer binning at low energies, where the variation of the oscillation
probability is more rapid. (The oscillation probabilities go as trig(1/E) so the variation between each bin is of the same order).
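A sketch of how such bin edges could be computed (not the exact Binning.cxx code; the function name and bin count are illustrative):

```cpp
#include <cmath>
#include <vector>

// Sketch: n+1 true-energy bin edges equispaced in 1/E between emin and emax,
// giving finer bins at low energy where the oscillation probability varies
// fastest.
std::vector<double> InverseEnergyEdges(double emin, double emax, int nbins)
{
  std::vector<double> edges(nbins+1);
  const double invMax = 1/emin, invMin = 1/emax;
  for(int i = 0; i <= nbins; ++i){
    // Step uniformly in 1/E from 1/emin down to 1/emax
    const double inv = invMax + (invMin-invMax)*i/nbins;
    edges[i] = 1/inv;
  }
  return edges;
}
```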
OscillatableSpectrum is a specialization of
ReweightableSpectrum, the more general, but less common, class in which the second axis is customizable.
The SpectrumLoader class is responsible for reading in
StandardRecord objects from one or more CAF files, and filling spectra with their contents. The constructor takes a list of filenames, or a wildcard for the files to use.
Spectrum and OscillatableSpectrum objects can be registered with the loader. Or, more usually, the loader is passed to the constructor of the spectrum, which then registers itself appropriately.
When all spectra have been registered, calling the
Go() function of the loader will cause it to loop through the input file(s), extracting variable values and filling the histograms. The total POT in the input files is also accumulated and used to set the POT of the spectra.
The advantage of this scheme is that it allows multiple histograms to be filled with only one pass through the input CAF file(s). Reading files in from disk can be a major bottleneck, so it's important that it be kept efficient.
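The one-pass filling scheme can be illustrated with a toy sketch (all names here are hypothetical stand-ins, not CAFAna's real classes):

```cpp
#include <functional>
#include <vector>

// Toy record and registration, mirroring how a loader services every
// registered (cut, var, spectrum) triple in a single pass over the input.
struct Record { double energy; bool contained; };

struct Registration
{
  std::function<bool(const Record&)> cut;
  std::function<double(const Record&)> var;
  std::vector<double>* values; // filled spectrum (simplified to a list)
};

// One loop over the records fills every registration -- the analogue of
// SpectrumLoader::Go().
void Go(const std::vector<Record>& records, std::vector<Registration>& regs)
{
  for(unsigned int i = 0; i < records.size(); ++i)
    for(unsigned int j = 0; j < regs.size(); ++j)
      if(regs[j].cut(records[i]))
        regs[j].values->push_back(regs[j].var(records[i]));
}
```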
It usually only makes sense to fill a spectrum from files of the same type, so a normal analysis macro will create two or more loaders, one for nonswap and one for fluxswap MC. If ND or data files are used, those are additional loaders too, each responsible for filling a subset of the spectra.
SpectrumLoader refers to the spectra via pointers, so you mustn't relocate them between the time they are registered and the call to
Go(). (The most likely problem is to hold them in a vector that reallocates its elements due to subsequent additions to it).
The number of
SpectrumLoaders required for an analysis can get quite large (ND/FD, data/MC, swap/nonswap etc). The Loaders class holds all the combinations to make them easy to pass around. If you create it with a directory, it will attempt to find CAFs following standard naming conventions. You can also manually set the location of any CAF.
Cut and Var¶
When filling a histogram, it's necessary to know what to fill it with. Often this is a simple variable from the CAF files but, especially for selection cuts, it may be a more complex combination.
At heart, a
Var is a function that is given a
StandardRecord object and returns a number. For a simple case this might just be the value of one of the fields, but one can also combine variables and take decisions in any way.
A Cut is the same, but returns a boolean indicating whether or not the event should be used.
You pass a
Var, and optionally a
Cut, when registering a Spectrum with a loader.
In practice, in addition to this function, a
Var also has a list of strings, naming the fields from the CAF file that it requires. Any field that is not named will be uninitialized in the
StandardRecord that is passed to the function. This is also done to speed file loading. Most analyses use only a fraction of the available CAF variables, so significant time can be saved by not retrieving the unnecessary ones from disk.
StandardRecord objects default to NaN for uninitialized fields, so failing to name a required field should lead to a floating-point exception, enabling you to find the problem.
Examples of how to declare a new
Var can be found in
Vars.h, using each of the three styles: global function, class, and lambda-function. Generally useful variables should be added here, analysis specific ones can be defined within the analysis macro where they are used.
This is a special variable used in support of 2D histograms. It takes two
Vars in its constructor, plus binning information. When evaluated, it returns a bin number for a 1D spectrum that encodes the position in 2D
space. For simple plotting uses, the two-variable constructor in
Spectrum should be sufficient. But to do a 2D fit, or for other more elaborate uses, knowledge of the underlying mechanism may be useful.
Representing 2D spectra as one-dimensional internally has the advantage that no additional code needs to change to handle all the possible operations. For example, OscillatableSpectrum still manages a 2D spectrum internally, instead of the 3D one that a direct implementation would require.
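The flattening itself is simple. This sketch shows one plausible convention (the real code may order the axes differently), including how the 2D position can be recovered from the 1D bin index:

```cpp
// Map a 2D bin (ix, iy) on nx-by-ny axes to a single 1D bin index,
// Var2D-style. Here x varies fastest within each y row.
int FlattenBin(int ix, int iy, int nx)
{
  return iy*nx + ix;
}

// Recover the 2D position from the flattened index.
void UnflattenBin(int bin, int nx, int* ix, int* iy)
{
  *ix = bin % nx;
  *iy = bin / nx;
}
```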
Cut objects work with the usual logical operators (&&, ||, !) as you'd expect, allowing you to make simple combinations without needing to define a whole new Cut. In addition, you can combine a Var with comparison operators such as > to form a Cut.
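The operator-overloading pattern behind this can be sketched in a few lines. This is a hypothetical minimal implementation; the real Cut and Var classes also track the CAF branches they need, as described below.

```cpp
#include <functional>

// Toy record type for the sketch.
struct Rec { double pid; double energy; };

// A cut is a wrapped predicate...
struct MyCut
{
  std::function<bool(const Rec&)> pred;
  bool operator()(const Rec& r) const { return pred(r); }
};

// ...and logical operators return new cuts composed from the operands.
MyCut operator&&(const MyCut& a, const MyCut& b)
{
  return MyCut{[a, b](const Rec& r){ return a(r) && b(r); }};
}

MyCut operator!(const MyCut& a)
{
  return MyCut{[a](const Rec& r){ return !a(r); }};
}

// A var is a wrapped function returning a number.
struct MyVar
{
  std::function<double(const Rec&)> f;
  double operator()(const Rec& r) const { return f(r); }
};

// Comparing a var to a constant yields a cut.
MyCut operator>(const MyVar& v, double c)
{
  return MyCut{[v, c](const Rec& r){ return v(r) > c; }};
}
```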
In order to do an oscillation analysis, you need a method to produce an oscillated prediction to compare to the data. The
IPrediction interface defines the functions required by an oscillation fitter. A prediction subclass is required to be able to produce a
Spectrum corresponding to the expectation at any oscillation parameters, for comparison against data; and to be able to produce any component (NC, CC, etc) of that, for use in plotting.
The most basic prediction is
PredictionNoExtrap. This takes two
SpectrumLoaders (swap and unswapped), a label, a binning definition, a Var, and a Cut. It holds
OscillatableSpectrum objects corresponding to each true CC component (nue/numu, neutrino/antineutrino, appearance/survival) and a
Spectrum representing the NC component. When asked to predict the spectrum, it simply oscillates all of these components and returns the sum.
In addition to instructions on what variable to predict, and the loader that will enable it to predict cosmic rays, this class takes a pointer to an
IExtrap. This is an extrapolation method, predicting what will be seen at the Far Detector before oscillations. The TrivialExtrap class simply assumes the FD MC needs no modification. It's in fact how
PredictionNoExtrap is implemented.
A more sophisticated and general extrapolation is performed by ModularExtrap. This is a big enough topic to be described in its own page. The short version is that this class can provide you with a Far Detector prediction, with the extrapolation of the different oscillation components treated however you wish, given a best-guess of the various components of the spectrum in the Near Detector. Obtaining these ND components is known as "decomposition". Technically, that step is performed by a class deriving from the decomposition interface.
A decomposition is a method of decomposing the observed Near Detector spectrum into the three components numu CC, NC, and beam nue CC. Since these three oscillate differently they must be extrapolated independently.
BasicDecomp completely ignores the Near Detector data and simply returns the MC spectra.
TODO maybe too much detail for main document. Let's talk about MRCC on another page?
MRCC CAF files only contain a reconstruction variable tree for MRCC events. MRCC decomposition, however, requires access to the parent events that MRCC events were constructed from, in order to make numu CC selection cuts on them. The
parent.mrccpar branch in MRCC CAF files is filled with information about the parent slice, the most important of which is the slice index (
sliceidx) of the parent slice that the MRCC event maps best to. The scanning of the MRCC and parent files in parallel is facilitated by MRCCLoader.
This is a parallel to
SpectrumLoader. Instead of opening one set of files, it opens two sets of files, the MRCC set and the parent file set, and allows for cuts to be made on both streams. The loader uses the header information (
subevt numbers) in the parent file to find the parent slice that matches the MRCC slice. The
MRCCLoader assumes that everything is in order, so that the full parent tree does not need to be scanned to look for a match for every MRCC entry.
The MRCC decomposition has two constructors: one requires only the MC input files used to construct the NC/MRCC ratio; the other additionally requires the data files on which the decomposition is to be performed. Using MRCCLoader, numu CC selection cuts are applied to the parent event. The events so obtained are subject to a combination of cuts, all of which are inputs to the constructors. The binning and the variable that the decomposition is done in are also constructor inputs.
The class contains accessor functions that return the NC and MRCC input spectra; the decomposed NC, numu CC, and nue spectra; and the NC-to-MRCC Ratio object used in the decomposition.
This is the other decomposition currently available, based on the differing numbers of Michel electron candidates expected from the different event classes.
The Experiment interface requires only a single function: one that takes an oscillation calculator representing some set of oscillation parameters, and returns a chisq (or log-likelihood).
SingleSampleExperiment is the basic building block of most analyses. This is constructed with a
Prediction object and an observed data
Spectrum. The chisq function calls
Predict() on the
Prediction to get an expected spectrum, and then calculates the log-likelihood with the data.
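A binned Poisson log-likelihood chisq of this kind can be sketched as follows (the exact convention CAFAna uses may differ; the function name is illustrative):

```cpp
#include <cmath>
#include <vector>

// Sketch of a binned Poisson log-likelihood chisq between an expected and
// an observed spectrum: 2*sum_i [ e_i - o_i + o_i*ln(o_i/e_i) ].
// Bins with o_i = 0 contribute just 2*e_i.
double PoissonChisq(const std::vector<double>& expected,
                    const std::vector<double>& observed)
{
  double chisq = 0;
  for(unsigned int i = 0; i < expected.size(); ++i){
    chisq += 2*(expected[i]-observed[i]);
    if(observed[i] > 0)
      chisq += 2*observed[i]*std::log(observed[i]/expected[i]);
  }
  return chisq;
}
```

With this convention the chisq is zero for perfect agreement and grows as the prediction and data diverge.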
MultiExperiment simply takes a list of
Experiments and sums up their chisq values. This is useful in the case of multiple runs or sub-samples of the data, to which you want to perform a joint fit.
Experiments are possible that don't use the preceding parts of the framework. For example
ReactorExperiment is a very simple parabola in sin^2(2theta13) to enable the inclusion of reactor constraints into an oscillation fit.
A Fitter object is used to drive MINUIT. The constructor is given an
Experiment and a list of oscillation parameters and systematic shifts (see Section Systematics) to determine. Calling the
Fit() function with an
OscCalculator set to some seed values will attempt to minimize the chisq with respect to the chosen oscillation parameters, while the other parameters are held fixed. The
OscCalculator is updated to reflect the best-fit values of the chosen parameters. The return value is the chisq of the best-fit point.
A FitVar defines an interface to an
OscCalculator, allowing parameter values to be set or retrieved. This allows fitting for complex parameters such as sin^2(theta23), where the value has to be converted (here to theta23) before passing to the calculator. A
FitVar also contains the name of the parameter for labelling axes. If a parameter is set beyond its physical limit, the
FitVar will set it to the boundary. The
Penalty() function enables the
Fitter to know this has occurred, and drives the minimization back towards the physical region.
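One plausible scheme for this boundary handling can be sketched as follows. This is a hypothetical illustration of the idea, not CAFAna's actual code: the value handed to the calculator is clamped to the physical range, while the penalty reports how far outside the request was.

```cpp
// Hypothetical sketch of a fit parameter with physical limits.
struct ClampedParam
{
  double lo, hi;     // physical limits
  double requested;  // last value the fitter asked for

  // The value actually handed to the calculator, clamped to the boundary.
  double Clamped() const
  {
    if(requested < lo) return lo;
    if(requested > hi) return hi;
    return requested;
  }

  // Grows with distance outside the physical region, zero inside it,
  // driving the minimization back towards allowed values.
  double Penalty() const
  {
    const double d = requested - Clamped();
    return d*d;
  }
};
```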
If you need to fit or plot a parameter that is not already included, simply add a new FitVar alongside the existing ones.
A Surface is used to calculate a chisq surface as a function of two parameters, and to draw confidence intervals. The constructor takes an
OscCalculator defining the values of oscillation parameters not to be displayed,
FitVars defining the axes, and binning information. You can also pass a list of oscillation and systematic parameters to be marginalized over.
Surface performs a scan over the parameters, recording the chisq value at each point in the 2D space. It also performs a minimization, using
Fitter, seeded at the lowest grid point, in order to find the true minimum. The surface is normalized as a delta chisq from this point.
Surface comes with functions to draw the best fit point and contours. The significance level to draw is provided in the form of a surface describing the critical parameter value. There are functions to provide these trivial surfaces for the case of Gaussian statistics.
In one dimension, the
Slice() function (in
Fit.h) provides a similar service.
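The grid scan at the heart of a surface can be illustrated with a toy version (hypothetical function; the real Surface also runs a Fitter seeded at the lowest grid point to locate the true minimum):

```cpp
#include <algorithm>
#include <functional>
#include <vector>

// Toy sketch of a Surface-style scan: evaluate the chisq at every point of
// an nx-by-ny grid, then subtract the minimum so the result is a
// delta-chisq relative to the best point.
std::vector<std::vector<double> >
ScanSurface(const std::function<double(double, double)>& chisq,
            double x0, double x1, int nx,
            double y0, double y1, int ny)
{
  std::vector<std::vector<double> > surf(nx, std::vector<double>(ny));
  double best = chisq(x0, y0);
  for(int i = 0; i < nx; ++i){
    for(int j = 0; j < ny; ++j){
      const double x = x0 + (x1-x0)*i/(nx-1);
      const double y = y0 + (y1-y0)*j/(ny-1);
      surf[i][j] = chisq(x, y);
      best = std::min(best, surf[i][j]);
    }
  }
  // Normalize to a delta-chisq from the minimum
  for(int i = 0; i < nx; ++i)
    for(int j = 0; j < ny; ++j)
      surf[i][j] -= best;
  return surf;
}
```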
This class facilitates making and drawing numu-style fits, where both hierarchies are considered simultaneously and plotted on two adjacent axes. The difference over simply having two
Surfaces is that the minimum chisq point must be shared between them. Also, setup of the fit, and layout of the final plots is simplified.
One goal of CAFAna is to be able to make high quality plots by default, for example suitable for inclusion in a talk. There is quite a lot of logic within the
Surface class dedicated to making contour plots look good. For example, converting the contours to TGraphs so that contours drawn with dashed lines look correct, and picking suitable binning so that contours behave correctly when intersecting the boundaries of the plot.
In addition, there is a file
Plots.h, containing plotting utility functions.
DataMCComparison(), for example, ensures that the y-axis of such a plot extends to cover the full range of both histograms, not just whichever is drawn first.
In future, further functions and objects should be added here, to aid in the construction of a broad range of different plots. Any standard analysis plot should be easy to make attractively.
This is a big enough topic to deserve a page of its own.
The nue analysis¶
The implementation of the nue analysis is very simple. Analysis preselection cuts may be applied by a
Cut object, and then either a PID or calorimetric energy used as the binning variable. The defined
FitVars allow the usual contours to be made in sin^2(2theta13), delta, and sin^2(theta23).
For evaluating any sensitivity gain from two-dimensional fits, the
Var2D function may be used. This simply produces a
Var that reads two variables from the
StandardRecord and "flattens" them into a one-dimensional spectrum. This allows a two-dimensional fit to be carried out without any other part of the code having to make any special accommodations.
To perform a joint fit to FHC and RHC data, a
MultiExperiment can be used, having as its two children the
SingleSampleExperiments representing the two running conditions.
The numu analysis¶
The numu analysis is somewhat more complex, currently having three separate spectra: contained quasi-elastic, contained non-quasielastic, and uncontained, each with their own selectors and energy estimators.
Vars representing all of these exist, and the three sub-analyses are again combined in a
MultiExperiment to extract the final contour. The
NumuAnalysis class provides a handy way to construct all the necessary spectra for a fake-data study.
Often one of the longest parts of an analysis is waiting for the spectra to load from the input files. The serialization mechanism allows you to do this once up-front and then tweak your analysis without having to redo the loading.
The global template functions
LoadFromFile() facilitate this. Once you've called
Go() on your loaders, you can save the results to a file and later retrieve them immediately. Normally you just need to save your top-level
Experiment object, all its components will be saved recursively.
The storage format is relatively human-readable, which might be useful for various debugging or non-standard plotting purposes.
MakeEventListFile() will go through a set of input files (specified by wildcard) and make an output file containing the list of events that pass a given
Cut. This list can then be passed to other tools to make a file for scanning etc.
This class can do two related, but different, things. First, it can create an output file from an input file using only those events that either pass a
Cut, or are specified in a file listing the desired events. Second, it can
transfer only the specified
Vars to the output file, filling all other fields with NaN. This allows the output file to be much smaller (NaNs compress well), but sadly doesn't cause it to load much faster.
You will probably find that everything goes around 2x faster if you use an optimized build:
export SRT_QUAL=maxopt; srt_setup -a