NC Cosmic BDTs

PID Basic

This study is based on the TMVA framework and was inspired by the μ neutrino cosmic BDT. The algorithm employed is the real adaptive boosted decision tree, which is also used by the numu and nue cosmic BDTs. Boosting is a method that improves classification performance and increases stability with respect to statistical fluctuations by combining weak classifiers (decision trees in our case), each trained on a reweighted version of the input training data. Adaptive boosting was the first successful boosting algorithm developed for binary classification. Therefore, as a first attempt, we apply it to separate the NC signal from the backgrounds, mainly cosmic events.

We introduce the NC Cosmic BDT in two parts: the training phase and the application phase. The training phase section covers the basics of the algorithm, how to produce the input files from CAF, where to tune the hyper-parameters in the TMVA macro, and which input discriminating variables are used. The application phase section shows how we integrate the weight file (the trained model) into the NOvA ART framework and then bring it into the CAF framework in two ways.

Training Phase

How the algorithm works

Each event in the training set is assigned a weight. In the first training iteration, all events have the same weight, weight(i) = 1/n, where i labels the i'th training event and n is the total number of training events. Events misclassified by a decision tree are then given a higher weight in the next training iteration.

The misclassification rate is computed for each trained tree: error = (N - correct)/N, where correct is the number of training events predicted correctly by that tree and N is the total number of training events. Taking the event weights into account gives the weighted misclassification rate, werror = sum(weight(i) * e(i)) / sum(weight(i)), where e(i) is the prediction error of the i'th event: e(i) = 1 if the event is misclassified and 0 otherwise.

A stage value is then computed for each trained tree to weight its predictions: stage value = ln((1 - werror)/werror). More accurate trees therefore contribute more to the final prediction of the ensemble. The training weights are updated every iteration to give higher weight to incorrectly predicted events: weight(i) = weight(i) * exp(stage value * e(i)).

Trees are added sequentially, each trained on the reweighted training data. Training continues until a pre-set number of trees has been built or no further improvement in accuracy is obtained. Once the process is complete, we have a pool of trees, each with a stage value. The prediction of the adaptive BDT is then the stage-value-weighted combination of the predictions of the individual trees.
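The quantities in one boosting iteration can be sketched in plain C++ (standard library only, no TMVA). This is a minimal illustration of the formulas in this section, not the TMVA implementation. The function names are hypothetical, and the ±1 vote encoding and the zero threshold in EnsemblePrediction are illustrative assumptions.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Weighted misclassification rate:
//   werror = sum(weight(i) * e(i)) / sum(weight(i)),
// where e(i) = 1 if event i was misclassified and 0 otherwise.
double WeightedError(const std::vector<double>& weight, const std::vector<int>& e) {
  double num = 0.0, den = 0.0;
  for (std::size_t i = 0; i < weight.size(); ++i) {
    num += weight[i] * e[i];
    den += weight[i];
  }
  return num / den;
}

// Stage value of one tree: ln((1 - werror) / werror).
double StageValue(double werror) {
  return std::log((1.0 - werror) / werror);
}

// Raise the weights of misclassified events:
//   weight(i) = weight(i) * exp(stage * e(i)).
void UpdateWeights(std::vector<double>& weight, const std::vector<int>& e, double stage) {
  for (std::size_t i = 0; i < weight.size(); ++i)
    weight[i] *= std::exp(stage * e[i]);
}

// Ensemble prediction from the stage-weighted sum of per-tree votes.
// Assumption: each vote is +1 (signal) or -1 (background), threshold at 0.
int EnsemblePrediction(const std::vector<double>& stage, const std::vector<int>& vote) {
  double sum = 0.0;
  for (std::size_t t = 0; t < stage.size(); ++t) sum += stage[t] * vote[t];
  return sum >= 0.0 ? +1 : -1;
}
```

For four events with equal weight 0.25 and one of them misclassified, WeightedError returns 0.25, StageValue returns ln 3 (about 1.10), and UpdateWeights raises the weight of the misclassified event from 0.25 to 0.75 while leaving the others unchanged.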

How to produce input files

This is done in the CAF framework. A macro, ProducingSA.C, has been written for this purpose. To run the macro, simply do

cafe ProducingSA.C

The resulting ROOT file has four trees: ncTree (NC signal), muTree (charged-current μ neutrino events), neTree (charged-current e neutrino events), and csTree (cosmic events). Events are assigned to the trees using truth information. More than 100 CAF variables are stored, many more than the input discriminating variables we actually use. The input files for the macro are 1000 nonswap files and 100000 cosmic files.

For this first attempt we did not consider the variation between different data-taking periods of the files. Sijith is re-tuning the algorithm taking the different periods into account; the result can be found in .

 const std::string fnamecos = "defname:prod_limitedcaf_R16-03-03-prod2reco.a_fd_cosmic_full_nueveto_v1_goodruns with limit 100000 with stride 2";
 const std::string fnamenc  = "defname:prod_caf_R16-03-03-prod2reco.f_fd_genie_nonswap_fhc_nova_v08_period2_v1_prod2-snapshot with limit 1000";

By modifying the above lines in the macro, the input files can be redefined.

How to tune the Hyper-Parameters

The training and testing are done in the TMVA framework by a dedicated macro, TrainingSA.C, which is adapted from TMVAClassification.C, the official TMVA training macro.
To run the macro, simply do the following in a ROOT environment.

root -l TrainingSA.C

The training phase begins with instantiating a Factory object with selected configuration options.

TMVA::Factory *factory = new TMVA::Factory( "Job Name", outputFile, "configuration options");

For our first attempt, we used a very simple configuration, as shown:
TMVA::Factory *factory = new TMVA::Factory( "SA", outputFile,"V");

There are six configuration options: Verbose Flag, Color Screen, Transformations, DrawProgressBar, AnalysisType, and Silent. Of these, AnalysisType and Transformations play a vital role. AnalysisType defines which type of analysis we want to perform: Classification, Regression, Multiclass, or Auto. The Transformations option defines the type of data preprocessing. TMVA can preprocess the input variables of the training events before passing them to the selected algorithm, which can help to reduce correlations among the input discriminating variables. The five input variable transformations available in TMVA are:

  1. Variable normalisation;
  2. Decorrelation via the square-root of the covariance matrix;
  3. Decorrelation via a principal component decomposition;
  4. Transformation of the variables into Uniform distributions (“Uniformization”);
  5. Transformation of the variables into Gaussian distributions (“Gaussianisation”).
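
As an illustration of the first item in the list, variable normalisation is a linear rescaling of each input variable to a fixed interval. Below is a minimal standalone sketch, assuming the target interval is [-1, 1]. NormaliseVariable is a hypothetical helper written for this page, not a TMVA function, and mapping a constant variable to 0 is an arbitrary choice.

```cpp
#include <algorithm>
#include <vector>

// Linearly rescale one input variable so that its minimum maps to -1
// and its maximum to +1. Assumes a non-empty sample.
// A constant variable (max == min) is mapped to 0 by convention here.
std::vector<double> NormaliseVariable(const std::vector<double>& x) {
  const auto mm = std::minmax_element(x.begin(), x.end());
  const double lo = *mm.first;
  const double hi = *mm.second;
  std::vector<double> out;
  out.reserve(x.size());
  for (double v : x)
    out.push_back(hi > lo ? 2.0 * (v - lo) / (hi - lo) - 1.0 : 0.0);
  return out;
}
```

For example, the values 0, 5, 10 are mapped to -1, 0, 1. In a real training the minimum and maximum would be taken from the training sample and then reused unchanged when the classifier is applied.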

More details can be found in Chapter 3 of the TMVA User Guide.

The ROOT file produced by ProducingSA.C is then handed to the Factory, which splits it into one training tree and one test tree. This guarantees a statistically independent evaluation of the selected algorithm on the test sample. The number of events in each sample is chosen by the user. We used half of the input dataset for training and the other half for testing.
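
The idea behind the half-and-half split can be sketched in standalone C++. This is only an illustration of a random, non-overlapping split and not how TMVA performs it internally. SplitHalf and its seed parameter are hypothetical names introduced here.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <random>
#include <utility>
#include <vector>

// Randomly split the event indices 0..n-1 into a training half and a test half.
// The two halves never share an event, so the test sample stays independent.
// The seed (a hypothetical parameter) makes the split reproducible.
std::pair<std::vector<std::size_t>, std::vector<std::size_t>>
SplitHalf(std::size_t n, unsigned seed) {
  std::vector<std::size_t> idx(n);
  std::iota(idx.begin(), idx.end(), 0);
  std::mt19937 rng(seed);
  std::shuffle(idx.begin(), idx.end(), rng);
  const std::size_t nTrain = n / 2;
  std::vector<std::size_t> train(idx.begin(), idx.begin() + nTrain);
  std::vector<std::size_t> test(idx.begin() + nTrain, idx.end());
  return {train, test};
}
```

Calling SplitHalf(10, 42) returns two vectors of five indices each which together cover every event exactly once.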

How to define the discriminating variables or spectators

The following lines show the thirteen input variables that we used:

factory->AddVariable( "cosmicid",         "CVN Cosmic ID",                  'F' ); 
factory->AddVariable( "partptp",          "PartPtP",                        'F' );                                
factory->AddVariable( "shwnhit",          "Leading Prong Number of Hits",   'F' );                                
factory->AddVariable( "shwxminusy",       "X View - Y View",                'F' );                                
factory->AddVariable( "shwxplusy",        "X View + Y View",                'F' );                                    
factory->AddVariable( "shwxovery",        "X-Y/X+Y",                        'F' );                                    
factory->AddVariable( "shwcalE",          "Leading Prong CalE",             'F' );                                    
factory->AddVariable( "shwdirY",          "Leading Shower Y Direction",     'F' );                               
factory->AddVariable( "shwlen",           "Leading Shower Length",          'F' );                              
factory->AddVariable( "shwwwidth",        "Leading Shower Width",           'F' );                                  
factory->AddVariable( "shwGap",           "Leading Shower Gap",             'F' );                               
factory->AddVariable( "nshwlid",          "Number of Shower/Prong",         'F' );                                    
factory->AddVariable( "nmiphit",          "Number of MIP Hits in the slice",'F' );                                 

The variables in the input trees used to train the selected algorithm are registered with the Factory object through the AddVariable method. Taking the example above, AddVariable takes the variable name (e.g. cosmicid), which must match a variable in the input file, followed by a descriptive title and the variable type. An ND Data/MC study of the input variables can be found in ND_Data/MC_Comparison.
To add a new variable to the Factory, just do

factory->AddVariable( "Your Var's name in the input tree",   "The Var's physics meaning",   'The Var's type, floating point (F) or integer (I)' ); 

To remove a variable, simply comment out the corresponding line.

How to book an algorithm

The lines below book the algorithm together with the hyper-parameters that we used:

if (Use["BDTA"])   
     factory->BookMethod( TMVA::Types::kBDT, "BDTA",                             

In the TMVA framework there are thirty-one hyper-parameters in total. More details can be found on pages 114 to 117 of the TMVA User Guide.

Application Phase

There are three ways to perform the application phase:

  1. In the TMVA framework, using the dedicated macro ApplicationSA.C. This method gives a better idea of the classifier's prediction ability without considering the Nus Official Cuts.
  2. In the CAF framework, where Gavin has produced an on-the-fly way to add the variable to the CAF. The corresponding files are NusVarsTemp.h and NusVarsTemp.cxx.
  3. Through an ART module, which we have also used to add the variable to the CAF. The section below gives the details of this last method.

How to produce the ART module

A new package, svn/trunk/NCID, has been created by Enhao for this analysis under the NOvA ART framework. All the corresponding modules can be found inside NCID.
In the module, we define the path to the weight file and call the AddVariable method in the beginRun function. In the produce function, we read the corresponding input variables and pass their values to the trained model.

How to get the BDT into CAF

Jose describes how to add variables to CAF in Adding Vars into CAF.