Project

General

Profile

Creating Toy DataMC

Strategy

The plan here is to exercise as many steps in the professor workflow in a well controlled (though possibly, challenging) problem space.

The challenging part of this test is using a functional form that involves a power law ~ |x+b'| (2+a') .

This testing gives professor an advantage in that the functional form used to generate both data and MC are the same (just unknown parameters), while for real problems e.g. the hadronic code, that functional form is not known and for parts of phase space might not even be close.

Procedure

Initial Setup

ssh geant4gpvm01.fnal.gov

source /geant4/app/users/altups/setup_professor.sh

cd /geant4/app/users/${USER}
mkdir toyprofessor
cd    toyprofessor

Seed File Structure

Choose number of parameters and sampled points

export NPARAM=3
prof2-ncoeffs ${NPARAM}
yields
3 dimensional parameter space:

Polynomial order     Minimum samples
         0                   1
         1                   4
         2                  10
         3                  20
         4                  35
         5                  56
         6                  84
         7                 120
         8                 165
         9                 220
        10                 286

Create param.ranges

# define param space

# pname pmin pmax
a       -1   +1
b       -1   +1
c       -1   +1

# end-of-file

Create template for histogram generator input params.json to accompany params.dat

{{ "params": {{
      "a": {a},
      "b": {b},
      "c": {c}
  }}
}}

Create scan directory with sub-directories 0000 ... ${NPOINTS}, each with params.dat and params.json

export NPOINTS=100
export TEMPLATEFILE=params.json

# creates "scan" directory
prof2-sample param.ranges -n $NPOINTS -t $TEMPLATEFILE

Make a "fake" data version

# and one "data" directory
mkdir data
cp scan/0000/params.json data/params.json  # just as a template

# adjust the value in data/params.json by hand (semi-randomly)

# these are my particular hand-chosen nonsense values
cat data/params.json | python -m json.tool
yields
{
    "params": {
        "a": -0.12345,
        "b": 0.67891,
        "c": -0.42424
    }
}

Create Histograms for the "data" and each scan value

Using version 1 of the toy data generator histograms were created for "data" and the "scan" subdirectories.
A single foo histogram with 50 x bins [0:20] was filled from func1 = (400.-c*100.*pow(abs(x-5.0*b)/10.0,2+a)) (a, b, c being the "model" parameters to be fit for).

# run the "data" generator for this set of values
cd data
../create_toy_data.py 
params object:
{u'params': {u'a': -0.12345, u'c': -0.42424, u'b': 0.67891}}
   ... <stuff>...
cd ..

# now need to create .yoda files for each scan subdirectory based on it's own "params.json" 
cd scan
for d in * ; do cd $d ; echo "scan $d" ; ../../create_toy_data.py > $d.log 2>&1 ; cd .. ; done
cd ..

Find Interpolation Parameterization

prof2-ipol scan

# ideally we've visualize the parameter interpolation
prof2-I ipol.dat
# Problem with getting & configuring matplotlib/WX interface method: Matplotlib backend_wx and backend_wxagg require wxPython >=2.8.12

# tell us something about the interpolation
prof2-ls ipol.dat
yields
Objects:
  /foo

Ranges of parameters:
  -0.993464    0.994585    a
  -0.986894    0.984702    b
  -0.995122    0.981831    c

Perform Tuning Stage

# actual do parameter tuning
prof2-tune -d data ipol.dat
gives (for default 3rd order polynomials on parameters)
...
---------------------------------------------------------------------------------------
fval = 24.33527195546708 | total call = 144 | ncalls = 144
edm = 1.506543709322404e-07 (Goal: 1e-05) | up = 1.0
---------------------------------------------------------------------------------------
|          Valid |    Valid Param | Accurate Covar |         Posdef |    Made Posdef |
---------------------------------------------------------------------------------------
|           True |           True |           True |           True |          False |
---------------------------------------------------------------------------------------
|     Hesse Fail |        Has Cov |      Above EDM |                |  Reach calllim |
---------------------------------------------------------------------------------------
|          False |           True |          False |            u'' |          False |
---------------------------------------------------------------------------------------

----------------------------------------------------------------------------------------------
|      | Name  |  Value   | Para Err |   Err-   |   Err+   |  Limit-  |  Limit+  |          |
----------------------------------------------------------------------------------------------
|    0 |     a = -0.1835  |  0.3209  |          |          |          |          |          |
|    1 |     b =  0.468   |  0.2646  |          |          |          |          |          |
|    2 |     c = -0.3523  |  0.07486 |          |          |          |          |          |
----------------------------------------------------------------------------------------------

**********************************************************************
Minimisation finished after 0.312767028809 seconds
'chi2': 24.34 --- Ndf : 47 --- ratio : 0.52

Try this again with a higher order polynomial on the parameters

prof2-ipol scan --order=4 --summ="this order 4" 
mv ipol.dat ipol-order4.dat
prof2-tune data ipol-order4.dat
yields
...
---------------------------------------------------------------------------------------
fval = 26.06357120025671 | total call = 106 | ncalls = 106
edm = 1.8732281251790123e-06 (Goal: 1e-05) | up = 1.0
---------------------------------------------------------------------------------------
|          Valid |    Valid Param | Accurate Covar |         Posdef |    Made Posdef |
---------------------------------------------------------------------------------------
|           True |           True |           True |           True |          False |
---------------------------------------------------------------------------------------
|     Hesse Fail |        Has Cov |      Above EDM |                |  Reach calllim |
---------------------------------------------------------------------------------------
|          False |           True |          False |            u'' |          False |
---------------------------------------------------------------------------------------

----------------------------------------------------------------------------------------------
|      | Name  |  Value   | Para Err |   Err-   |   Err+   |  Limit-  |  Limit+  |          |
----------------------------------------------------------------------------------------------
|    0 |     a = -0.07236 |  0.2238  |          |          |          |          |          |
|    1 |     b =  0.4548  |  0.1886  |          |          |          |          |          |
|    2 |     c = -0.3163  |  0.06111 |          |          |          |          |          |
----------------------------------------------------------------------------------------------

**********************************************************************
Minimisation finished after 0.25261092186 seconds
'chi2': 26.06 --- Ndf : 47 --- ratio : 0.55

Oddly, this give a worse χ 2 than the 3rd order.

Plots and Initial Summary

As the python based visualizations aren't working the working data was visualize with ROOT:

cd data
yoda2root hists.yoda
root draw_fits.C\(1.0\,\"hists.root\")
to look at the generated "data" and the function (red) it was derived from:

data and function for scale=1

where the script is given here.

There is a fair amount of bin-to-bin scatter so data and "scan" entries were regenerated with more data (scale=100.).
Then the data and function look like:

data and function for scale=100

Then the prof2-ipol and prof2-tune steps were re-run:

scale=1, nscan=100 scale=100, nscan=100
param truth 3rd order val para err 4th order val para err 3rd order val para err 4th order val para err
a -0.12345 -0.1835 0.3209 -0.07236 0.2238 0.2626 0.1099 -0.1586 0.02614
b 0.67891 0.4680 0.2646 0.4548 0.1886 0.3915 0.03148 0.7053 0.04798
c -0.42424 -0.3523 0.07486 -0.3163 0.06111 -0.3059 0.01512 -0.4281 0.02207
χ 2 24.34 26.06 31.64 30.05
Ndf 47 47 47 47
ratio 0.52 0.55 0.67 0.64
$ root -l plot_fits.C\(0.2626,0.3915,-0.3059,-0.1586,0.7053,-0.4281\)
root [0] 
Processing plot_fits.C(0.2626,0.3915,-0.3059,-0.1586,0.7053,-0.4281)...
Info in <TCanvas::MakeDefCanvas>:  created default TCanvas with name c1
ffunc "f0" (fit=0,col=632) : 100.000000*(400.-(-0.424240)*100.*pow(abs(x-5.0*(0.678910))/10.0,2+(-0.123450)))
ffunc "f1" (fit=1,col=418) : 100.000000*(400.-(-0.305900)*100.*pow(abs(x-5.0*(0.391500))/10.0,2+(0.262600)))
ffunc "f2" (fit=2,col=602) : 100.000000*(400.-(-0.428100)*100.*pow(abs(x-5.0*(0.705300))/10.0,2+(-0.158600)))
root [1] 

fit 3rd and 4th order parameters, scale=100, 100 scan points

where red = true functional form, green = 3rd order fit, blue = 4th order fit.

More Scan Points

Let's try more test points, instead of 100 .. 500

export NPOINTS=500
export TEMPLATEFILE=params.json

# creates "scan" directory
prof2-sample param.ranges -n $NPOINTS -t $TEMPLATEFILE

 cd scan ; for d in * ; do cd $d ; echo "scan $d" ; ../../create_toy_data.py > $d.log 2>&1 ; cd .. ; done ; cd ..

Then the prof2-ipol and prof2-tune steps were re-run, which resulted in:

Warning --- parameters are outside the validity of the parametrisation:
 a=-2.439647 ! in [-0.997829,0.998317]
 b=-0.997538 ! in [-0.995680,0.993115]
 You might want to impose limits (--limits) on those parameters.

so a limits.dat file was introduced:

a -1 1
b -1 1
c -1 1

and further prof2-tune calls added --limits=limits.dat

scale=100, nscan=100 scale=100, nscan=500
param truth 3rd order val para err 4th order val para err 3rd order val para err 4th order val para err
a -0.12345 0.2626 0.1099 -0.1586 0.02614 -0.07314 0.1817 -0.1498 0.02976
b 0.67891 0.3915 0.03148 0.7053 0.04798 0.5657 0.1161 0.7049 0.04583
c -0.42424 -0.3059 0.01512 -0.4281 0.02207 -0.3748 0.04256 -0.4322 0.0205
χ 2 31.64 30.05 29.77 20.08
Ndf 47 47 47 47
ratio 0.67 0.64 0.63 0.43
$ root -l plot_fits.C\(-0.07314,0.5657,-0.3748,-0.1498,0.7049,-0.4322\)
root [0] 
Processing plot_fits.C(-0.07314,0.5657,-0.3748,-0.1498,0.7049,-0.4322)...
Info in <TCanvas::MakeDefCanvas>:  created default TCanvas with name c1
ffunc "f0" (fit=0,col=632) : 100.000000*(400.-(-0.424240)*100.*pow(abs(x-5.0*(0.678910))/10.0,2+(-0.123450)))
ffunc "f1" (fit=1,col=418) : 100.000000*(400.-(-0.374800)*100.*pow(abs(x-5.0*(0.565700))/10.0,2+(-0.073140)))
ffunc "f2" (fit=2,col=602) : 100.000000*(400.-(-0.432200)*100.*pow(abs(x-5.0*(0.704900))/10.0,2+(-0.149800)))
root [1] 

fit 3rd and 4th order parameters, scale=100, 500 scan points

where red = true functional form, green = 3rd order fit, blue = 4th order fit.


Histogram Generator - Versions 1 and 2

This python script creates a yoda file containing a single histogram foo.
The foo histogram has 50 bins [0:20], where each bin is filled with a number of entries based on a Poisson distribution with the mean of the func1 function evaluated at the bin midpoint.
The function in this case is func1 = 100*(400.-c*100.*pow(abs(x-5.0*b)/10.0,2+a)) where a, b, c are the parameters professor is to tune for, while x is the abscissa of the foo histogram.

#! /usr/bin/env python2.7
""" 
This is a script to create data for toy "problem" for professor
Uses values in params.json  (professor uses params.dat)

""" 

import random, yoda, json
from numpy.random import poisson
from sympy import symbols

myfile = open("params.json","r");

myjson = json.load(myfile)
print "myjson object: ",myjson

params = myjson["params"]
print "params object: ",params

a = params["a"]
b = params["b"]
c = params["c"]

x, y, z = symbols('x y z')

scale = 1.0     # version 1
scale = 100.0   # version 2
func1 = scale*(400.-c*100.*pow(abs(x-5.0*b)/10.0,2+a))
print 'func1: ',func1

def eval3_1(xx,yy,zz):
    v = func1.subs(x,xx)
    print "func1: ",func1,'xx=',xx,' v=',v
    return v

nbins=50
xmin=0.0
xmax=20.0
h1 = yoda.Histo1D(nbins, xmin, xmax, "/foo")
for b in h1.bins:
    xmid = b.xMid
    n = eval3_1(xmid,0,0)
    if (n<=0):
        continue
    # mean for this bin ... now poisson
    nthrows = poisson(n)
    print b, n, nthrows
    for toss in range(nthrows):
        h1.fill(xmid)

yoda.write(h1,"hists.yoda")

# end-of-script

Data and Fit Visualization with ROOT

// plot our fits
#include <string>
#include <vector>
using namespace std;

#include "TFile.h" 
#include "TH1.h" 
#include "TF1.h" 

double atrue = -0.12345;
double btrue =  0.67891;
double ctrue = -0.42424;

double gscale = 100.0;
int    nfit   = 0;

double xmin =  0.;
double xmax = 20.;

TH1* foo = 0;
vector<TF1*> vfits;

Color_t colors[] = { kRed, kGreen+2, kBlue+2, kCyan+2, kMagenta+2 };

const string fform = "SCALE*(400.-(C)*100.*pow(abs(x-5.0*(B))/10.0,2+(A)))";

// forward declare variants

void plot_fits(double scale=100.0,string fname="hists.root" );
void plot_fits(double a1, double b1, double c1,
               double scale=100.0,string fname="hists.root" );
void plot_fits(double a1, double b1, double c1,
               double a2, double b2, double c2,
               double scale=100.0,string fname="hists.root" );

void plot_1fit(double a, double b, double c);

// actual code

void myReplace(std::string& str,
               const std::string& oldStr,
               const std::string& newStr)
{
  std::string::size_type pos = 0u;
  while((pos = str.find(oldStr, pos)) != std::string::npos){
     str.replace(pos, oldStr.length(), newStr);
     pos += newStr.length();
  }
}

void plot_1fit(double a, double b, double c) {
  // assume a canvas is in play

  char   fname[10];
  sprintf(fname,"f%d",nfit);

  string ffunc = fform;

  char   buff[100];

  sprintf(buff,"%f",gscale);
  myReplace(ffunc,"SCALE",buff);

  sprintf(buff,"%f",a);
  myReplace(ffunc,"A",buff);

  sprintf(buff,"%f",b);
  myReplace(ffunc,"B",buff);

  sprintf(buff,"%f",c);
  myReplace(ffunc,"C",buff);

  Color_t col = colors[nfit];
  cout << "ffunc \"" << fname << "\" (fit=" << nfit << ",col=" << col
       << ") : " << ffunc << endl;
  TF1 *f = new TF1(fname,ffunc.c_str(),xmin,xmax);
  f->SetLineColor(col);
  f->Draw("same");

  vfits.push_back(f);
  ++nfit;
}

void plot_fits(double scale, string fname) {

  gscale = scale;

  TFile *f = TFile::Open(fname.c_str());
  f->GetObject("foo",foo);
  if ( ! foo ) {
    cout << "failed to get foo from " << fname << endl;
    exit(42);
  }
  foo->SetStats(kFalse);
  foo->Draw();

  plot_1fit(atrue,btrue,ctrue);
}

void plot_fits(double a1, double b1, double c1,
               double scale=100.0,string fname="hists.root" ) {
  plot_fits(scale,fname);
  plot_1fit(a1,b1,c1);

}
void plot_fits(double a1, double b1, double c1,
               double a2, double b2, double c2,
               double scale=100.0,string fname="hists.root" ) {
  plot_fits(a1,b1,c1,scale,fname);
  plot_1fit(a2,b2,c2);
}