Steve Mrenna's Pythia tuning using R

Background

On July 29, 2010, Jim K., Marc, and Steve discussed one of the analyses Steve has done with CDF data and is interested in pursuing with CMS data. Our goal was to see whether other data analysis tools we have come to use could make the task easier or faster.

The task we discussed involves statistical analysis of "toy reconstruction" of Pythia events, and comparison of the distribution of parameters of interest from this Pythia output with distributions of the same parameters from collider data. The goal is to determine how the simulation output needs to be modified to better match the observed collider data, so that the simulation can be better understood and improved.

The "toy reconstruction" output from the Pythia events can be summarized in four types of records:
#. an event header that contains a run number, event number, and a few quantities global to the event (such as the pt-hat for the collision)
#. a missing pt record that contains a record identifier (ptmiss), pt, and phi
#. a lepton/photon record that contains a particle type (e/mu/tau/photon), pt, eta, and phi
#. a jet/b record that contains a particle type (j/b), mass, pt, eta, and phi
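
As an illustration only, the four record types listed above could be modeled as simple Python dataclasses. This is a sketch, not Steve's actual format: the class and field names (`EventHeader`, `ToyEvent`, `pt_hat`, and so on) are hypothetical, the header is shown with only the pt-hat as its global quantity, and we assume one missing pt record per event, which the discussion did not confirm.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class EventHeader:
    run: int
    event: int
    pt_hat: float  # one example of a quantity global to the event


@dataclass
class MissingPt:
    pt: float
    phi: float
    kind: str = "ptmiss"  # record identifier


@dataclass
class LeptonPhoton:
    kind: str  # "e", "mu", "tau", or "photon"
    pt: float
    eta: float
    phi: float


@dataclass
class JetB:
    kind: str  # "j" or "b"
    mass: float
    pt: float
    eta: float
    phi: float


@dataclass
class ToyEvent:
    # Assumption: exactly one header and one missing pt record per event,
    # followed by any number of lepton/photon and jet/b records.
    header: EventHeader
    ptmiss: MissingPt
    leptons_photons: List[LeptonPhoton] = field(default_factory=list)
    jets: List[JetB] = field(default_factory=list)


def leading_jet_pt(event: ToyEvent) -> Optional[float]:
    """Return the largest jet/b pt in the event, or None if it has no jets."""
    return max((j.pt for j in event.jets), default=None)


# Made-up values, purely to show how the records fit together.
example = ToyEvent(
    header=EventHeader(run=1, event=42, pt_hat=35.0),
    ptmiss=MissingPt(pt=12.5, phi=0.3),
    leptons_photons=[LeptonPhoton(kind="mu", pt=22.0, eta=0.8, phi=-1.2)],
    jets=[
        JetB(kind="j", mass=8.0, pt=40.0, eta=1.1, phi=2.0),
        JetB(kind="b", mass=5.0, pt=55.0, eta=-0.4, phi=-2.5),
    ],
)
```

Grouping the records per event like this makes per-event quantities (for example `leading_jet_pt(example)`) easy to compute, whatever storage format is eventually chosen.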

Several hundred thousand "toy reconstructed" events are used in the analysis. Distributions of the reconstructed quantities are compared to a set of many (hundreds?) histograms from the collider data. The quality of the agreement between the shapes is evaluated (using a Kolmogorov-Smirnov test?), and modifications of the "toy reconstruction" distributions are then evaluated to optimize the agreement.
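
If the shape comparison is indeed a Kolmogorov-Smirnov test (we were not certain), the core quantity is the largest distance between two cumulative distributions. The sketch below computes the two-sample KS statistic in plain Python on unbinned values. It is an assumption-laden illustration: the function name `ks_statistic` is hypothetical, and the real analysis compares against histograms, where the same idea would be applied to cumulative bin contents instead.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Two-sample KS statistic: the maximum absolute difference
    between the empirical CDFs of the two samples."""
    if not sample_a or not sample_b:
        raise ValueError("both samples must be non-empty")
    a = sorted(sample_a)
    b = sorted(sample_b)
    d = 0.0
    # Both empirical CDFs are step functions, so the maximum difference
    # is reached at one of the observed values.
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)  # fraction of a that is <= x
        cdf_b = bisect_right(b, x) / len(b)  # fraction of b that is <= x
        d = max(d, abs(cdf_a - cdf_b))
    return d


# Made-up numbers standing in for a simulated and an observed quantity.
simulated = [1.0, 2.0, 3.0, 4.0]
observed = [3.0, 4.0, 5.0, 6.0]
distance = ks_statistic(simulated, observed)
```

A statistic of 0 means the two empirical distributions coincide, and 1 means they do not overlap at all. Tuning would then amount to adjusting the simulation so that this distance shrinks across the set of compared distributions.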

Data storage issues

Steve told us that the "toy reconstruction" output is often not saved, because it is fast to reproduce. It may also be inconvenient to store because suitable tools are lacking.

CMS favors storing such information in ROOT files.

We hope to find a storage format that is more convenient in several ways:

#. The format should be accessible from a variety of programming languages or tools
#. The format should be reasonably compact
#. The format should support parallel reading

More to come