by Alex Drlica-Wagner


The Toolkit for Multivariate Data Analysis (TMVA) in ROOT has become one of the standard tools for attacking multivariate classification problems in high-energy physics and astro-particle physics. It provides an integrated machine-learning environment with access to roughly a dozen multivariate classification algorithms. From past experience, I am most familiar with the TMVA implementation and performance of boosted decision trees (BDTs).

In accord with the convention of the SG Challenge, I've defined the "signal" to be galaxies and the "background" as stars.

Simple Test

As a very simple first example of the TMVA machinery, I trained a forest of 500 boosted decision trees using a very simple set of morphological variables:
  • spread_model_g and spreaderr_model_g
  • spread_model_r and spreaderr_model_r
  • spread_model_i and spreaderr_model_i
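The actual training is steered by the yaml configuration files attached at the end of this page; purely as an illustration, a variable list for this simple test might look something like the following (the keys shown here are hypothetical, not the real configuration schema):

```yaml
# Hypothetical sketch of a training configuration for the simple test.
# The attached yaml files are authoritative; key names here are illustrative.
classifier: BDT
ntrees: 500
signal: galaxies      # SG Challenge convention
background: stars
variables:
  - spread_model_g
  - spreaderr_model_g
  - spread_model_r
  - spreaderr_model_r
  - spread_model_i
  - spreaderr_model_i
```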

More than anything else, this simple test gives a chance to demonstrate some of the diagnostic information provided by TMVA.

Output Variable: Distribution of the classifier output variable. The solid histogram represents the training sample (70% of the sample provided by the SGC), while the data points represent the remaining 30% used to validate the training.
Corr. Matrix (Sig): Linear correlations between the input variables for the signal sample.
Corr. Matrix (Bkg): Linear correlations between the input variables for the background sample.
ROC Curve: Receiver operating characteristic (ROC) curve showing signal efficiency vs. background rejection.
Sig. Efficiency: Curves for determining the optimal cut value on the output variable.
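The ROC curve above can be computed directly from the classifier outputs on the signal and background validation samples. A minimal numpy sketch (the variable names and toy scores are illustrative, not TMVA internals):

```python
import numpy as np

def roc_points(sig_scores, bkg_scores, cuts):
    """Signal efficiency and background rejection at each cut value."""
    sig_scores = np.asarray(sig_scores)
    bkg_scores = np.asarray(bkg_scores)
    sig_eff = np.array([(sig_scores > c).mean() for c in cuts])   # true positive rate
    bkg_rej = np.array([1.0 - (bkg_scores > c).mean() for c in cuts])  # 1 - FPR
    return sig_eff, bkg_rej

# Toy example: galaxies (signal) score high, stars (background) score low.
sig = np.array([0.9, 0.8, 0.7, 0.4])
bkg = np.array([0.1, 0.2, 0.3, 0.6])
eff, rej = roc_points(sig, bkg, cuts=[0.5])
# At a cut of 0.5, three of four signal events pass and
# three of four background events are rejected.
```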

Round 1

Morphological Variables

I start with a large set of morphological variables in all 5 bands. The only pre-cut that I apply before training is mag_psf_i < 99 (based on a report from Ryan Keisler that there are some strange objects in the catalog). For testing, I don't apply any pre-cuts.

SIG Correlation Matrix BKG Correlation Matrix

Magnitude Variables

I next choose a large set of magnitude variables in all 5 bands. The precuts are the same as before.

SIG Correlation Matrix BKG Correlation Matrix

The Kitchen Sink

Now I throw in both the morphological and spectral variables (removing the unnecessary variables found above) and add class_star. This classifier should serve as a benchmark for the best possible performance (given these algorithmic specifications). However, in my experience, the risk of over-training and of inconsistencies between the training and application data sets makes it dangerous to depend on a classifier with so many input variables. Thus, the goal is to approach this performance with the minimum number of variables.


I've gone through and made a first round of simple trimming to cut out variables that were found to have little impact. This removes all y-band variables in addition to mag_aper_11_* and mu_eff_model_*. Below I've plotted the performance of these classifiers (and their trimmed versions) using the metric described on the SG separation challenge page. The magnitude-dependent completeness/purity plots are created for a generic predictor cut value of 0.8, which may not be optimal for all classifiers plotted (but seems to be a reasonable estimate).

NOTE: I have severe doubts about this metric, since it depends on the composition of the test sample. For this reason I have also plotted a more conventional ROC curve, which should be insensitive to the test sample composition.
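To make the composition dependence concrete: purity = TP / (TP + FP) changes with the galaxy/star ratio of the test sample even when the classifier's operating point (its efficiencies) is held fixed. A small sketch with illustrative numbers (not the measured efficiencies):

```python
def purity(sig_eff, bkg_eff, n_sig, n_bkg):
    """Purity of the selected sample: true positives over all positives."""
    tp = sig_eff * n_sig   # signal events passing the cut
    fp = bkg_eff * n_bkg   # background events passing the cut
    return tp / (tp + fp)

# Same classifier operating point, two different test-sample compositions.
eff_s, eff_b = 0.97, 0.17   # illustrative efficiencies only
p_gal_dominated = purity(eff_s, eff_b, n_sig=9000, n_bkg=1000)
p_balanced      = purity(eff_s, eff_b, n_sig=5000, n_bkg=5000)
# Purity is higher when galaxies dominate, although the ROC point is unchanged.
```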

Completeness vs. Purity Magnitude Dependence Bkg. Rejection vs. Sig. Efficiency


In the interest of relative "simplicity" it would be nice not to mix multivariate classifiers if it isn't necessary. With this in mind, I retrained the trimmed combined classifier without the class_star variables for any band. Somewhat surprisingly, the performance of the classifier is unaffected by this trimming. However, since class_star is derived from the more basic variables that are input to the BDT, it is plausible that class_star contributes no additional information.

Completeness vs. Purity Magnitude Dependence Bkg. Rejection vs. Sig. Efficiency

For this classifier I also use TMVA to calculate a predictor cut value meant to optimize the signal-to-noise ratio, defined as SigEff / sqrt(SigEff + BkgEff). This ratio is maximized at a predictor cut value of >0.8075, where SigEff = 0.9668 and BkgEff = 0.1695.
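The optimization above amounts to scanning candidate cut values and maximizing the ratio defined in the text. A numpy sketch with toy scores (the scan itself is what TMVA does internally; these arrays are illustrative):

```python
import numpy as np

def best_snr_cut(sig_scores, bkg_scores, cuts):
    """Scan cut values and maximize SigEff / sqrt(SigEff + BkgEff)."""
    best_cut, best_snr = None, -np.inf
    for c in cuts:
        sig_eff = (np.asarray(sig_scores) > c).mean()
        bkg_eff = (np.asarray(bkg_scores) > c).mean()
        denom = np.sqrt(sig_eff + bkg_eff)
        if denom == 0:
            continue
        snr = sig_eff / denom
        if snr > best_snr:
            best_cut, best_snr = c, snr
    return best_cut, best_snr

# Toy example: the tighter cut removes background at no cost in signal.
sig = np.array([0.9, 0.9, 0.8])
bkg = np.array([0.1, 0.2, 0.9])
cut, snr = best_snr_cut(sig, bkg, cuts=[0.0, 0.5])
```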

Predictor Cut Value

For the SG challenge test sample (r1), this corresponds to a purity = 98.59% and a completeness = 96.88% (integrated over all magnitudes). However, the signal-to-noise ratio is only one way to decide where to place the cut on the classifier output. For example, putting a cut >0.30 yields 99.1% completeness at 98.1% purity. Remember, the test sample is heavily dominated by galaxies and thus large increases to the false positive rate will manifest themselves as only small changes in the purity.

Band-by-Band Classification

I've also trained classifiers in each band individually (for g,r,i,z,y). These classifiers use the same morphological, spectral, and class_star variables as the trimmed global classifier, but are separated into classifiers for each band. As expected, these classifiers are significantly less powerful than the global classifier.

Completeness vs. Purity Magnitude Dependence

Aggressive Trimming

Encouraged by the successful removal of the class_star variables, I did some more aggressive trimming. I iteratively removed the mag_aper_3_*, mag_aper_4_*, mag_auto_*, mu_max_model_*, and mu_mean_model_* variables without appreciable loss in the performance integrated over all magnitudes (re-introducing the mag_aper_3_* variables could be considered if better performance is required). Examining the performance as a function of magnitude, we lose some performance for mag_auto_i > 24; however, note that these plots depend on the cut value for the classifier output (which is not the same for the two classifiers).

Completeness vs. Purity Magnitude Dependence

I also plot an alternative metric (more consistent with the conventional definition of the ROC), which shows the Signal Efficiency = True Positive Rate and the Background Rejection = 1 - False Positive Rate. This presents a much more illuminating view of the classifier behavior at large magnitudes (where the galaxy/star ratio is the dominant contributor to the purity).

Bkg. Rej. vs. Sig. Eff. Magnitude Dependence Predictor Cut Value

The maximum signal-to-noise ratio for the aggressively trimmed classifier occurs at a cut value of 0.825, where the purity is ~98.7% and the completeness is ~96.0%. Another possible cut at a classifier value of >0.30 yields ~99.0% completeness at ~98.1% purity. The trimmed set of 24 variables (down from 44) contains 6 unique variables for each of the 4 bands:


Additionally, I examined using the difference of mag_model_* - mag_psf_* rather than the individual variables. This yields comparable performance with the added benefit that the explicit magnitude dependence is mitigated. However, this has the added (technical) complication of being a composite variable. For the time being it seems reasonable to stick with the individual input of mag_model_* and mag_psf_*.
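Deriving the mag_model_* - mag_psf_* difference is simple column arithmetic on the catalog. A hedged sketch, assuming the catalog is held as a dict of arrays (the column and function names here are illustrative, not the actual pipeline):

```python
import numpy as np

def add_concentration(catalog):
    """Derive mag_model - mag_psf per band as a magnitude-independent
    'concentration'-like feature (hypothetical column names)."""
    out = dict(catalog)
    for band in "griz":
        out[f"conc_{band}"] = (np.asarray(catalog[f"mag_model_{band}"])
                               - np.asarray(catalog[f"mag_psf_{band}"]))
    return out

# Toy catalog: for a point-like object the model and PSF magnitudes agree,
# so the difference sits near zero; extended objects deviate.
cat = {f"mag_model_{b}": [20.0, 21.5] for b in "griz"}
cat.update({f"mag_psf_{b}": [20.0, 21.0] for b in "griz"})
cat = add_concentration(cat)
```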

Comparison with the "Modest Proposal"

The modest proposal for star-galaxy classification in the SV-A1 gold v1.0 catalog predominantly utilizes i-band information from spread_model and spreaderr_model to classify objects. I take the cut defined on Eli's redmine page and apply it to the test sample of events for the SG challenge. I then compare the relative completeness and purity of the outputs. I make this comparison for two different cut values on the aggressively trimmed TMVA BDT:
  1. A cut of >0.8 which roughly approximates the maximum in the signal-to-noise ratio (and yields a very clean sample)
  2. A cut of >0.2 which is meant to give comparable purity to the modest proposal.

As can be seen below, the SNR cut yields a sample that is significantly more pure than that of the modest proposal (a ~40% decrease in the background contamination). However, this cut also yields decreased completeness. The looser cut, meant to roughly match the purity of the modest proposal, yields significantly higher completeness. In the lower row of plots I show the completeness (signal efficiency) vs. the background rejection. The background rejection is purely a function of the background sample, and thus is not complicated by the population statistics in the way that the purity is.

SNR Cut Purity Cut


I've trained a set of TMVA Boosted Decision Trees (BDTs) classifiers using a broad range of morphological and spectral variables (and a combination of the two). Unsurprisingly, the best performance comes from the combined classifier. I've made some attempts to simplify this classifier by removing some of the less influential variables. This aggressively trimmed classifier has a signal-to-noise ratio maximized at a cut value of >0.825 where the classifier purity is ~98.7% and the completeness is ~96.0%. Another possible cut at a classifier value of >0.30 yields ~99.0% completeness at ~98.1% purity. However, I highly recommend against using the purity as a classifier comparison metric (due to its dependence on the composition of the test sample).

Below I've attached the input (yaml) configuration file used for training these classifiers as well as the output performance of the classifiers on the SG challenge (r1) test data set.
  • Trimmed set all variables without class_star (config) (txt) (root)
  • RECOMMENDED Aggressively trimmed set of variables (config) (txt) (root)

Round 2

Aggressive Classifier

For round 2, I train a new classifier using nearly all of the same variables as the final aggressively trimmed classifier from round 1. However, in response to reported issues in the determination of mag_psf in SV-A1, I use the mag_auto variables instead. As a comparison, I also apply the final round 1 classifier without retraining. As noted before, the mag_psf variables are slightly more powerful than the mag_auto variables; however, removing them is not overly detrimental. Additionally, I re-train this classifier in bins of mag_auto_i: mag_auto_i < 21, 21 <= mag_auto_i < 23, and 23 <= mag_auto_i < 24. These binned classifiers provide slightly better performance than the global classifier in their respective magnitude bins; however, since the original classifier included mag_auto_i in the training, the gains are not expected to be large.
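Applying the binned classifiers amounts to routing each event to the classifier trained on its mag_auto_i bin. A sketch using the bin edges quoted above (the per-bin scoring functions here are hypothetical stand-ins for the trained BDTs):

```python
import numpy as np

# Bin edges from the text: mag_auto_i < 21, 21-23, 23-24.
EDGES = [21.0, 23.0, 24.0]

def binned_predict(mag_auto_i, scores_per_bin):
    """Score each event with the classifier trained on its mag_auto_i bin.
    `scores_per_bin` is a list of per-bin scoring functions."""
    mag = np.asarray(mag_auto_i, dtype=float)
    idx = np.digitize(mag, EDGES)                     # 0, 1, 2 (3 for >= 24)
    idx = np.clip(idx, 0, len(scores_per_bin) - 1)    # overflow uses last bin
    out = np.empty_like(mag)
    for i, score in enumerate(scores_per_bin):
        sel = idx == i
        out[sel] = score(mag[sel])
    return out

# Toy stand-in classifiers returning constant scores per bin.
scores = [lambda m: np.full_like(m, 0.1),
          lambda m: np.full_like(m, 0.2),
          lambda m: np.full_like(m, 0.3)]
out = binned_predict([20.0, 22.0, 23.5, 25.0], scores)
```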

In the plots below I show the ROC curves and the magnitude-dependent performance for the round 1 classifier applied to the round 2 test set (labeled AGRO), the new round 2 classifier using the mag_auto variables (labeled AUTO), and the three independent classifiers trained in bins of mag_auto_i (labeled AUTO_...). The performance of the binned classifiers is assessed by applying each only to test events in the magnitude bin for which it was trained. For the magnitude-dependent performance, a generic cut was applied at an output value of 0.8 (roughly optimized to the SNR). I also plot the magnitude dependence of the ROC variables.

ROC Magnitude Dependence Magnitude Dependence (ROC)


The performance of the BDT has degraded very slightly in round 2 due to the replacement of mag_psf variables with mag_auto variables (this loss in statistical power should be recovered by reduced systematic errors). Training of classifiers in independent magnitude bins yields slightly improved performance. When implementing binned classifiers, it would be best to smoothly weight the output probabilities to avoid sharp, magnitude-dependent features in the star/galaxy distributions.
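One simple way to realize the smooth weighting suggested above is a linear blend of the two adjacent bin classifiers across each boundary. This is only a sketch of the idea (the blend width and functional form are my assumptions, not something specified in the analysis):

```python
import numpy as np

def blended_score(mag, score_lo, score_hi, edge, width=0.5):
    """Linearly blend the outputs of the classifiers below and above a bin
    boundary at `edge`, over a window of +/- `width` magnitudes
    (hypothetical smoothing scheme)."""
    w = np.clip((np.asarray(mag) - (edge - width)) / (2 * width), 0.0, 1.0)
    return (1 - w) * score_lo + w * score_hi

# Away from the boundary each bin's own classifier dominates;
# exactly at the boundary the two outputs are averaged.
```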

Classifier Validation

The goal here is to validate the performance of the classifier on the full SVA1 data set. It has quickly become clear that the classifier behaves very differently over SVA1 than over the COSMOS field.

Output Distribution

The first step is to compare the classifier output in the COSMOS training/testing region to the output over all of SVA1 (normalized to the number of objects).

It is clear from these plots that the output distribution is significantly different between the COSMOS and SVA1 fields. This may be due to the increased depth and better-than-average weather conditions of the COSMOS observations. The consequence is that we cannot assume that the completeness numbers derived in the COSMOS field extend to the rest of SVA1. We will therefore likely want to directly normalize the number of objects in our star/galaxy samples in order to compare the modest classifier against the other classifiers.
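Matching the modest classifier's object counts reduces to a quantile lookup on the BDT output: sort the scores and read off the cut value that selects the desired number of objects. A minimal sketch (toy scores; the real counts are in the table below):

```python
import numpy as np

def cut_for_count(scores, n_select):
    """Cut value (selecting from the high-score side) that keeps exactly
    n_select objects, i.e. pass if score >= returned value."""
    s = np.sort(np.asarray(scores))[::-1]   # descending order
    return s[n_select - 1]

scores = np.array([0.9, 0.7, 0.5, 0.3, 0.1])
c = cut_for_count(scores, 2)                # keep the top two objects
```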

Classifier   Stars             Galaxies
MODEST       2398453 (==2)     22614179 (==1)
AGRO         < 0.0689          > 0.205
AUTO         < -0.0406         > 0.106

Spread Model

Since it is clear that the BDT classifier performance differs between COSMOS and the full SVA1 field, we would like to perform a similar check for the modest classifier. The modest classifier does not return a probabilistic output, but it is simple enough that it dominantly depends on one composite variable (spread_model_i + 3*spreaderr_model_i). By plotting the distribution of this variable we should get a sense of how consistent the COSMOS and SVA1 results are.
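The dominant criterion can be written down in a few lines. This is a simplified sketch of the modest selection (only the composite-variable cut, at the 0.003 value discussed below; the full modest proposal includes additional conditions not reproduced here):

```python
import numpy as np

def modest_star(spread_model_i, spreaderr_model_i, cut=0.003):
    """Star/galaxy split on the composite variable dominating the modest
    proposal: spread_model_i + 3*spreaderr_model_i < cut selects stars.
    Simplified sketch; the full selection has further conditions."""
    comp = (np.asarray(spread_model_i)
            + 3.0 * np.asarray(spreaderr_model_i))
    return comp < cut

# A point-like object sits near zero spread; a galaxy has larger spread.
is_star = modest_star([0.0005, 0.008], [0.0004, 0.001])
```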

Spread Model

The distribution of the spread model variable appears to be fairly different between COSMOS and all of SVA1. The most noticeable difference appears in the galaxy-dominated (right) portion of the distribution. The amplitude of the stellar spike has also changed (likely due to the proximity of SPTE to the LMC), and its position has shifted slightly. However, it appears that the modest classifier cut (dashed lines) is adequately (if not better) matched to the full SVA1 sample.


It can be seen that the stellar samples coming from the BDT classifiers have significantly longer tails towards the right (galaxy) side of the distribution. While we expect the stellar spike to extend somewhat past the 0.003 cut of the modest classifier, this long tail is troubling.

Stellar Locus

Selecting comparable numbers of stars and galaxies, we can create color-color diagrams to examine the stellar locus. The stellar locus appears a bit tighter for the BDT classifiers, but there also appear to be a set of outliers with g-r ~ 0 and r-i ~ 0. From further investigation this population is largely dominated by objects with mag_auto_r ~ 99. This clump can be seen in all three galaxy distributions, but does not appear in the


Round 3

Based on the results of the variable validation, we've eliminated flux_radius and mu_max from the training set due to their dependence on the PSF. This essentially reduces our variable selection to spread_model, spreaderr_model, and mag_auto in each band. Out of a fear of over-training on the main sequence, we have also trained a classifier that uses only the i-band magnitude and morphological variables. As requested, we have also trained both of these classifiers on the COSMOS field alone.

Name Description
AUTO spread_model_*, spreaderr_model_*, mag_auto_*, full training set
AUTO COSMOS spread_model_*, spreaderr_model_*, mag_auto_*, cosmos training set
MORPH spread_model_*, spreaderr_model_*, mag_auto_i, full training set
MORPH COSMOS spread_model_*, spreaderr_model_*, mag_auto_i, cosmos training set
ROC Magnitude Dependence Magnitude Dependence (ROC)