SVA1 Gold v1.0 SG Validation

Contact: Alex Drlica-Wagner <>

Round 2

Classifier Output

We have run various star-galaxy classification algorithms on the SVA1 Gold Catalog v1.0 and collected their output here. Submissions should follow this set of guidelines...

Submission Guidelines:
  • Classifier output should be submitted in FITS file format
  • Each FITS file should contain two columns: (1) "COADD_OBJECTS_ID" (64-bit integer, FITS format 'K') and (2) "CLASS_OUTPUT" (4-byte float, FITS format 'E')
  • The "COADD_OBJECTS_ID" should match exactly (content and order) with that in sva1_gold_1.0_catalog_basic.fits
  • "CLASS_OUTPUT" should be a real-valued number in the range [0,1]
  • Objects with a more galaxy-like classification should have output values closer to 0, while star-like objects should be closer to 1. (This is an arbitrary choice, but it follows CLASS_STAR and seems to be the convention for most submissions.)

As a standard for comparison, we have included the modest star-galaxy classification from Eli (included by default in SVA1 Gold). Note, however, that the modest classifier does not follow the submission guidelines.

Algorithm | Output | Run Time | Documentation | Notes
MODEST | sva1_gold_1.0_catalog_basic.fits | - | here | galaxy-like = 1; star-like = 2
BDT AGRO | sva1_gold_1.0_catalog_DES_SGC_AGRO_v0.fits | ~200 us/object (2.66 GHz) | here | Round 2 BDT trained with mag_model values; 0.1% galaxy sample is ~60% efficient for stars; galaxy = 0, star = 1
BDT AUTO | sva1_gold_1.0_catalog_DES_SGC_AUTO_v0.fits | ~200 us/object (2.66 GHz) | here | Round 2 BDT trained with mag_auto values; 0.1% galaxy sample is ~60% efficient for stars; galaxy = 0, star = 1
Random F | sva1_gold_1.0_catalog_Random_F_v0.fits | not measured; parallelized | here | RF trained with mag_auto and mag_model values; hyperparameters optimized via CV; galaxy = 0, star = 1; output for round 2 training objects set to exactly 0 or 1
Extra Random F | sva1_gold_1.0_catalog_Extra_Random_F_v0.fits | not measured; parallelized | here | RF trained with mag_auto and mag_model values; hyperparameters optimized via CV; galaxy = 0, star = 1
TPZ | sva1_gold_1.0_catalog_TPZ_CLASS_v1.fits | parallelized | here | output variable named CLASSIFIER_OUTPUT; galaxy = 0, star = 1
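For reference, the efficiency figures quoted in the BDT notes (a star sample with 0.1% galaxy contamination being ~60% efficient for stars) can be computed from a labeled sample with a sketch like the following. The function name and the threshold-scan approach are mine, not part of the BDT code.

```python
import numpy as np

def star_efficiency_at_contamination(class_output, is_star, max_gal_frac=0.001):
    """Scan star cuts from strict to loose (outputs near 1 are star-like)
    and return the star efficiency of the loosest cut whose selected
    sample keeps the galaxy fraction at or below `max_gal_frac`."""
    order = np.argsort(-np.asarray(class_output))
    stars = np.asarray(is_star, dtype=float)[order]
    n_sel = np.arange(1, len(stars) + 1)
    gal_frac = np.cumsum(1.0 - stars) / n_sel      # contamination at each cut
    ok = np.where(gal_frac <= max_gal_frac)[0]
    if len(ok) == 0:
        return 0.0
    return stars[:ok.max() + 1].sum() / stars.sum()
```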
Algorithm | Input Features
BDT AGRO | spread_model, spreaderr_model, flux_radius, mu_max, mag_psf, mag_model for g,r,i,z
BDT AUTO | spread_model, spreaderr_model, flux_radius, mu_max, mag_psf, mag_auto for g,r,i,z
Random F | spread_model, spreaderr_model, mu_max, mag_model, mag_auto for g,r,i,z
Extra Random F | spread_model, spreaderr_model, mu_max, mag_model, mag_auto for g,r,i,z
TPZ | mag_detmodel, mag_model, mag_psf, spread_model, flux_radius for g,r,i,z; colors from detmodel, and mag_psf minus mag_model


Classifier Performance

Classifier Output Distributions

For a first glance at classifier performance, I plot the classifier outputs for the round 2 visible test sample along with the true classification of the objects. A few interesting notes:
  • The output distributions of the classifiers often change drastically between the TEST data set and the SVA1 data set.
  • This change is difficult to interpret, since the depth, observing conditions, and star/galaxy composition of SVA1 differ from the TEST field. However, there appears to be a consistent trend: the number of objects with intermediate classifier output values increases relative to the rest of the sample. Nacho: indeed, there are differences in the star/galaxy ratio, simply because of galactic latitude and the presence of the LMC (as can be seen in the modest_class distributions, which are independent of training). I think the conditions of the COSMOS observations (93% of the training sample) were fairly homogeneous, and COSMOS is deeper. This combination may be misleading the classifications in SPTE, which constitutes most of SVA1.
  • This suggests that the classifiers are more uncertain about the class of the objects in the SVA1 sample.
  • This is troubling since it suggests that the purity/completeness values derived in the TEST sample will not apply to SVA1 at large.
  • Interestingly, the modest classifier seems more robust transitioning between the data sets. Nacho: I guess this is due to not relying on the training.
  • The feature at P~0.85 that appears for the Random_F algorithms when applied to the SVA1 data set is suspicious.

TBD: figure out why TPZ is more robust. (Nacho's hypothesis) Randomization using the errors in the attribute vector allows a more robust generalization.


By magnitude cut:


Magnitude Dependence

In the description of the SDSS star-galaxy classification here, there is a set of plots (copied below) showing the number of star/galaxy objects as a function of magnitude.

SDSS Stars SDSS Galaxies

We wanted to make comparable distributions for the SVA1 objects. We start by plotting the distribution of objects in the Round 2 training sample.

The stellar distribution here looks quite different from that in the SDSS plots. It's true that at large enough magnitudes we expect to run out of stars, but it seems strange for this to be happening for such bright objects. Additionally, the steep rise in the stellar distribution seen in SDSS is much flatter in this sample. To try to dig deeper into this, I plot the distribution of stars and galaxies directly from the COSMOS ACS and COMBO-17 CDF-S field (I haven't been able to find a catalog for the S11 field used by SDSS). These distributions resemble the DES-matched COSMOS training sample more than the SDSS sample.


Flavia Sobreira went on to plot the distribution of star-like objects (using the upper 10% cuts defined above) for each of the classifiers applied to all of SVA1 (SM = spread model cut). The sharp features in the stellar distributions for the multivariate classifiers are quite troubling.

For another, similar view of this effect, I've plotted the distributions for each classifier individually, along with the true star and galaxy distributions from the Round 2 testing sample and the star-like and galaxy-like objects from the test sample and all of SVA1. The difference in depth between the testing and SVA1 samples is obvious. Additionally, because the star-like cut value is set to match the MODEST proposal on the full SVA1 data sample, the other classifiers do not match the testing distribution as well as they could for a different cut value. Interestingly, the MODEST classifier appears to do best at capturing faint stars while TPZ appears to do worst. Could TPZ's insensitivity to faint stars play a role in its more robust performance on the shallower SVA1?

Algorithm | Star Cut | Magnitude Distribution
Random_F | CLASS_OUTPUT > 0.415 |
Extra_Random_F | CLASS_OUTPUT > 0.418 |

I've further broken down the classifier outputs based on magnitude. The 2D histogram for the modest classifier isn't too useful, but the others are pretty enlightening. All classifiers show a magnitude dependence in their output. NOTE: There is something wrong with the Random Forest output.
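The 2D breakdown can be reproduced with numpy alone. Normalizing each magnitude column separately (my choice here, not necessarily what was done for the plots) makes the magnitude dependence visible at all depths; the bin ranges are illustrative.

```python
import numpy as np

def output_vs_mag_hist(mag, class_output, bins=40, mag_range=(16.0, 26.0)):
    """2D histogram of classifier output vs. magnitude, with each
    magnitude column normalized to unit sum so trends remain visible
    even where the counts are low."""
    H, mag_edges, out_edges = np.histogram2d(
        mag, class_output, bins=bins, range=[mag_range, (0.0, 1.0)])
    col = H.sum(axis=1, keepdims=True)
    # Avoid dividing empty magnitude columns by zero
    H = np.divide(H, col, out=np.zeros_like(H), where=col > 0)
    return H, mag_edges, out_edges
```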


Comparison to Spread Model

The "spread_model" variables provide a basic prescription for object classification and are the basis for the modest classifier. They have also been found to shift between coadd tiles of varying depth and may introduce a spatially dependent systematic in object classification. Here I look at the distribution of spread_model in the testing sample as well as the SVA1 sample. I plot the total distribution and the distribution for star-class objects. For each classifier, the star-class cut is chosen as the 10% most star-like objects (based on classifier output). This fraction was chosen to roughly match the output of the modest classifier and to normalize the samples so that the number of stellar-class objects matches for all classifiers. We expect the star-class sample for the multivariate classifiers to be more complete than the modest classifier counterpart.
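The 10% star-class cut described above corresponds to a simple quantile of the classifier output; a minimal sketch (the function name is mine):

```python
import numpy as np

def star_class_cut(class_output, star_fraction=0.10):
    """Output threshold above which the most star-like `star_fraction`
    of objects lie (star-like outputs are closer to 1)."""
    return float(np.quantile(class_output, 1.0 - star_fraction))
```

Applied to the full SVA1 output of each classifier, this is one way to obtain fixed-fraction cuts like the CLASS_OUTPUT thresholds tabulated below, although the tabulated values were tuned to match MODEST rather than derived exactly this way.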

Some observations:
  • By construction, the modest classifier contains all objects within the stellar spread_model spike and has no tail to larger values. In the testing sample, this results in a fairly impure stellar sample (~20% contamination?)
  • The other classifiers make a less strict cut and select a range of spread_model values. Interestingly, many of the models have large tails to high values of spread_model. Nacho: it would be interesting to check at what magnitudes this happens.
  • These tails could be the result of spatially dependent shifts of spread model as observed here
  • The TPZ classifier appears to have the least significant tail at high spread_model values. It also includes the largest fraction of objects within the stellar spike.
Algorithm | Star Cut | Spread TEST | Spread SVA1
Random_F | CLASS_OUTPUT > 0.415 | |
Extra_Random_F | CLASS_OUTPUT > 0.418 | |

The tails seem to be prevalent at all magnitudes and are not related to the LMC either. The map below gives an indication of where they cluster within SPTE, but be warned that no correction for depth has been done.

Interestingly, this appears to correlate strongly with the regions of large chisq found by fitting the multi-band photometry to galaxy templates with the photo-z code, as discussed on the des-wl-test email list here.

A cut in 'crazy colors' goes a long way towards removing this structure. chisq is the multiband chisq fit of LePhare (see previous linked page for catalog from Carles Sanchez and Chris Bonnett).

Most objects beyond chisq=1000 are stars according to modest_class, except for the ones that are at chisq=1e10 (!).
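The "crazy colors" cleaning described above amounts to a simple chisq mask; a sketch, with the chisq = 1000 boundary taken from the note above and the function name mine:

```python
import numpy as np

def crazy_color_mask(chisq, max_chisq=1000.0):
    """Boolean mask keeping objects whose multiband template-fit chisq
    is finite and below the cut; the default boundary is illustrative."""
    chisq = np.asarray(chisq, dtype=float)
    return np.isfinite(chisq) & (chisq < max_chisq)
```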

Comparison to Stellar Locus

Another low-level cross check of classifier performance should be possible using the color-color distance of objects from the stellar locus. This selection will not provide as pure a stellar sample as spread_model, but should be a fairly independent selection of stellar objects. First, I plot color-color diagrams (using mag_psf) for each of the classifiers, where star-like objects are selected as the top 10% based on classifier output.

Some observations:
  • Galaxy contamination in the stellar sample is especially apparent just above the horizontal branch in the g-r plots and the tails below the locus in the i-z plots.
  • The multivariate classifiers have much tighter loci than the modest classifier in the testing sample.
  • However, many of the classifiers appear to have more contamination than the modest classifier in the SVA1 sample. (An especially clear case is the g-r plot for the TMVA_BDT_AGRO.)
  • Of the classifiers, the TPZ appears to have the least noticeable degradation in the full SVA1 sample.
  • It would be interesting to find out if the improved performance of the TPZ is related to the variable choice or is algorithmic.
  • It should also be possible to use these plots to identify classes of objects that are consistently mis-classified.
Algorithm | Star Cut | Stellar Locus (g-r) | Stellar Locus (i-z)
Random_F | CLASS_OUTPUT > 0.415 | |
Extra_Random_F | CLASS_OUTPUT > 0.418 | |

Stellar Locus Distance

In the hope of simplifying these two-dimensional stellar locus plots into a single distinguishing variable (like the previous spread_model distributions), I derive a distance from the stellar locus for each object. I define the stellar locus by eye from the g-r and i-z color-color plots using the true stars in the round 2 training sample. The resulting locus definitions appear as the solid black lines on the previous plots and in the tables below.

g-r | r-i
-0.37 | -0.31
0.25 | 0.05
1.10 | 0.39
1.33 | 0.60
1.45 | 1.35
1.71 | 2.00

i-z | r-i
-0.25 | -0.3
0.26 | 0.46
0.48 | 1.14
0.88 | 2.00

Below I plot the distance of each object from each of the stellar loci. In addition, I plot the "Total Distance" defined as the distances from the two loci added in quadrature.

In addition to the "physical" distances in color space, I define a "relative" distance by dividing the distance from the stellar locus by the uncertainty in the object's color (calculated from the magnitude errors in each band summed in quadrature). Again, I create a "total" relative distance by summing the two relative distances in quadrature.
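The distances described in the last two paragraphs can be computed as point-to-polyline distances using the vertices tabulated above (shown here for the g-r vs. r-i locus). This is a sketch of the idea; the helper names are mine.

```python
import numpy as np

# (g-r, r-i) locus vertices from the table above
GR_RI_LOCUS = np.array([[-0.37, -0.31], [0.25, 0.05], [1.10, 0.39],
                        [1.33, 0.60], [1.45, 1.35], [1.71, 2.00]])

def locus_distance(points, locus):
    """Minimum Euclidean distance in color-color space from each point
    to the piecewise-linear stellar locus."""
    points = np.atleast_2d(points).astype(float)
    best = np.full(len(points), np.inf)
    for a, b in zip(locus[:-1], locus[1:]):
        ab = b - a
        # Project onto the segment and clip to its endpoints
        t = np.clip((points - a) @ ab / (ab @ ab), 0.0, 1.0)
        proj = a + t[:, None] * ab
        best = np.minimum(best, np.linalg.norm(points - proj, axis=1))
    return best

def total_distance(d_gr, d_iz):
    """'Total distance': the two locus distances added in quadrature."""
    return np.hypot(d_gr, d_iz)

def relative_distance(dist, magerr_a, magerr_b):
    """'Relative distance': locus distance divided by the color
    uncertainty (band magnitude errors summed in quadrature)."""
    return dist / np.hypot(magerr_a, magerr_b)
```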

Round 3

In validation, we compare general distributions of the classifier values in SVA1:
  • Slightly more uncertain classification of objects by TPZSG when generalizing to the whole SVA1 area.
  • TPZ N(m) for stars has a kink related to LMC stars, to be investigated.
  • N(m) for galaxies have slightly weird distributions at bright magnitudes for lower redshifts (TPZ mean redshifts < 0.5), for both modest and TPZSG.
  • Star distribution shows the LMC overdensities.
Algorithm | Output | Run Time | Documentation | Notes
MODEST | sva1_gold_1.0_catalog_basic.fits | - | here | galaxy-like = 1; star-like = 2
TPZ (magnitudes only) | sva1_gold_1.0.2_catalog_TPZ_class_all_mag.fits | parallelized | here | round 3 TPZ trained with mag_psf and mag_model values; galaxy-like = 0; star-like = 1
TPZ (magnitudes + colors) | sva1_gold_1.0.2_catalog_TPZ_class_all_color.fits | parallelized | here | round 3 TPZ trained with mag_psf, mag_model, and colors from detmodel; galaxy-like = 0; star-like = 1
TPZ COSMOS (magnitudes only) | sva1_gold_1.0.2_catalog_TPZ_class_cosmos_mag.fits | parallelized | here | round 3 TPZ trained with mag_psf and mag_model values on the COSMOS field only; galaxy-like = 0; star-like = 1
TPZ COSMOS (magnitudes + colors) | sva1_gold_1.0.2_catalog_TPZ_class_cosmos_color.fits | parallelized | here | round 3 TPZ trained with mag_psf, mag_model, and colors from detmodel on the COSMOS field only; galaxy-like = 0; star-like = 1
TPZ (magnitudes, colors, mag_psf-mag_model) | sva1_gold_1.0_catalog_TPZ_class_all.fits | parallelized | here | round 3 TPZ trained with mag_psf, mag_model, colors from detmodel, and mag_psf-mag_model; galaxy-like = 0; star-like = 1
BDT (mag_auto, spread_model) | sva1_gold_1.0_catalog_model_DES_SGC_AUTO.fits | - | here | round 3 BDT trained with mag_auto, spread_model, spreaderr_model
BDT (mag_auto, spread_model) | sva1_gold_1.0_catalog_model_DES_SGC_AUTO_COSMOS.fits | - | here | round 3 BDT trained with mag_auto, spread_model, spreaderr_model from the COSMOS field only

As with round 2, the TPZSG classification (and I would predict the rest as well) is more uncertain outside the training area. The plot below shows this behavior for the benchmark sample (SPTE with an i < 22.5 cut) vs. the COSMOS area, where 67% of the training comes from.

The N(m) distributions for objects classified as stars/galaxies are shown below. The selection is done as follows:

Algorithm | Stars | Galaxies
TPZSG_CLR_ALLFIELDS (benchmark) | >0.90 | <0.089

Bear in mind that the catalog posted on Nov. 12th [[des-photoz:TPZ_photo-z_PDF_and_N(z)_using_sparse_representation_SVA1_Gold|here]] for testing uses these cuts.

Algorithm | Stars | Galaxies
TPZSG_PSF_ALLFIELDS (benchmark) | >0.90 | <0.14

TPZSG_ALL_CLR was chosen from the ROC results (though arguably ALL_MAG does better) and is the only code available for the whole SVA1 area, so it is showcased here. The galaxy cut was chosen to provide an overall 99% purity in the calibration sample. The star cut is somewhat arbitrary at the moment.
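The 99%-purity galaxy cut can be derived from the calibration sample by scanning thresholds; the sketch below is my reconstruction of that procedure, not the actual code used.

```python
import numpy as np

def galaxy_cut_for_purity(class_output, is_star, target_purity=0.99):
    """Largest threshold c such that the galaxy sample selected by
    class_output <= c has a true-galaxy fraction >= target_purity on a
    calibration sample with known truth. Returns None if no threshold
    achieves the target."""
    order = np.argsort(class_output)
    stars = np.asarray(is_star, dtype=float)[order]
    out = np.asarray(class_output, dtype=float)[order]
    # Purity of the galaxy sample as the cut is loosened object by object
    purity = np.cumsum(1.0 - stars) / np.arange(1, len(out) + 1)
    ok = np.where(purity >= target_purity)[0]
    if len(ok) == 0:
        return None
    return out[ok.max()]
```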

The strange kink seen for stars at brighter magnitudes comes from the LMC area. There are fewer galaxies towards the faint end with the TPZ selection, but results indicate that they form a purer sample. About a 30% net gain in correctly identifying stars in the galaxy sample is achieved going from MODEST to TPZSG (according to the calibration sample).

The N(m) distributions as a function of TPZ photo-z bin (pre-Nov 2014 calibrations):

The strange features at low TPZ bins are to be investigated:

Matías Carrasco-Kind has produced the same histograms (TPZSG to be added later), showcasing the difference between using the most probable value and using the weight of the galaxy in that bin. You can check that part of the strange distribution goes away; now we have to check whether this indeed has an impact on the clustering:

Below, the modest and TPZ galaxy and star densities in SPTE:

Below, the stellar loci of stars for modest and TPZ (thanks Edward!):

Summary of telecon (July 10th 2014)

What have we found
  • 5 codes have been run on SVA1, based on round 2 training: 2 flavors of BDT, 2 flavors of Random Forests, TPZ.
  • Machine-learning methods seem more uncertain in assigning a class in SVA1 as a whole with respect to COSMOS (the training set is 90% COSMOS). TPZ is slightly less affected. Sample variance, the extra depth of COSMOS, or especially good observing conditions in COSMOS could be playing a role.
  • The star number-count distribution as a function of magnitude is very irregular for BDT and RF, resembling neither the training set nor those from other datasets. Up to a point, TPZ is somewhat more robust. Modest keeps the shape.
  • The galaxy number-count distribution as a function of magnitude does not seem as affected (though the statistical impact is smaller there, even if contamination is present).
  • Eventually use multi-epoch spread_model information, as Erin is producing for WL.
Way forward: work towards round 3.
  • Alex to identify physically motivated, robust parameters to use (progress reported here). More contributions are welcome.
  • Chris will create new catalog with chisq of fit to templates including star templates.
  • Nacho will work towards creating the new training/test set using Eli's shallow coadds (currently testing and comparing them) and the new spectroscopic catalogs from Chris. Nacho: possibly incorporate systematics-map info (for plotting the conditions of the training set; maybe too noisy to train on as well).
  • Alex, Chris, Edward to eventually train on new round 3 training set and run on SVA1 Gold (maybe separate by seeing conditions, one that matches the training set?).