SG separation challenge

Purpose and status as of July 20th 2014:

Rounds 1 and 2 have verified that we can do better than the standard DESDM classifiers, at least in the fields we have trained on. We are now moving beyond these fields, in particular to SPTE, and have applied some tests to understand the quality of the classification without truth values. We have found some puzzling behavior, especially for the stars, and we have to understand these features before providing these catalogs to the collaboration. Round 3 will therefore center on calibration fields closer to DES survey characteristics and on larger spectroscopic samples that include more stars. It will also keep an eye on the particular observing conditions of the training fields and check whether the regions of SPTE with similar conditions behave as expected. The goal is to provide, in the short term, one or more classifiers that are well backed up by plots and results from this challenge showing their behavior.

Details and results

Now that several people are testing their own approaches:

  • Cut-based with DESDM info (Eli, Diego, Nacho, Ryan, William...) --> Modest.
  • Multi-class (Maayane)
  • Boosted Decision Trees (Nacho, Alex)
  • Alternative Neural Network with probabilistic output (Chris Bonnett).
  • Probability based on spread model and photometry (DES-Brazil)
  • Random Forests (Ryan)
  • Others...

I think the time is right and the codes are mature to launch a specific SG separation challenge, mimicking the successful photo-z WG exercise.

We have to establish:

  • The training/validation/testing sample (COSMOS, others).
    I have prepared a 70/30 training/testing with the deep COSMOS field matched to ACS imaging. About 280 parameters, up to each tester to choose which.
    Besides new datasets, also consider shallower COSMOS. Also consider fixed set of parameters as Eduardo suggests. Also need to add SLR corrections though I think not very important now.
  • Only stars and galaxies? What about QSOs, image artifacts?
    Star/galaxy for round 1.
  • The metrics (Fixed cut, Fixed purity, Fixed Efficiency, ROC -- see example below).
    I would prefer to use ROC, i.e., the True Positive Rate vs False Positive Rate curve formed by changing the threshold (thanks Alex for pointing out the mistake in the previous ROC!).
  • SVA1 systematics: correlations with depth, Galactic latitude, seeing, etc.
  • Who/how to run it.
    I suggest that each group provide an output file with id (or ra,dec on first round) plus galaxy probability or binary value.
  • Is there any gain combining them (a committee)?
  • The schedule.
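
As a rough illustration of the committee question above, here is a minimal sketch that averages the galaxy probabilities submitted by several groups. It assumes each submission has already been read into a dictionary mapping object id to galaxy probability. The function name, the toy ids and the unweighted mean are all illustrative assumptions, not an agreed procedure.

```python
import numpy as np

def committee_probability(submissions):
    """Average galaxy probabilities over the ids common to all submissions.

    `submissions` is a list of dicts {object_id: p_galaxy} (hypothetical format).
    """
    # Keep only objects that every group has classified
    common_ids = sorted(set.intersection(*(set(s) for s in submissions)))
    # One row per submission, one column per common object
    stacked = np.array([[s[obj_id] for obj_id in common_ids] for s in submissions])
    # Unweighted mean: one simple choice among many possible combinations
    return common_ids, stacked.mean(axis=0)

# Toy submissions from two hypothetical codes
bdt = {1: 0.9, 2: 0.2, 3: 0.6}
rf = {1: 0.7, 2: 0.4, 3: 0.8, 4: 0.5}
ids, p_committee = committee_probability([bdt, rf])
```

Whether such a combination actually improves on the individual codes would have to be checked with the same metrics as the individual submissions.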

Comparison metrics

There are a number of metrics that can be used for comparing the performance of classifiers. Some especially useful metrics are those defined in the DES star/galaxy separation (on simulation) paper arXiv:1306.5236 and the receiver operating characteristic generally used for classifier comparison.

Completeness and Purity provided by a given classifier

We define the parameters used to quantify the quality of a star/galaxy classifier. For a given class of objects, X (stars or galaxies), we distinguish the surface density of properly classified objects, N_X , and the misclassified objects, M_X .

  • The galaxy completeness c^g is defined as the ratio of the number of true galaxies classified as galaxies to the total number of true galaxies.
  • The stellar contamination f_s is defined as the ratio of the number of stars classified as galaxies to the total number of objects classified as galaxies.
  • The purity p^g is defined as 1-f_s
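
The definitions above can be sketched in a few lines of code. This is a minimal example assuming boolean arrays for the true and the assigned class. The function and variable names are illustrative, and raw counts are used instead of surface densities, which gives the same ratios as long as the area is the same for all terms.

```python
import numpy as np

def galaxy_completeness_purity(is_galaxy_true, is_galaxy_pred):
    """Return (completeness c^g, contamination f_s, purity p^g) of the galaxy sample."""
    truth = np.asarray(is_galaxy_true, dtype=bool)
    pred = np.asarray(is_galaxy_pred, dtype=bool)
    n_g = np.sum(truth & pred)    # N_G: galaxies classified as galaxies
    m_g = np.sum(truth & ~pred)   # M_G: galaxies classified as stars
    m_s = np.sum(~truth & pred)   # M_S: stars classified as galaxies
    # Assumes at least one true galaxy and one object classified as galaxy
    completeness = n_g / (n_g + m_g)
    contamination = m_s / (n_g + m_s)
    return completeness, contamination, 1.0 - contamination

# Toy catalog: 5 true galaxies and 5 true stars
truth = np.array([1, 1, 1, 1, 0, 0, 0, 0, 1, 0], dtype=bool)
pred = np.array([1, 1, 1, 0, 1, 0, 0, 0, 1, 0], dtype=bool)
c_g, f_s, p_g = galaxy_completeness_purity(truth, pred)
```

In this toy case one galaxy is lost and one star leaks into the galaxy sample, so c^g = 0.8, f_s = 0.2 and p^g = 0.8.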

Below are three different plots we suggest using to assess the performance of each classifier.


Example, on simulations, from arXiv:1306.5236

purity as a function of magnitude (for fixed completeness, the threshold/cut is left free)

completeness as a function of magnitude (for fixed purity, the threshold/cut is left free)
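
One possible way to build the purity vs magnitude curve at fixed completeness is sketched below. It assumes a classifier that outputs a galaxy probability per object. In each magnitude bin the threshold is left free and set to the loosest value that reaches the target galaxy completeness. The function name, the toy catalog and the 90% target are illustrative assumptions, not the procedure used in arXiv:1306.5236.

```python
import numpy as np

def purity_at_fixed_completeness(p_gal, is_galaxy, mag, mag_edges, target=0.9):
    """Galaxy purity per magnitude bin, with a per-bin threshold tuned so that
    the galaxy completeness in that bin is at least `target`."""
    p_gal = np.asarray(p_gal, dtype=float)
    is_galaxy = np.asarray(is_galaxy, dtype=bool)
    mag = np.asarray(mag, dtype=float)
    n_bins = len(mag_edges) - 1
    purities = np.full(n_bins, np.nan)  # NaN marks bins without true galaxies
    for i in range(n_bins):
        in_bin = (mag >= mag_edges[i]) & (mag < mag_edges[i + 1])
        gal_scores = np.sort(p_gal[in_bin & is_galaxy])[::-1]  # descending
        if gal_scores.size == 0:
            continue
        # Smallest number of galaxies to keep so that completeness >= target
        # (the small offset guards against floating-point round-up)
        k = max(int(np.ceil(target * gal_scores.size - 1e-9)), 1)
        threshold = gal_scores[k - 1]
        selected = in_bin & (p_gal >= threshold)
        purities[i] = np.sum(selected & is_galaxy) / np.sum(selected)
    return purities

# Toy catalog: a bright bin with clean separation and a faint bin with overlap
p_gal = np.array([0.90, 0.95, 0.99, 0.97, 0.05, 0.10,
                  0.95, 0.90, 0.85, 0.80, 0.75, 0.70, 0.65, 0.60, 0.55, 0.20,
                  0.10, 0.15, 0.30, 0.58])
is_galaxy = np.array([1] * 4 + [0] * 2 + [1] * 10 + [0] * 4, dtype=bool)
mag = np.array([20.5] * 6 + [22.5] * 14)
purities = purity_at_fixed_completeness(p_gal, is_galaxy, mag, [20.0, 21.0, 22.0, 23.0])
```

The completeness vs magnitude curve at fixed purity can be obtained in the same spirit by scanning the threshold in each bin until the target purity is reached.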

Receiver operating characteristics

The receiver operating characteristic (ROC) provides another tool for evaluating the performance of classifiers. The ROC provides some information orthogonal to that in the completeness vs purity plots:

  • Because ROCs compare the true positive rate to the false positive rate, they do not depend on relative composition of the test sample. Thus, unlike the purity, they contain information only about the intrinsic performance of the classifier and not the test sample.
  • ROCs allow classifiers to be compared without requiring a threshold/cut to be placed on the output. This is useful because different projects possess different requirements on object sample, completeness, purity, etc. The area under the ROC can serve as a very high-level scalar metric for classifier performance.
  • Once a threshold/cut is placed, we can generate magnitude dependent true positive vs false positive rate plots which would be intrinsic to the classifiers.
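
The points above can be illustrated with a small sketch that builds the ROC by sweeping the threshold over a galaxy probability and integrates it with the trapezoid rule. The function names and toy values are illustrative assumptions, and the sketch treats every object as its own threshold step, so tied scores between stars and galaxies are not handled specially. The last lines triple the number of stars to show that the area under the ROC does not change with the composition of the test sample.

```python
import numpy as np

def roc_curve(p_gal, is_galaxy):
    """Return (fpr, tpr) obtained by lowering a threshold on the galaxy probability."""
    p_gal = np.asarray(p_gal, dtype=float)
    is_galaxy = np.asarray(is_galaxy, dtype=bool)
    order = np.argsort(-p_gal, kind="mergesort")  # stable sort, descending score
    sorted_truth = is_galaxy[order]
    tp = np.cumsum(sorted_truth)    # galaxies above the threshold
    fp = np.cumsum(~sorted_truth)   # stars above the threshold
    # Prepend the (0, 0) point, i.e. a threshold above every score
    tpr = np.concatenate(([0.0], tp / tp[-1]))
    fpr = np.concatenate(([0.0], fp / fp[-1]))
    return fpr, tpr

def area_under_curve(fpr, tpr):
    """Trapezoid-rule area under the ROC."""
    return float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0))

# Toy sample: 3 galaxies and 3 stars
p_gal = np.array([0.9, 0.8, 0.7, 0.6, 0.4, 0.3])
is_galaxy = np.array([1, 1, 0, 1, 0, 0], dtype=bool)
auc = area_under_curve(*roc_curve(p_gal, is_galaxy))

# Same classifier output, but three times as many stars in the test sample
p_more = np.concatenate([p_gal, np.repeat(p_gal[~is_galaxy], 2)])
truth_more = np.concatenate([is_galaxy, np.zeros(6, dtype=bool)])
auc_more_stars = area_under_curve(*roc_curve(p_more, truth_more))
```

Both areas equal 8/9 here, while the purity of any fixed cut would drop in the star-rich sample.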

Summary of telecons

July 10th 2014

What have we found
  • 5 codes have been run on SVA1, based on round 2 training: 2 flavors of BDT, 2 flavors of Random Forests, TPZ.
  • Machine learning methods seem more uncertain in assigning a class in SVA1 as a whole than in COSMOS (the training set is 90% COSMOS). TPZ is slightly less affected. Sample variance, the extra depth of COSMOS, or the especially good conditions of COSMOS could be playing a role in this.
  • Star number count distribution as a function of magnitude is very irregular for BDT and RF, not looking like the training set or the ones from other datasets. Up to some point TPZ is somewhat more robust. Modest keeps the shape.
  • Galaxy number count distribution as a function of magnitude does not seem to be affected as much (statistically the impact is smaller, though, even if some contamination is present).
  • Eventually use multi-epoch spread_model information, as Erin is producing for WL.
Way forward: work towards round 3.
  • Alex to identify physically motivated, robust parameters to use (progress reported here). More contributions are welcome.
  • Chris will create new catalog with chisq of fit to templates including star templates.
  • Nacho will work towards creating the new training/test set using Eli's shallow coadds (currently testing and comparing them) and the new spectroscopic catalogs from Chris. Nacho: Possibly incorporate systematics map info (for plotting conditions of training set, maybe too noisy to train on that too).
  • Alex, Chris, Edward to eventually train on new round 3 training set and run on SVA1 Gold (maybe separate by seeing conditions, one that matches the training set?).

December 18th 2014

Present: Alex, Chris, Edward, Maayane, Nacho

- Nacho: review of end of year situation. Round 3 says that ML codes are better than SExtractor's, including weighted average spread_model variant. However, results are similar in terms of purity and completeness in stripe 82 area. See details page and attached presentation (XXX_Argonne.pdf). A lot of tests of impact in a particularly sensitive science case with SVA1 data: determination of bias.
There is indeed an impact of 1-3 sigma on certain photo-z bins, vs modest_class.
- Maayane: suggests using pre-processed inputs via PCA or similar. Good results with Chris's Random Forests.
- Alex: should figure out before fine-tuning too much if we can sacrifice some performance for good generalization.
- Chris: Generalization to the SPTE area is not obvious, especially in the LMC (certain star colors may not be represented at all). See plots attached to the page (g_r_XXX.png and i_z_XXX.png).
- Solutions can go from reweighting at the training level to adapt to application color space, to some sort of 'prior' approach in which we take into account the position of sources (e.g. if near LMC, use only morphometry), adding simulated or LMC datasets to the training.
- Chris: some ideas for the future, 4th moment information from object images (KIDs), SG from images, not catalogs (BTW, in fact spread_model has started pioneering this! Also I know Robert B has been working on it with a student here).

Not commented on telecon:
  • New COSMOS Y1 and Y5 reruns soon
  • I think we are sort of converging towards using a PCA+random forest and/or TPC approach for all of these tests. Do you guys agree?

- Short term (till January): new TPC calibration (colors and weighted average spread_model) on COSMOS + spectroscopic fields, test on stripe 82. Make some tests to see if it can replace spread_model in bias paper:

  • Purity and completeness as function of photoz
  • Star/galaxy ratio on SPTE as function of photoz
  • Color-Color plots of calibration vs spte and stripe 82 vs spte.
  • (your idea here)

I should look at your round 3 submissions.

- Mid term (early April):

  • Fully develop the tests and procedures for SG classifier test, upload to repository.
  • Attack the color representativeness issue
  • Can we provide a Bayesian probabilistic output?
  • Catalog for collaboration
  • Paper!

- Long term:

  • Further improvements, Bayesian, 4th moment
  • QSOs, what has been done with this within the collaboration