Ignacio Sevilla, 12/18/2014 09:33 AM
h1. SG separation challenge
Purpose and status as of July 20th 2014:
Rounds 1 and 2 verified that we can do better than the standard DESDM classifiers, at least in the fields we trained on. We are now moving beyond these fields, in particular to SPTE, and have applied some tests to understand the quality of the classification without truth values. We have found some puzzling behavior, especially for the stars. Before providing these catalogs to the collaboration, we have to understand these features. Round 3 will therefore focus on calibration fields closer to DES survey characteristics and on larger spectroscopic samples that include more stars. It will also keep an eye on the particular observing conditions of the training fields and check whether regions in SPTE with similar conditions behave as expected. The goal is to provide, in the short term, one or more classifiers that are well backed up by plots and results from this challenge showing their behavior.
*[[des-sci-verification:SG_separation_challenge_details|Details and results]]*
Now that several people are testing their own approaches:
* Cut-based with DESDM info (Eli, Diego, Nacho, Ryan, William...).
* Multi-class (Maayane)
* Random Forests (Ryan)
* Boosted Decision Trees (Nacho, Alex)
* Alternative Neural Network with probabilistic output (Chris Bonnett).
* Probability based on spread model and photometry (DES-Brazil)
I think the time is right, and the codes are mature enough, to launch a specific SG separation challenge, mimicking the successful photo-z WG exercise.
We have to establish:
* The training/validation/testing sample (COSMOS, others).
I have prepared a 70/30 training/testing split with the deep COSMOS field matched to ACS imaging. About 280 parameters are available; it is up to each tester to choose which to use.
Besides new datasets, also consider a shallower COSMOS. Also consider a fixed set of parameters, as Eduardo suggests. SLR corrections also need to be added, though I think they are not very important for now.
* Only stars and galaxies? What about QSOs, image artifacts?
Star/galaxy for round 1.
* The metrics (Fixed cut, Fixed purity, Fixed Efficiency, ROC -- see example below).
I would prefer to use the ROC, i.e., the True Positive Rate vs False Positive Rate curve formed by varying the threshold (thanks Alex for pointing out the mistake in the previous ROC!).
* SVA1 systematics: correlations with depth, Galactic latitude, seeing, etc.
* Who/how to run it.
I suggest that each group provide an output file with id (or ra,dec in the first round) plus a galaxy probability or binary value.
* Is there any gain combining them (a committee)?
* The schedule.
h1. Comparison metrics
There are a number of metrics that can be used for comparing the performance of classifiers. Some especially useful metrics are those defined in the DES star/galaxy separation (on simulation) paper "arXiv:1306.5236":http://arxiv.org/abs/1306.5236 and the "receiver operating characteristic (ROC)":http://en.wikipedia.org/wiki/Receiver_operating_characteristic generally used for classifier comparison.
h2. Completeness and Purity provided by a given classifier
We define the parameters used to quantify the quality of a star/galaxy classifier. For a given class of objects, X (stars or galaxies), we distinguish the surface density of properly classified objects, N_X , and the misclassified objects, M_X .
* The galaxy completeness c^g is defined as the ratio of the number of true galaxies classified as galaxies to the total number of true galaxies.
* The stellar contamination f_s is defined as the ratio of the number of stars classified as galaxies to the total number of objects classified as galaxies.
* The purity p^g is defined as 1-f_s.
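The definitions above can be sketched in a few lines of Python. This is only an illustration, assuming boolean truth and classification flags per object; the function and argument names are hypothetical and do not correspond to any code used in the challenge.

```python
import numpy as np

def galaxy_completeness_purity(is_galaxy_true, is_galaxy_pred):
    """Return (c^g, f_s, p^g) for a binary star/galaxy classification.

    Both inputs are per-object flags (True = galaxy, False = star).
    The names are illustrative assumptions, not from the challenge code.
    """
    truth = np.asarray(is_galaxy_true, dtype=bool)
    pred = np.asarray(is_galaxy_pred, dtype=bool)

    n_g = np.sum(truth & pred)    # galaxies classified as galaxies (N_g)
    m_g = np.sum(truth & ~pred)   # galaxies classified as stars (M_g)
    m_s = np.sum(~truth & pred)   # stars classified as galaxies (M_s)

    n_true_gal = n_g + m_g        # all true galaxies
    n_class_gal = n_g + m_s       # all objects classified as galaxies

    completeness = n_g / n_true_gal if n_true_gal > 0 else float("nan")
    contamination = m_s / n_class_gal if n_class_gal > 0 else float("nan")
    purity = 1.0 - contamination
    return completeness, contamination, purity
```

For example, with 4 true galaxies of which 3 are recovered, and 1 star leaking into the galaxy sample, the function gives c^g = 0.75, f_s = 0.25 and p^g = 0.75.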
Below are three different plots we suggest using to assess the performance of each classifier.
Example, on simulations, from arXiv:1306.5236
h3. purity as a function of magnitude (for fixed completeness, the threshold/cut is left free)
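
A minimal sketch of how the points of such a plot could be computed, assuming each classifier outputs a galaxy probability per object. In each magnitude bin the threshold is chosen so that the requested fraction of true galaxies is kept, and the purity is then measured at that threshold. The function name, argument names and the target completeness value are hypothetical choices for illustration, not values fixed by the challenge.

```python
import numpy as np

def purity_at_fixed_completeness(p_gal, is_galaxy, mag, mag_edges,
                                 target_completeness):
    """Galaxy purity per magnitude bin at a fixed galaxy completeness.

    p_gal               : galaxy probability (or any score) per object
    is_galaxy           : truth flag per object (True = galaxy)
    mag                 : magnitude per object
    mag_edges           : bin edges in magnitude
    target_completeness : fraction of true galaxies to keep in each bin
    All names are illustrative assumptions.
    """
    p_gal = np.asarray(p_gal, dtype=float)
    is_galaxy = np.asarray(is_galaxy, dtype=bool)
    mag = np.asarray(mag, dtype=float)

    purities = []
    for lo, hi in zip(mag_edges[:-1], mag_edges[1:]):
        in_bin = (mag >= lo) & (mag < hi)
        gal_scores = p_gal[in_bin & is_galaxy]
        if gal_scores.size == 0:
            purities.append(np.nan)  # no true galaxies in this bin
            continue
        # Threshold left free: set it so that roughly the target
        # fraction of true galaxies has a score at or above it.
        threshold = np.quantile(gal_scores, 1.0 - target_completeness)
        selected = in_bin & (p_gal >= threshold)
        purities.append(np.sum(selected & is_galaxy) / np.sum(selected))
    return np.array(purities)
```

The complementary plot (completeness at fixed purity) would follow the same pattern, scanning the threshold in each bin until the requested purity is reached.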
h3. completeness as a function of magnitude (for fixed purity, the threshold/cut is left free)
h2. Receiver operating characteristics
The receiver operating characteristic (ROC) provides another tool for evaluating the performance of classifiers. The ROC provides some information orthogonal to that in the completeness vs purity plots:
* Because ROCs compare the true positive rate to the false positive rate, they do not depend on relative composition of the test sample. Thus, unlike the purity, they contain information only about the intrinsic performance of the classifier and not the test sample.
* ROCs allow classifiers to be compared without requiring a threshold/cut to be placed on the output. This is useful because different projects possess different requirements on object sample, completeness, purity, etc. The area under the ROC can serve as a very high-level scalar metric for classifier performance.
* Once a threshold/cut is placed, we can generate magnitude-dependent true positive vs false positive rate plots, which would be intrinsic to the classifiers.
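
As an illustration of the points above, the following sketch builds a ROC curve and its area from a list of galaxy probabilities and truth flags, treating "galaxy" as the positive class. It uses plain NumPy, and the function names are hypothetical; existing library routines could equally be used.

```python
import numpy as np

def roc_curve(p_gal, is_galaxy):
    """Return (fpr, tpr) obtained by sweeping the threshold on p_gal.

    Positive class = galaxy. Names are illustrative assumptions.
    """
    p = np.asarray(p_gal, dtype=float)
    t = np.asarray(is_galaxy, dtype=bool)

    # Sort by decreasing score: lowering the threshold adds objects in order.
    order = np.argsort(-p, kind="mergesort")
    p, t = p[order], t[order]

    tp = np.cumsum(t)     # true galaxies selected so far
    fp = np.cumsum(~t)    # stars selected so far

    # Objects with identical scores enter together: keep one point per score.
    last = np.r_[np.nonzero(np.diff(p))[0], p.size - 1]

    tpr = np.r_[0.0, tp[last] / t.sum()]
    fpr = np.r_[0.0, fp[last] / (~t).sum()]
    return fpr, tpr

def roc_area(fpr, tpr):
    """Area under the ROC curve using the trapezoidal rule."""
    return float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0))
```

Because the true and false positive rates are each normalised by their own class, replicating all the stars in a test sample leaves the curve and its area unchanged, whereas the purity at a given threshold would drop.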
h2. Summary of telecons
h3. July 10th 2014
What we have found:
* 5 codes have been run on SVA1, based on round 2 training: 2 flavors of BDT, 2 flavors of Random Forests, TPZ.
* Machine learning methods seem more uncertain in assigning a class in SVA1 as a whole than in COSMOS (the training set is 90% COSMOS). TPZ is slightly less affected. Sample variance, the extra depth of COSMOS, or especially good conditions in COSMOS could be playing a role in this.
* The star number count distribution as a function of magnitude is very irregular for BDT and RF, and does not look like that of the training set or of other datasets. To some extent, TPZ is more robust. Modest keeps the shape.
* The galaxy number count distribution as a function of magnitude does not seem to be affected as much (the statistical impact is smaller, though, even if some contamination is present).
* Eventually use multi-epoch spread_model information, as Erin is producing for WL.
Way forward: work towards round 3.
* Alex to identify physically motivated, robust parameters to use ([[des-sci-verification:Variables_for_SG_Separation|progress reported here]]). More contributions are welcome.
* Chris will create new catalog with chisq of fit to templates including star templates.
* Nacho will work towards creating the new training/test set using Eli's shallow coadds (currently testing and comparing them) and the new spectroscopic catalogs from Chris. Nacho: possibly incorporate systematics map info (for plotting the conditions of the training set; maybe too noisy to train on as well).
* Alex, Chris, Edward to eventually train on new round 3 training set and run on SVA1 Gold (maybe separate by seeing conditions, one that matches the training set?).
h3. December 18th 2014
- Round 3 submitted. SVA1 analyzed for modest, weighted average, TPC. Not for BDT, Chris's codes.
- Tested TPC on stripe 82 with round 3 training vs modest, spread_model, weighted average spread_model.
- Tested on correlation functions in SVA1-SPTE area.
- Pending issues:
* Doing better than spread_model in stripe 82.
* Dealing with LMC stars.
* Probabilistic output.
Where to go from here:
- Final round? Which tests/calibration?
- Color representativeness
- Settle on procedure to automate for forthcoming years.
- If results good --> paper later next year.