Edward Kim, 01/15/2014 01:26 AM

h1. Trees for Photo-Z

h2. Introduction

Trees for Photo-Z (TPZ) is a supervised, parallel machine learning algorithm that uses prediction trees and random forest techniques to produce both robust photometric redshift PDFs and ancillary information for a galaxy sample. A prediction tree is built by asking a sequence of questions that recursively split the input data, taken from the spectroscopic sample, frequently into two branches, until a terminal leaf is created that meets a stopping criterion (e.g., a minimum leaf size or a variance threshold). The dimension along which the data are divided is the one with the highest information gain among a random subsample of dimensions drawn at every split. This randomization produces less correlated trees and allows several configurations within the data to be explored. The small region bounding the data in a terminal leaf node represents a specific subsample of the entire data with similar properties. Within this leaf, a model is applied that provides a fairly comprehensible prediction, especially in situations where many variables interact in a nonlinear manner, as is often the case with photo-z estimation.
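
The split selection described above can be sketched as follows. This is a minimal illustration and not the TPZ source code: it assumes class labels (as in the star/galaxy test on this page), Shannon entropy as the impurity measure, and midpoints between sorted values as candidate thresholds. The function names @entropy@, @information_gain@, and @best_split@ are hypothetical.

```python
import math
import random
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(rows, labels, dim, threshold):
    """Entropy reduction obtained by splitting on rows[i][dim] < threshold."""
    left = [y for x, y in zip(rows, labels) if x[dim] < threshold]
    right = [y for x, y in zip(rows, labels) if x[dim] >= threshold]
    if not left or not right:
        return 0.0  # a split that leaves one branch empty gains nothing
    n = len(labels)
    children = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(labels) - children


def best_split(rows, labels, n_try, rng):
    """Return (gain, dim, threshold) for the best split among n_try random dimensions."""
    dims = rng.sample(range(len(rows[0])), n_try)  # random subsample of dimensions
    best = (0.0, None, None)
    for dim in dims:
        values = sorted({x[dim] for x in rows})
        # Candidate thresholds: midpoints between consecutive distinct values.
        for lo, hi in zip(values, values[1:]):
            threshold = (lo + hi) / 2
            gain = information_gain(rows, labels, dim, threshold)
            if gain > best[0]:
                best = (gain, dim, threshold)
    return best


# Toy data (made up for illustration): two attributes per source.
rows = [(18.0, 0.1), (19.0, 0.9), (21.0, 0.2), (22.0, 0.8)]
labels = ["star", "star", "galaxy", "galaxy"]
gain, dim, threshold = best_split(rows, labels, n_try=2, rng=random.Random(0))
print(gain, dim, threshold)
```

In this toy data the first attribute separates the two classes perfectly, so it is the one selected. Drawing a fresh random subset of dimensions at each split is what decorrelates the trees in a forest.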

TPZ is a supervised algorithm within Machine Learning for Photo-Z (MLZ), a machine learning software package that combines all of our recent photometric redshift algorithms and implementations. MLZ also includes an unsupervised method, SOMz, based on self-organizing maps and random atlas. For more information, refer to the Laboratory for Cosmological Data Mining website at the University of Illinois at Urbana-Champaign.

h2. Initial Test

h3. File

Download attachment:tpz_r1.tar.gz to see the result of running our TPZ code on the provided test_sgchallenge_r1.fits file. It is a compressed ASCII file listing RA, DEC, and a probabilistic star-galaxy separator value for each source, ranging from 0 (galaxy) to 1 (star).

h3. Details

In this initial test, we illustrate the capabilities of TPZ by using the following set of attributes:

* mag_model in g, r, i, z, y bands
* mag_psf in g, r, i, z, y bands

and their respective errors. For training, we require that mag_model and mag_psf be less than 99. We build a total of 500 trees: 10 random realizations of 4 random attributes, each with 50 trees. The 500 trees vote to produce a probabilistic classification. For example, if 480 trees vote galaxy and the remaining 20 vote star, the source is classified as a galaxy with 96% probability.
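
The voting arithmetic above can be reproduced in a few lines. This is only a sketch of the vote counting, not the TPZ implementation; @class_probability@ is a hypothetical helper name.

```python
def class_probability(votes, label):
    """Fraction of trees in the forest that voted for `label`."""
    return sum(1 for v in votes if v == label) / len(votes)


# Forest layout from the text: 10 realizations, each with 50 trees.
n_realizations = 10
trees_per_realization = 50
n_trees = n_realizations * trees_per_realization

# Example from the text: 480 trees vote galaxy, the remaining 20 vote star.
votes = ["galaxy"] * 480 + ["star"] * (n_trees - 480)

p_galaxy = class_probability(votes, "galaxy")
p_star = class_probability(votes, "star")
print(n_trees, p_galaxy, p_star)
```

With the 0 (galaxy) to 1 (star) convention used in the attached file, this source would have a separator value of 0.04.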

In the figure below, we show the completeness for the sources classified at a probability of 96% or greater, as a function of i-band magnitude. The confusion matrix is displayed in the lower right corner of the figure.
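
As a rough illustration of how completeness per magnitude bin could be computed, the sketch below assumes the common definition: the fraction of true members of a class in each bin whose class probability meets the cut. The page does not spell out the definition used for the figure, so treat this as an assumption. The function name, bin edges, and toy data are all hypothetical.

```python
def completeness_by_bin(mags, is_true_class, probs, edges, cut=0.96):
    """Completeness in each magnitude bin [edges[k], edges[k+1]).

    Returns, per bin, the fraction of true class members with prob >= cut,
    or None when the bin contains no true members.
    """
    result = []
    for lo, hi in zip(edges, edges[1:]):
        in_bin = [
            p
            for m, t, p in zip(mags, is_true_class, probs)
            if t and lo <= m < hi
        ]
        if in_bin:
            result.append(sum(1 for p in in_bin if p >= cut) / len(in_bin))
        else:
            result.append(None)
    return result


# Toy data (made up): i-band magnitude, true class flag, class probability.
mags = [18.2, 18.7, 19.1, 19.5, 19.9, 18.5]
is_galaxy = [True, True, True, True, True, False]
p_galaxy = [0.99, 0.97, 0.98, 0.50, 0.96, 0.99]

print(completeness_by_bin(mags, is_galaxy, p_galaxy, edges=[18, 19, 20, 21]))
```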

We also plot the Receiver Operating Characteristic (ROC) curve. The area under the curve (AUC) is 0.93.
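
For readers unfamiliar with the AUC, the sketch below computes it with the rank (Mann-Whitney) formulation: the probability that a randomly chosen positive source scores higher than a randomly chosen negative one, with ties counted as one half. This is equivalent to the area under the ROC curve, but it is not necessarily how the value quoted above was obtained. The function name and toy data are hypothetical.

```python
def roc_auc(scores, is_positive):
    """Area under the ROC curve via pairwise comparison of scores."""
    pos = [s for s, t in zip(scores, is_positive) if t]
    neg = [s for s, t in zip(scores, is_positive) if not t]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    # Count positive/negative pairs ranked correctly; ties count as half.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data (made up): classifier scores and true labels.
scores = [0.9, 0.8, 0.4, 0.7, 0.2]
truth = [True, True, True, False, False]
print(roc_auc(scores, truth))
```

The pairwise loop is quadratic in the sample size, which is fine for illustration. A sort-based computation is preferable for large catalogs.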

h2. References

Carrasco Kind, M., & Brunner, R. J., 2013, “TPZ: Photometric redshift PDFs and ancillary information by using prediction trees and random forests”, MNRAS, 432, 1483

Carrasco Kind, M., & Brunner, R. J., 2014, “SOMz: photometric redshift PDFs with self organizing maps and random atlas”, MNRAS, in press