Profile

Machine Learning for Photo-Z

Machine Learning for photo-Z (MLZ) is a parallel Python framework that computes fast and robust photometric redshift probability density functions (PDFs) using machine learning algorithms. In particular, it uses a supervised technique with prediction trees and random forests through Trees for Photo-Z (TPZ), or an unsupervised method with self-organizing maps and random atlas through Self-Organizing Maps for Photo-Z (SOMz). It can be easily extended to other regression or classification problems. For more information, refer to the Laboratory for Cosmological Data Mining website (lcdm.astro.illinois.edu) at the University of Illinois at Urbana-Champaign.

MLZ is a parallel, Python-based code that can be easily integrated into existing Python-based projects. While other techniques provide only a single estimate, our approach provides a full probability distribution for each source. This probability can be used in subsequent analyses to improve a particular cosmological measurement.

Round 1

TPZ

Trees for Photo-Z (TPZ) is a supervised, parallel machine learning algorithm that uses prediction trees and random forest techniques to produce both robust photometric redshift PDFs and ancillary information for a galaxy sample. A prediction tree is built by asking a sequence of questions that recursively split the input data taken from the spectroscopic sample, typically into two branches, until a terminal leaf is created that meets a stopping criterion (e.g., a minimum leaf size or a variance threshold). The dimension along which the data are divided is chosen to be the one with the highest information gain among a random subsample of dimensions drawn at each split. This process produces less correlated trees and allows us to explore several configurations within the data. The small region bounding the data in the terminal leaf node represents a specific subsample of the entire data with similar properties. Within this leaf, a model is applied that provides a fairly comprehensible prediction, especially in situations where many variables may interact in a nonlinear manner, as is often the case with photo-z estimation.
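
To make the idea concrete, here is a minimal sketch of a random forest of prediction trees using scikit-learn as a stand-in; the class, parameters, and toy data below are illustrative assumptions and not MLZ's actual API.

    # Minimal sketch of the random-forest idea behind TPZ, using scikit-learn
    # as a stand-in; TPZ's own trees, splitting criteria, and API differ.
    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    rng = np.random.default_rng(0)

    # Toy "spectroscopic" training sample: columns play the role of magnitudes.
    X_train = rng.normal(size=(1000, 4))
    y_train = (X_train[:, 0] + 0.5 * X_train[:, 1] > 0).astype(int)  # 1 = galaxy, 0 = star

    # Each tree sees a bootstrap realization of the data and a random subset
    # of attributes at every split (max_features), which decorrelates the trees.
    forest = RandomForestClassifier(
        n_estimators=500,      # number of prediction trees
        max_features=2,        # random attributes considered per split
        min_samples_leaf=5,    # stopping criterion: minimum leaf size
        random_state=0,
    )
    forest.fit(X_train, y_train)

    # The fraction of trees voting "galaxy" acts as a probabilistic classifier.
    X_test = rng.normal(size=(5, 4))
    p_galaxy = forest.predict_proba(X_test)[:, 1]
    print(p_galaxy)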

File

Download tpz_r1.tar.gz to see the result of running our TPZ code on the provided test_sgchallenge_r1.fits file. It is a compressed ASCII file with RA, DEC, and probabilistic separator values ranging from 0 (galaxies) to 1 (stars).
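
As a convenience, the snippet below sketches one way to read such an output; the member layout inside the tarball and the column order (RA, DEC, separator) are assumptions.

    # Sketch of reading the kind of output described above (RA, DEC, and a
    # probabilistic separator per source); the file name inside the tarball
    # and the column order are assumptions.
    import tarfile
    import numpy as np

    with tarfile.open("tpz_r1.tar.gz", "r:gz") as tar:
        member = tar.getmembers()[0]          # assume a single ASCII table inside
        data = np.loadtxt(tar.extractfile(member))

    ra, dec, p_sep = data[:, 0], data[:, 1], data[:, 2]
    print(ra[:5], dec[:5], p_sep[:5])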

Details

In this initial test, we illustrate the capabilities of TPZ by using the following set of attributes:

  • mag_model in g, r, i, z, y bands
  • mag_psf in g, r, i, z, y bands

and their respective errors. For training, we require that mag_model and mag_psf be less than 99. We build a total of 500 trees by using 10 random realizations of 4 random attributes, each with 50 trees. The 500 trees vote to create a probabilistic classification: if 480 trees vote galaxy and the remaining 20 vote star, we have a galaxy at 96% probability.
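
The snippet below simply mirrors that voting arithmetic with a synthetic vote array.

    # Sketch of how tree votes become a probability: with 500 trees, 480
    # "galaxy" votes give P(galaxy) = 0.96. Numbers mirror the text; the
    # vote array here is synthetic.
    import numpy as np

    n_realizations, n_trees_each = 10, 50          # 10 x 50 = 500 trees
    votes = np.zeros(n_realizations * n_trees_each, dtype=int)
    votes[:480] = 1                                # 1 = galaxy vote, 0 = star vote

    p_galaxy = votes.mean()
    print(f"P(galaxy) = {p_galaxy:.2f}")           # 0.96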

At a 50% probability cut, 66,361 sources are classified as galaxies out of 68,700 sources, where the completeness (galaxy classified as galaxy) is 99.9% and the purity (1 minus contamination) is 95.5%. In the figure below, we show the completeness as a function of magnitude in i band at 50% probability cut.
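
The completeness and purity quoted here can be computed from the per-source probabilities and true labels as sketched below; the arrays in the example are synthetic stand-ins.

    # Sketch of the completeness and purity definitions used above, computed
    # from per-source galaxy probabilities and true labels (both synthetic here).
    import numpy as np

    def completeness_purity(p_galaxy, is_galaxy, cut=0.5):
        selected = p_galaxy >= cut                       # classified as galaxy
        true_pos = np.sum(selected & is_galaxy)
        completeness = true_pos / np.sum(is_galaxy)      # galaxies recovered
        purity = true_pos / np.sum(selected)             # 1 - contamination
        return completeness, purity

    rng = np.random.default_rng(1)
    is_galaxy = rng.random(68700) < 0.95
    p_galaxy = np.where(is_galaxy, rng.beta(8, 1, 68700), rng.beta(1, 8, 68700))
    print(completeness_purity(p_galaxy, is_galaxy, cut=0.5))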

At a 96% probability cut, we retain 63,807 galaxies, with 97.9% completeness and 97.3% purity. The figure below shows the completeness as a function of magnitude at 96% probability cut.

We also plot the Receiver Operating Characteristic (ROC) curve, i.e., the True Positive Rate (galaxy classified as galaxy) vs. the False Positive Rate (star classified as galaxy). The area under the curve is 0.93.
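
A minimal sketch of how such a ROC curve and its area can be computed with scikit-learn is shown below; the probabilities and labels are synthetic stand-ins for a real TPZ run.

    # Sketch of the ROC curve and its area using scikit-learn; p_galaxy and
    # is_galaxy stand in for the outputs of an actual classification run.
    import numpy as np
    from sklearn.metrics import roc_curve, roc_auc_score

    rng = np.random.default_rng(2)
    is_galaxy = rng.random(68700) < 0.95
    p_galaxy = np.where(is_galaxy, rng.beta(8, 1, 68700), rng.beta(1, 8, 68700))

    fpr, tpr, thresholds = roc_curve(is_galaxy, p_galaxy)   # TPR vs. FPR
    print("area under the ROC curve:", roc_auc_score(is_galaxy, p_galaxy))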

SOMz

Self-Organizing Maps for Photo-Z (SOMz) is an unsupervised machine learning technique that also computes photometric redshift PDFs. Specifically, we have developed a new framework, which we have named random atlas, that mimics the random forest approach by replacing the prediction trees with self-organizing maps (SOMs). A SOM is essentially a neural network that maps a large training set, via a process of competitive learning, from a high-dimensional input space to a two-dimensional surface. The mapping process retains the topology of the input data, thereby revealing potential unknown correlations between input parameters, which can provide important insights into the data.

This is an unsupervised learning method, as no prediction attributes are included in the mapping process; only the non-prediction attributes are used. The output values from the training data are used only after the map has been constructed, when they generate the prediction model for each cell in the map. In our implementation, we first construct a suite of maps, each using a random subset of the full attributes and the randomized training data we developed for the random forest, and we then aggregate the map predictions to make our final prediction (via the random atlas).
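
For illustration, the following is a minimal NumPy sketch of SOM-style competitive learning on a small rectangular grid; SOMz's actual implementation, topologies, and training schedule differ.

    # Minimal self-organizing-map sketch in NumPy, illustrating competitive
    # learning on a small rectangular grid.
    import numpy as np

    rng = np.random.default_rng(0)
    n_features, grid = 4, (10, 10)                     # small 10x10 map
    X = rng.normal(size=(2000, n_features))            # non-prediction attributes only

    # Cell weights and their 2-D grid coordinates.
    weights = rng.normal(size=(grid[0] * grid[1], n_features))
    coords = np.indices(grid).reshape(2, -1).T.astype(float)

    n_iter, sigma0, lr0 = 5000, 3.0, 0.5
    for t in range(n_iter):
        x = X[rng.integers(len(X))]
        bmu = np.argmin(np.sum((weights - x) ** 2, axis=1))    # best-matching unit
        frac = t / n_iter
        sigma = sigma0 * (1 - frac) + 0.5 * frac               # shrinking neighborhood
        lr = lr0 * (1 - frac) + 0.01 * frac                    # decaying learning rate
        # Neighborhood function: cells near the BMU on the grid move toward x.
        d2 = np.sum((coords - coords[bmu]) ** 2, axis=1)
        h = np.exp(-d2 / (2 * sigma ** 2))
        weights += lr * h[:, None] * (x - weights)

    # After training, each source maps to a cell; the cell's training members
    # would supply the prediction model (e.g., a class fraction or redshift).
    cells = np.argmin(((X[:, None, :] - weights[None, :, :]) ** 2).sum(-1), axis=1)
    print(np.bincount(cells, minlength=grid[0] * grid[1])[:10])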

File

Download somz_r1.tar.gz to see the result of running our SOMz code on the provided test_sgchallenge_r1.fits file.

Details

In this initial test, we illustrate the capabilities of SOMz by using the following set of attributes:

  • mag_model in g, r, i, z, y bands
  • mag_psf in g, r, i, z, y bands

and their respective errors. For training, we require that mag_model and mag_psf be less than 99. We build a total of 1,000 maps by using 10 random realizations of 6 random attributes, each with 100 maps. Each map consists of a rectangular grid with 2,500 cells (50 by 50) with non-periodic boundary conditions. The 1,000 maps vote to create a probabilistic classification: if 960 maps vote galaxy and the remaining 40 vote star, we have a galaxy at 96% probability.
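
The sketch below illustrates how a single 50-by-50 map could contribute its vote, with synthetic cell weights and per-cell galaxy fractions standing in for a trained SOMz map.

    # Sketch of how one 50x50 map contributes a vote: a source is assigned to
    # its best-matching cell, and the galaxy fraction of the training objects
    # in that cell is the map's prediction; averaging over all 1,000 maps
    # gives the final probability. Everything below is synthetic.
    import numpy as np

    rng = np.random.default_rng(3)
    n_cells, n_features = 50 * 50, 6               # 50x50 grid, 6 random attributes
    weights = rng.normal(size=(n_cells, n_features))        # trained cell weights
    cell_galaxy_frac = rng.random(n_cells)                  # per-cell training fractions

    def map_vote(x):
        bmu = np.argmin(np.sum((weights - x) ** 2, axis=1)) # best-matching cell
        return cell_galaxy_frac[bmu]                         # this map's vote

    x = rng.normal(size=n_features)
    print("single-map vote:", map_vote(x))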

At a 50% probability cut, 67,046 sources are classified as galaxies out of 68,700 sources, where the completeness (galaxy classified as galaxy) is 99.9% and the purity (1 minus contamination) is 94.5%. In the figure below, we show the completeness as a function of magnitude in i band at 50% probability cut.

The figure below shows the ROC curve (TPR vs. FPR) for SOMz, with the TPZ ROC curve shown for comparison. For SOMz, the area under the curve is 0.894.

Round 2

TPZ

In Round 2, we were provided with a training set (60%), a validation set (20%), and a "blind" testing set (20%). The figure below illustrates the distribution of stars and galaxies in the training set in different magnitude bins. We expect a similar distribution of sources in the blind testing set.

In addition, we were provided with three additional data sets corresponding to different magnitude ranges (mag_auto < 21, 21 < mag_auto < 23, 23 < mag_auto < 24). Although it would be possible to train separate prediction trees for each magnitude bin, we chose to build a global forest using the training set that covers the entire magnitude range. Thus, we used only the round2_training_set.fits file to build a total of 1,000 trees by using 10 random realizations of 6 random attributes, each with 100 trees. We did not use the round2_test_blind_set_auto_i_*.fits files for training.
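
A rough sketch of this setup, using astropy to read the training catalog and scikit-learn as a stand-in for TPZ, is given below; the label and magnitude column names are assumptions about the FITS layout.

    # Sketch of building one global forest from the Round 2 training file;
    # astropy and scikit-learn stand in for MLZ/TPZ, and the column names
    # (class label and magnitudes) are assumptions about the FITS layout.
    import numpy as np
    from astropy.table import Table
    from sklearn.ensemble import RandomForestClassifier

    train = Table.read("round2_training_set.fits")

    bands = ["g", "r", "i", "z", "y"]
    cols = ([f"mag_model_{b}" for b in bands] + [f"mag_psf_{b}" for b in bands]
            + [f"mag_auto_{b}" for b in bands] + [f"spread_model_{b}" for b in bands])

    X = np.column_stack([np.asarray(train[c]) for c in cols])
    y = np.asarray(train["is_galaxy"])             # hypothetical label column

    forest = RandomForestClassifier(n_estimators=1000, max_features=6,
                                    random_state=0)
    forest.fit(X, y)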

In Round 2, we used the following attributes to train the trees:

  • mag_model in g, r, i, z, y bands
  • mag_psf in g, r, i, z, y bands
  • mag_auto in g, r, i, z, y bands
  • spread_model in g, r, i, z, y bands

We also included colors derived from the above magnitude attributes, as well as mag_psf - mag_model. Objects with questionable colors, such as (g-r) < -1, (g-r) > 4, (i-z) < -1, or (i-z) > 4, were not used in training.
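
The snippet below sketches how such color attributes and quality cuts might be applied; the column names are assumptions about the catalog layout.

    # Sketch of the extra color attributes and the quality cuts described
    # above; column names are assumptions about the catalog layout.
    import numpy as np
    from astropy.table import Table

    train = Table.read("round2_training_set.fits")

    g_r = np.asarray(train["mag_model_g"]) - np.asarray(train["mag_model_r"])
    i_z = np.asarray(train["mag_model_i"]) - np.asarray(train["mag_model_z"])
    psf_minus_model = np.asarray(train["mag_psf_i"]) - np.asarray(train["mag_model_i"])

    # Drop objects with questionable colors before training.
    good = (g_r > -1) & (g_r < 4) & (i_z > -1) & (i_z < 4)
    train = train[good]
    print(len(train), "objects retained for training")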

After training, we used the provided validation set to evaluate the performance of TPZ. The next figure shows the ROC curve of TPZ. For comparison, we also show the ROC curve of the "modest" classifier, which is simply a cut on spread_model + 3 * spreaderr_model.
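
A minimal sketch of the modest score is shown below; the validation file name, column names, band choice, and threshold value are assumptions.

    # Sketch of the "modest" reference classifier used for comparison: a cut
    # on spread_model + 3 * spreaderr_model (here in the i band, assumed).
    import numpy as np
    from astropy.table import Table

    valid = Table.read("round2_valid_set.fits")     # hypothetical validation file name
    score = (np.asarray(valid["spread_model_i"])
             + 3.0 * np.asarray(valid["spreaderr_model_i"]))

    # Larger score means more extended, i.e., more galaxy-like; sweeping a
    # threshold over this score traces out the modest ROC curve alongside TPZ's.
    is_galaxy_modest = score > 0.003                # threshold value assumed here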

For the validation set, the area under the ROC curve and completeness/purity are:

Classifier   ROC area   Completeness at 95% purity   Completeness at 99% purity   Purity at 95% completeness
Modest       0.892      99.3%                        64.7%                        97.9%
TPZ          0.959      99.9%                        90.8%                        98.7%
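
One way to derive numbers like "completeness at 95% purity" is to sweep the probability threshold, as sketched below with synthetic inputs.

    # Sketch of "completeness at fixed purity": sweep the probability cut and
    # record the best completeness among cuts that meet the purity target.
    import numpy as np

    def completeness_at_purity(p_galaxy, is_galaxy, purity_target=0.95):
        best = 0.0
        for cut in np.linspace(0, 1, 1001):
            sel = p_galaxy >= cut
            if sel.sum() == 0:
                continue
            purity = np.sum(sel & is_galaxy) / sel.sum()
            if purity >= purity_target:
                completeness = np.sum(sel & is_galaxy) / np.sum(is_galaxy)
                best = max(best, completeness)
        return best

    rng = np.random.default_rng(4)
    is_galaxy = rng.random(20000) < 0.9
    p_galaxy = np.where(is_galaxy, rng.beta(8, 1, 20000), rng.beta(1, 8, 20000))
    print(completeness_at_purity(p_galaxy, is_galaxy, 0.95))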

TPZ produces the probability that an object is either a star or a galaxy. One of the great advantages of a probabilistic classification is that one can adjust the threshold to obtain more or less complete or pure samples, depending on one's science goal. The next two figures show the completeness of galaxies (left, in green) and stars (right, in red), where we chose a threshold of P(galaxy) > 0.90. The thin black line indicates the sample fraction of stars/galaxies. At this threshold, TPZ clearly outperforms the "modest" classifier across all magnitude ranges.
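
The magnitude-binned completeness in these figures can be computed as sketched below; all arrays in the example are synthetic placeholders.

    # Sketch of magnitude-binned completeness at a fixed P(galaxy) > 0.90 cut:
    # for each i-band magnitude bin, the fraction of true galaxies recovered.
    import numpy as np

    rng = np.random.default_rng(5)
    mag_i = rng.uniform(18, 24, 20000)
    is_galaxy = rng.random(20000) < 0.9
    p_galaxy = np.where(is_galaxy, rng.beta(8, 1, 20000), rng.beta(1, 8, 20000))

    selected = p_galaxy > 0.90
    bins = np.arange(18, 24.5, 0.5)
    for lo, hi in zip(bins[:-1], bins[1:]):
        in_bin = (mag_i >= lo) & (mag_i < hi) & is_galaxy
        if in_bin.sum():
            comp = np.sum(in_bin & selected) / in_bin.sum()
            print(f"{lo:.1f}-{hi:.1f}: completeness = {comp:.2f}")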

Similar to the figures above, we also show the purity of stars/galaxies below.

References

Carrasco Kind, M., & Brunner, R. J., 2013, “TPZ: Photometric redshift PDFs and ancillary information by using prediction trees and random forests”, MNRAS, 432, 1483
Carrasco Kind, M., & Brunner, R. J., 2014, “SOMz: photometric redshift PDFs with self-organizing maps and random atlas”, MNRAS, in press
Carrasco Kind, M., & Brunner, R. J., 2013, “Implementing Probabilistic Photometric Redshifts”, in Astronomical Society of the Pacific Conference Series, Vol. 475, Astronomical Data Analysis Software and Systems XXII (ADASS XXII), Friedel, D. N., ed., p. 69