Variables for SG Separation¶
Contact: Alex Drlica-Wagner <email@example.com>
We would like to identify a set of simple, robust variable combinations for star-galaxy separation. These variables can either be passed into multivariate classification algorithms, or they can be used to cross-check multivariate output. While multivariate classifiers should be able to build powerful variable combinations from a large set of raw inputs, injecting some physical intuition doesn't seem like a bad thing.
Some ideas for variable combinations:
- Spread Model -- This is the basis for the modest classifier using spread_model + X*spreaderr_model with X = 3. However, there is a noticeable magnitude dependence of the spread model distribution for mag_auto_i > 22. We may play with adjusting X or the star selection cut as a function of magnitude. Another interesting idea might be to take the "best" spread_model from across the bands.
- Class Star -- Modest class uses it at bright (<18) magnitudes and performs well there.
- Surface Brightness -- Compare the object magnitude to its peak surface brightness (mag_auto/mu_max). This type of variable combination was used for the COSMOS ACS catalog.
- Magnitude Model Dependence -- This compares the object's magnitude when fit with an extended source model and a PSF model (mag_model - mag_psf). This type of combination was used by SDSS for star-galaxy classification. Nacho: as Eli commented, in principle this is a noisier version of spread_model. I also ran some tests in simulations and it somewhat underperformed, but it is easy to check. Note that SDSS uses cmodel, which is equivalent to DETMODEL rather than MODEL from DESDM.
- Stellar Locus Distance -- The idea here is to calculate the color-color distance between objects and the stellar locus. This is a dangerous variable to include if searching for strange stars.
- PSF Scaled Flux Radius -- Flux radius has been seen to be a powerful variable but is very PSF dependent. Could we scale flux_radius by measured PSF?
- Chi-square to templates and template number -- Goodness-of-fit to LePhare galaxy and star templates, as provided by Christopher Bonnett and Carles Sanchez. NOT IN SVA1 GOLD YET
- Seeing from systematics maps. Maybe some sort of combination of different bands.
One could also think of adding weight to colors at faint magnitudes, or not using shape information at all at mag>22.
Spread model is "a normalized simplified linear discriminant between the best-fitting local PSF model and a slightly more extended model made from the same PSF convolved with a circular exponential disk model with scale length = FWHM/16" (Desai et al., 2012). When it performs properly, the spread_model distribution has a stellar locus centered around a value of zero. Since spread_model incorporates the local determination of the PSF, it should be more robust against seeing variations across the field. However, as has been noted elsewhere (e.g., Nacho, Eli, etc.), the spread_model stellar locus broadens significantly at fainter magnitudes.
The robustness of the MODEST classifier has shown that, to a large extent, the spread_model distribution is not too dissimilar between the COSMOS training sample and the SVA1 field. Thus, we use spread_model to begin our investigation of reliable classification variables. We plot the spread_model distribution as a function of magnitude in the 4 primary DES bands. We have randomly downsampled the SVA1 catalog to contain the same number of objects (~2e5) as the training set (referred to as COSMOS+).
The first obvious and well known feature from the above plots is that the COSMOS observations are significantly deeper than SVA1. To correct for this, we remake the plots clipping the distributions at mag_auto < 24 (looking at the systematics maps from Boris, the magnitude limit of SPTE is between 24 and 25 depending on band). This cut at mag_auto < 24 should not be necessary once Eli's survey depth coadds are being used for training. However, for the time being some type of cut like this should probably be considered.
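The two sample-matching steps described above (a mag_auto < 24 clip and a random downsample to a common size) can be sketched as follows. This is only a minimal illustration, assuming the catalog column is already loaded as a NumPy array; the function name `match_depth_and_size`, the fixed seed, and the toy magnitudes are hypothetical and not taken from the actual analysis.

```python
import numpy as np

def match_depth_and_size(mag_auto, n_target, mag_limit=24.0, seed=0):
    """Return sorted indices of a random subsample brighter than mag_limit.

    mag_limit=24 follows the clip quoted in the text; n_target would be the
    size of the comparison sample (about 2e5 for COSMOS+).
    """
    mag_auto = np.asarray(mag_auto, dtype=float)
    # Keep objects brighter than the limit (NaN compares False, so it is dropped).
    keep = np.flatnonzero(mag_auto < mag_limit)
    rng = np.random.default_rng(seed)
    # Never ask for more objects than survive the magnitude clip.
    n = min(n_target, keep.size)
    return np.sort(rng.choice(keep, size=n, replace=False))

# Toy magnitudes, purely illustrative.
mag = np.array([16.5, 21.0, 23.9, 24.0, 25.2, 22.3])
idx = match_depth_and_size(mag, n_target=3)
```

Applying the same `mag_limit` to both catalogs before downsampling keeps the comparison from being driven by the deeper COSMOS+ data.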
- Overall, the agreement between COSMOS+ and SVA1 looks very good.
- There is a dearth of bright (mag_auto < 17) stars in the COSMOS+ field compared to SVA1. This could be related to the increased depth of COSMOS+, which causes stars to saturate at fainter (numerically larger) magnitudes.
- The spread_model locus is slightly broader in the SVA1 sample. This may be due to incomplete correction for the worse PSF.
- The density of stellar objects is higher in SVA1. This is likely due to the LMC stellar population.
- The COSMOS+ stars have a slight upturn in spread_model_i at mag_auto_i ~ 17. This feature is not seen in SVA1.
We now investigate spread_model + 3*spreaderr_model, which gives a purer galaxy sample at the expense of a less complete faint stellar sample.
This variable also appears fairly well behaved near the stellar locus; however, there are larger disagreements around the upper edge of the galaxy clump. This discrepancy is probably not too important for the classifiers, since it is far from the stellar locus. However, it suggests that spreaderr_model may be more sensitive to the varying PSF between the two samples. The upturn in the spread_model + 3*spreaderr_model distribution begins around mag_auto = 23 and is most pronounced in the z-band.
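As a rough sketch of the combined variable discussed above, the snippet below computes spread_model + X*spreaderr_model with X = 3 and applies a single cut. The `threshold` default is an illustrative placeholder, not a value quoted on this page, and the function names and toy numbers are hypothetical.

```python
import numpy as np

def spread_combo(spread_model, spreaderr_model, x=3.0):
    """spread_model + x * spreaderr_model, with x = 3 as in the text."""
    return np.asarray(spread_model, float) + x * np.asarray(spreaderr_model, float)

def is_extended(spread_model, spreaderr_model, x=3.0, threshold=0.003):
    """Flag objects whose combined variable exceeds `threshold`.

    NOTE: threshold=0.003 is an assumed placeholder for illustration only.
    """
    return spread_combo(spread_model, spreaderr_model, x) > threshold

# Toy values: a bright star, a clear galaxy, and a faint star with a large error.
spread = np.array([0.0005, 0.010, -0.001])
err = np.array([0.0002, 0.001, 0.002])
flags = is_extended(spread, err)
```

In this toy input the third object has a spread_model consistent with a point source, but its large spreaderr_model pushes it over the cut, which is the loss of faint stellar completeness mentioned above.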
More info on class_star... While the class_star distribution appears to be sensitive to the differences between SVA1 and COSMOS+, the differences are most obvious in the intermediate class_star values at fainter magnitudes (mag_auto > 21). The deeper and cleaner observations of COSMOS+ push these intermediate values back from mag_auto ~ 21.5 to mag_auto ~ 22.5. However, the shape of the class_star distribution around the stellar locus appears to behave quite similarly in the two samples. The stellar locus grows wider at large magnitudes without extreme contamination from the galaxy distribution; thus, a class_star cut may benefit from a magnitude-dependent correction. As a crude example, one could require class_star > 0.95 - 0.05*(mag_auto_i-20)*(mag_auto_i > 20), i.e. a linear relaxation of the cut for mag_auto_i greater than 20.
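The example cut above can be written out as a small function. This is just a sketch of that one formula; the coefficients 0.95, 0.05 and the pivot at 20 come from the example in the text, while the function names and toy inputs are hypothetical.

```python
import numpy as np

def class_star_threshold(mag_auto_i, base=0.95, slope=0.05, pivot=20.0):
    """Threshold 0.95 - 0.05*(mag_auto_i - 20) for mag_auto_i > 20, else 0.95."""
    mag = np.asarray(mag_auto_i, float)
    # clip() zeroes the correction for objects brighter than the pivot,
    # which is equivalent to multiplying by the boolean (mag_auto_i > 20).
    return base - slope * np.clip(mag - pivot, 0.0, None)

def passes_class_star(class_star, mag_auto_i):
    """True where class_star exceeds the magnitude-dependent threshold."""
    return np.asarray(class_star, float) > class_star_threshold(mag_auto_i)

# Toy inputs, purely illustrative.
mags = np.array([19.0, 20.0, 22.0])
cs = np.array([0.90, 0.96, 0.90])
thresholds = class_star_threshold(mags)
selected = passes_class_star(cs, mags)
```

With these inputs an object with class_star = 0.90 fails the cut at mag_auto_i = 19 but passes at mag_auto_i = 22, where the threshold has dropped to 0.85.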
The COSMOS ACS observations found that the maximum surface brightness, mu_max (mag/arcsec^2), provided a more reliable star/galaxy classification than class_star. Of course, Hubble doesn't need to worry about variable seeing conditions, and we expect mu_max to be heavily dependent on the PSF.
We can see that there is indeed a tight stellar locus in mu_max that extends more or less linearly over 17 < mag_auto < 22 in the COSMOS+ field. This locus is much broader in the SVA1 sample. This can be seen even more clearly when mu_max - mag_auto is plotted.
- The mu_max variable shows strong discrepancies between COSMOS+ and SVA1.
- Specifically, the SVA1 stellar locus is much broader and is systematically shifted to lower mu_max values (possible feature of depth?).
- It may be possible to correct mu_max with the measured FWHM. The first thought is to multiply by fwhm**2, but the success of this scaling remains to be seen.
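One possible reading of the fwhm**2 idea is sketched below. It assumes the scaling is applied to the peak flux rather than to mu_max itself, so that in magnitudes the correction becomes a subtraction of 5*log10(fwhm). That interpretation, the function names, the reference FWHM, and the idealized toy point sources are all assumptions, and, as noted above, whether such a scaling works on real data remains to be seen.

```python
import numpy as np

def concentration(mu_max, mag_auto):
    """The mu_max - mag_auto variable mentioned in the text."""
    return np.asarray(mu_max, float) - np.asarray(mag_auto, float)

def fwhm_corrected_concentration(mu_max, mag_auto, fwhm, fwhm_ref=1.0):
    """Assumed correction: scale the peak flux by (fwhm / fwhm_ref)**2.

    In magnitudes this subtracts 2.5*log10(fwhm**2) = 5*log10(fwhm).
    """
    fwhm = np.asarray(fwhm, float)
    return concentration(mu_max, mag_auto) - 5.0 * np.log10(fwhm / fwhm_ref)

# Idealized point sources: peak flux proportional to total flux / fwhm**2.
flux = np.array([1000.0, 1000.0, 50.0])
fwhm = np.array([0.9, 1.4, 1.4])
mag_auto = -2.5 * np.log10(flux)
mu_max = -2.5 * np.log10(flux / fwhm**2)

raw = concentration(mu_max, mag_auto)
corrected = fwhm_corrected_concentration(mu_max, mag_auto, fwhm)
```

For these idealized sources the uncorrected variable differs between the two seeing values, while the corrected one is the same for all three objects. Real stars will deviate from this because of pixelization, noise, and non-Gaussian PSF wings.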
Magnitude Model Dependence¶
SDSS used the difference between mag_psf and mag_model as a star-galaxy classifier (specifically psfMag - cmodelMag > 0.145). As has been noted before, this is similar in concept to the spread_model calculation, but is simpler and likely noisier. However, the simplicity of this metric may lead to better agreement between the COSMOS+ and SVA1 samples, so it may be worth investigating. Below we plot mag_psf - mag_detmodel (mag_detmodel being, as Nacho points out, the DESDM equivalent of SDSS cmodel).
It looks like mag_psf - mag_detmodel has some pretty serious issues in the COSMOS+ field. Under the suspicion that this could be an issue with detmodel, we also plot mag_psf - mag_model. This variable appears much better behaved.
- On one hand mag_psf - mag_detmodel has some serious issues in the COSMOS+ field.
- On the other hand, mag_psf - mag_model appears well behaved and agrees well between COSMOS+ and SVA1.
- In fact, the stellar locus for both COSMOS+ and SVA1 distributions appears to be quite tight up to mag_auto = 23.
- There are some minor discrepancies between the COSMOS+ and SVA1 distribution in the upper periphery of the galaxy clump and an upturn in the stellar locus in i-band (similar to spread_model).
- Overall, mag_psf - mag_model looks like a promising variable.
Nacho: maybe we could try this one out instead of spread_model, and see what happens.
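A minimal sketch of this variable is given below, using mag_psf - mag_model and borrowing the SDSS value of 0.145 as a default threshold (in SDSS, objects above that cut are classified as galaxies). Whether that threshold is appropriate for DES magnitudes is not established here. The sentinel handling (treating magnitudes of 90 or more as failed measurements) and the function names are assumptions for illustration.

```python
import numpy as np

def mag_difference(mag_psf, mag_model, sentinel=90.0):
    """mag_psf - mag_model, with NaN where either magnitude looks invalid.

    Treating values >= `sentinel` as bad measurements is an assumption
    about the catalog convention, not something stated in the text.
    """
    psf = np.asarray(mag_psf, float)
    model = np.asarray(mag_model, float)
    bad = (psf >= sentinel) | (model >= sentinel)
    return np.where(bad, np.nan, psf - model)

def sdss_like_extended(mag_psf, mag_model, threshold=0.145):
    """Flag objects with mag_psf - mag_model above the SDSS-style cut."""
    # NaN compares False, so invalid measurements are never flagged.
    return mag_difference(mag_psf, mag_model) > threshold

# Toy magnitudes: a point source, an extended source, a failed PSF fit.
mag_psf = np.array([20.00, 21.50, 99.0])
mag_model = np.array([19.98, 21.00, 22.0])
diff = mag_difference(mag_psf, mag_model)
extended = sdss_like_extended(mag_psf, mag_model)
```

Swapping mag_model for mag_detmodel in the call reproduces the other variant compared in this section.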
Stellar Locus Distance¶
The stellar locus is well defined in color-color space (usually r-i vs g-r and r-i vs i-z). The idea is to use the scalar distance to a roughly defined locus location as a S/G classifier. The difficulty here is that the stellar locus cuts through the heart of the galaxy distribution. Additionally, a classifier trained on such a variable would have a selection bias against stars lying off the main sequence.
Figures: Stellar Locus (g-r), Stellar Locus (i-z)
While this variable is probably not ideal as a classifier input, the color information can be used to cross-validate a classifier derived from spatial information.
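The scalar distance to a roughly defined locus could be computed as the minimum distance from each object to a piecewise-linear locus in a color-color plane. The sketch below does this for one plane; the locus vertices are made-up placeholder values, not a fit to any stellar population, and the function name is hypothetical.

```python
import numpy as np

def distance_to_locus(x, y, locus):
    """Minimum Euclidean distance from points (x, y) to a polyline.

    locus: sequence of (x, y) vertices, e.g. in the (g-r, r-i) plane.
    """
    pts = np.column_stack([np.asarray(x, float), np.asarray(y, float)])  # (N, 2)
    locus = np.asarray(locus, float)                                     # (M, 2)
    start = locus[:-1]                  # segment start points, (M-1, 2)
    seg = locus[1:] - start             # segment direction vectors, (M-1, 2)
    rel = pts[:, None, :] - start[None, :, :]                            # (N, M-1, 2)
    # Fractional position of the projection on each segment, clipped to its ends.
    t = np.clip((rel * seg).sum(-1) / (seg * seg).sum(-1), 0.0, 1.0)     # (N, M-1)
    nearest = start[None, :, :] + t[..., None] * seg[None, :, :]
    # Distance to the nearest point on each segment, minimized over segments.
    return np.linalg.norm(pts[:, None, :] - nearest, axis=-1).min(axis=1)

# Hypothetical locus vertices in (g-r, r-i); placeholder values only.
toy_locus = [(0.2, 0.05), (1.0, 0.40), (1.4, 1.20)]
g_r = np.array([0.2, 0.6, 1.5])
r_i = np.array([0.05, 0.60, 0.20])
dist = distance_to_locus(g_r, r_i, toy_locus)
```

An object sitting exactly on a vertex gets distance zero. Measurement errors in the colors are ignored here; dividing by the color uncertainties would be a natural extension.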
Scaled Flux Radius¶
In the early rounds of the S/G classification challenge, flux_radius was found to be a powerful variable. However, like the peak surface brightness (mu_max), flux_radius should be PSF dependent. We find that this is indeed the case.
In an attempt to correct for this PSF dependence, we scale flux_radius by the PSF fwhm_mean as calculated from the SVA1 gold systematic maps. We find that this improves the agreement between the two data samples.
- The flux_radius variable should not be used without a PSF correction.
- By correcting flux_radius with the PSF fwhm_mean, reasonable agreement is achieved for the g and z bands. The other bands still show systematic offsets in their distributions.
- The PSF correction leads to an interesting bifurcation in the stellar locus in the i and z bands for the COSMOS+ data set.
- It appears that g-band is the safest band for using flux_radius.
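The PSF scaling described above amounts to dividing flux_radius by the local fwhm_mean. The sketch below assumes both quantities are available per object and in the same units (the optional `pixel_scale` converts flux_radius if they are not); the function name and the toy values are hypothetical.

```python
import numpy as np

def scaled_flux_radius(flux_radius, fwhm_mean, pixel_scale=1.0):
    """flux_radius / fwhm_mean, with NaN where fwhm_mean is not positive.

    pixel_scale converts flux_radius into the units of fwhm_mean; the
    default of 1.0 assumes they already match.
    """
    radius = np.asarray(flux_radius, float) * pixel_scale
    fwhm = np.asarray(fwhm_mean, float)
    out = np.full(np.broadcast(radius, fwhm).shape, np.nan)
    # Only divide where the PSF FWHM is valid; other entries stay NaN.
    np.divide(radius, fwhm, out=out, where=fwhm > 0)
    return out

# Toy values: two point sources in different seeing, one missing FWHM.
# For a Gaussian PSF the half-light radius equals FWHM / 2.
flux_radius = np.array([2.0, 3.0, 2.5])
fwhm_mean = np.array([4.0, 6.0, 0.0])
ratio = scaled_flux_radius(flux_radius, fwhm_mean)
```

In this idealized case both point sources land at a ratio of 0.5 regardless of seeing, which is the behavior the scaling is meant to achieve. The residual band-to-band offsets noted above show that real data do not follow this so cleanly.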
Christopher Bonnett made available a dataset containing the chi-squared of template fits of each object to star and galaxy templates. He suggests that these variables may be useful for star/galaxy separation. The output chisq spans a large range of values, so I plot its logarithm instead. Somewhat oddly, these variables do not appear to have strong star/galaxy separation power. NOTE: The chisq is calculated from a multi-band fit and represents a composite variable (one chisq per object, not one chisq per band). Thus, the different frames of the plot show only changes arising from the band chosen for mag_auto.
Figure: chisq_star