Emmanuel Bertin, 11/25/2013 03:53 AM

A Modest Proposal for Preliminary Star/Galaxy Separation

The best approach to star/galaxy separation that yields a reasonably complete selection of galaxies is as yet undetermined for SVA1. Maayane is working on her neural net, which uses spread_model, colors, and all the other available information, and which may simply solve our issues.

In the meantime, many people have been using a simple SPREAD_MODEL cut, but this is inadequate. At the bright end, many stars are getting misclassified as galaxies, and at the faint end unresolved galaxies are getting misclassified as stars. At the faint end the galaxy sample is relatively pure, but woefully incomplete.

Here I lay out a temporary stop-gap that is easy to implement. Suggestions welcome.

An easy-to-implement star/galaxy separation using SVA1 data

Updated 11/21/2013

Below are the details. But here is the topline.

A quick and dirty -- and much improved -- star/galaxy separation can be accomplished with the following selection:

(FLAGS_I <= 3) AND ((SPREAD_MODEL_I + 3*SPREADERR_MODEL_I) > 0.003) AND ((MAG_AUTO_I < 18.0 AND CLASS_STAR_I < 0.3) OR (MAG_AUTO_I > 16.0)) AND (MAG_PSF_I < 30.0)

By using the error in spread model we improve the selection of galaxies at the faint end tremendously. The CLASS_STAR cut at the bright end is because of a class of bright stars that have SPREAD_MODEL biased high but are obviously stars (see below).
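
For anyone who wants to try this outside of a database query, here is a minimal sketch of the selection above as a numpy boolean mask. It assumes the catalog has been loaded into a dict-like object of equal-length arrays keyed by the column names used above; the function name `select_galaxies` and the toy values are made up for illustration and are not part of any DES tool.

```python
import numpy as np

def select_galaxies(cat, nsig=3.0, spread_cut=0.003):
    """Return a boolean mask implementing the topline galaxy selection.

    `cat` is assumed to map column names to 1-D arrays (hypothetical layout).
    """
    good_flags = np.asarray(cat["FLAGS_I"]) <= 3
    # Consistent with being extended within nsig sigma.
    extended = (np.asarray(cat["SPREAD_MODEL_I"])
                + nsig * np.asarray(cat["SPREADERR_MODEL_I"])) > spread_cut
    mag = np.asarray(cat["MAG_AUTO_I"])
    bright_ok = (mag < 18.0) & (np.asarray(cat["CLASS_STAR_I"]) < 0.3)
    faint = mag > 16.0
    # Note: because of the OR, CLASS_STAR only matters for MAG_AUTO_I <= 16.0.
    valid_psf = np.asarray(cat["MAG_PSF_I"]) < 30.0
    return good_flags & extended & (bright_ok | faint) & valid_psf

# Toy values, invented purely to exercise each branch of the cut.
toy = {
    "FLAGS_I":           np.array([0, 0, 0, 0, 0]),
    "SPREAD_MODEL_I":    np.array([0.0001, 0.004, 0.001, 0.01, 0.01]),
    "SPREADERR_MODEL_I": np.array([0.0001, 0.0001, 0.001, 0.0005, 0.001]),
    "MAG_AUTO_I":        np.array([15.0, 15.0, 23.0, 20.0, 21.0]),
    "CLASS_STAR_I":      np.array([0.98, 0.99, 0.5, 0.02, 0.1]),
    "MAG_PSF_I":         np.array([15.0, 15.0, 23.2, 20.5, 99.0]),
}
mask = select_galaxies(toy)
```

In the toy catalog the second object is the kind of bright star with SPREAD_MODEL biased high that only the CLASS_STAR cut removes, and the third is a faint object that passes only because of the error term.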

I (Eli) have done some tests on Nacho's COSMOS truth catalog (yes, this is deeper than full depth...I will retest when the full depth equivalent catalog is available). First, the ACS flux radius vs mag_auto_i:

White points: stars in DES; blue points: galaxies in both ACS/DES; red points: erroneous stars in DES

Except at the faint end when we aren't resolving things, everything is actually looking very good. How does this look in terms of efficiency and purity? I've created two plots. One is a traditional SPREAD_MODEL_I cut, the other is using the error:

The efficiency is "true galaxies classified as such, over total true galaxies", and the purity is "true galaxies classified as such, over total objects classified as galaxies".
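
To make these two definitions concrete, here is a small sketch that computes both from a truth flag and a classification flag. The function name and the example arrays are illustrative assumptions; in practice one would evaluate this per magnitude bin against the truth catalog.

```python
import numpy as np

def efficiency_purity(is_true_galaxy, is_classified_galaxy):
    """Efficiency and purity of a galaxy selection, as defined above."""
    truth = np.asarray(is_true_galaxy, dtype=bool)
    picked = np.asarray(is_classified_galaxy, dtype=bool)
    n_correct = np.sum(truth & picked)   # true galaxies classified as such
    n_true = np.sum(truth)               # total true galaxies
    n_picked = np.sum(picked)            # total objects classified as galaxies
    efficiency = n_correct / n_true if n_true > 0 else float("nan")
    purity = n_correct / n_picked if n_picked > 0 else float("nan")
    return efficiency, purity

# Invented example: 4 true galaxies, 3 objects selected, 2 of them correct.
truth = [True, True, True, True, False, False]
picked = [True, True, False, False, True, False]
eff, pur = efficiency_purity(truth, picked)
```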

With the straight spread_model cut the efficiency goes to crap at the faint end. But with the error cut the efficiency looks quite nice, thank you very much!

There's still a problem of purity at the bright end. See Bright Galaxy Impurity. I think these are saturated stars that are getting misclassified as galaxies because they are mostly (but not all!) masked out (so the saturation flag isn't set!). I'm not sure if these are masked out by mangle, but they may be. TBD. But I believe these are the objects that are creating a kink in the n(mag) distribution that is apparent when comparing to DR8/Aardvark/etc.

And after implementing the MAG_PSF_I < 30.0 cut, the purity at the bright end improves:

Original info and details

The bright end

At the very bright end, the PSF is not perfect and the errors are very small. So the stars look a little bit elongated. The following plot shows spread_model_i vs mag_auto_i for a random sampling of objects in SVA1 (flags_i <= 3):

White points: all objects; blue points CLASS_STAR > 0.95; red dashed line spread_model_i < 0.002

The red dashed lines show a fiducial |spread_model_i| < 0.002 cut. (This maybe should be 0.003...?) (Nacho: agreed, we've been using abs(spread_model_i)>0.003 for some time in LSS).

The upturn at i<15.5 is obvious (and it creates an artificial peak in the number counts of bright galaxies). Diego has pointed out that a surface brightness cut helps select these objects. However, at the bright end CLASS_STAR does a very good job (points in blue). Thus, I suggest that at the bright end (i<17) anything with CLASS_STAR > 0.95 is a star (Nacho: agreed too, in LSS we have been using a CLASS_STAR < 0.3 cut for galaxies to eliminate artifacts as well. The reluctance to add CLASS_STAR was due to its very irregular behavior from tile to tile, much worse than spread_model, in simulations. Maybe your test could be repeated with CLASS_STAR?).


Per Diego's suggestion, I have added a flux_radius comparison plot.

White points: all objects; red points CLASS_STAR_I > 0.3; blue points: SPREAD_MODEL_I < 0.003-3*SPREADERR_MODEL_I

Looks similar to the above. The upturn in FLUX_RADIUS at the bright end is properly identified by CLASS_STAR and not by SPREAD_MODEL. However, there is also a population of objects marked as stars via SPREAD_MODEL_I that nevertheless have large FLUX_RADIUS_I. Visual inspection of a few of these shows that they are bright saturated stars where the saturation has been masked out in the finalcut processing...and for some reason they are not flagged as saturated by SExtractor. I have not checked if these objects are properly flagged by the mangle masks...nevertheless, these are bad objects and spread_model is properly rejecting them.

The mid end

In the middle range, spread_model works pretty well (and agrees with class_star!). Here we can use a simple class_star cut (but see below!):

White points: all objects; blue points CLASS_STAR > 0.95; red dashed line spread_model_i < 0.002
White points: all; red: CLASS_STAR > 0.3; blue: SPREAD_MODEL 3 sigma

In this regime everything behaves very similarly and sanely.

The faint end

At the faint end, both CLASS_STAR and SPREAD_MODEL start to fail.

White points: all objects; blue points CLASS_STAR > 0.95; red dashed line spread_model_i < 0.002
Green points: objects with SPREAD_MODEL_I + SPREADERR_MODEL_I > 0.002

However, at the same time the error on spread_model (SPREADERR_MODEL_I) is increasing, showing the uncertainty on SPREAD_MODEL.

What I suggest is that something is tagged as a galaxy if (SPREAD_MODEL_I + SPREADERR_MODEL_I) > 0.002. That is, if it is consistent with being part of the galaxy selection within 1 sigma then it should be counted as a galaxy. (Nacho: I tested this in the past without finding much improvement in DC6B, though the proof from you and William is telling us otherwise in data. I used SPREAD_MODEL + n*SPREADERR_MODEL where I found n best between 2 and 3 (must be in doc-sb somewhere). Good you brought this back, I was too lazy to test this again and I should have!).

This has the advantage of being a smooth, well motivated cut that applies at all magnitudes (at the bright end SPREADERR_MODEL_I is negligibly small). At i>23, basically all objects that were formerly classified as stars are now classified as galaxies (which we want!). But there is still a population of negative spread_model objects. Brief visual inspection shows that many (most? all?) of these are junk detections, and should be rejected.
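
As a rough illustration of how the error term changes the classification, here is a sketch of the cut with the number of sigma left as a parameter. The function name and the five example objects are invented for illustration; the default threshold of 0.002 and n=1 follow the suggestion above, while n=3 matches the updated selection.

```python
import numpy as np

def consistent_with_galaxy(spread_model, spreaderr_model, nsig=1.0, cut=0.002):
    """True where SPREAD_MODEL + nsig*SPREADERR_MODEL exceeds the cut."""
    spread_model = np.asarray(spread_model, dtype=float)
    spreaderr_model = np.asarray(spreaderr_model, dtype=float)
    return (spread_model + nsig * spreaderr_model) > cut

# Invented objects: bright star, two faint unresolved objects,
# a very negative junk detection, and a clearly extended galaxy.
spread = np.array([0.0002, 0.0005, 0.0012, -0.01, 0.01])
err    = np.array([0.00005, 0.001, 0.001, 0.002, 0.0005])

plain_cut = consistent_with_galaxy(spread, err, nsig=0.0)   # no error term
one_sigma = consistent_with_galaxy(spread, err, nsig=1.0)
three_sigma = consistent_with_galaxy(spread, err, nsig=3.0)
```

With these toy numbers the bright star is rejected for every n because its error is tiny, the faint objects move into the galaxy sample as n grows, and the very negative object stays out.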


White: all; red CLASS_STAR>0.3; blue SPREAD_MODEL nsig=3

In this case, virtually all (~95%) of objects with i>23 are classified as galaxies. But that doesn't necessarily mean the rest are stars! Visual inspection of a few of the significantly negative spread_model objects shows that they are junk (cosmic rays, noise peaks, etc.). Thus, this is a useful cut.

But what about the large flux_radius objects that are also flagged via CLASS_STAR? Visual inspection shows that these are none too convincing either.

Bottom line: there could be a lot of junk at the faint end. These cuts are probably helping.


A quick and dirty galaxy selection (updated):

(FLAGS_I <= 3) AND ((SPREAD_MODEL_I + 3*SPREADERR_MODEL_I) > 0.003) AND ((MAG_AUTO_I < 17.0 AND CLASS_STAR_I < 0.3) OR (MAG_AUTO_I > 16.5))

Zooming out, this is what it looks like:

White points/red points: galaxies via spread model; red points: stars via CLASS_STAR; blue points: stars via SPREAD_MODEL.

Note that only the very bright red points selected via CLASS_STAR will be rejected in this classification.

(Nacho suggests a slight modification:

(FLAGS_I <= 3) AND ((SPREAD_MODEL_I + 2*SPREADERR_MODEL_I) > 0.003) AND ((MAG_AUTO_I < 17.0 AND CLASS_STAR_I < 0.3) OR (MAG_AUTO_I > 16.5))

1. SPREAD_MODEL_I more conservative to select galaxies, as you suggest.
2. Try a larger SPREADERR_MODEL factor.
3. Reduce CLASS_STAR threshold, as things in the intermediate range are usually things the NN messed up to classify.
In general I'm moving towards being more conservative, more pure in the galaxy sample, and that could be my personal LSS bias.

Concerning FLAGS:
It has been seen (Tommaso, myself) that spread_model is not enough to get rid of the stars in the LMC. However adding a flags 0 cut makes the overdensity in that region go away ... actually it turns into an underdensity. Basically we are cutting out objects which have been deblended with this additional cut. If you allow deblending (no cut in flags), then many of these new deblended objects which are really LMC stars will have large spread_model values due to non-PSF like shapes after the deblending process (just my guess!), thus making them 'galaxies'.

But the problem remains, because this new cut in flags introduces a structure. The color-based classifier (Maayane) is certainly the way to go. Otherwise we must introduce a correction for this (for LSS))

(Eli comments here):

I think we have to give up on the LMC region. Doing extragalactic science behind it is pretty hopeless, we can't do SLR, our psf model is wrong (these blended stars made it into the star selection to build the psf model), etc.

I know that FLAGS 0, avoiding all deblended objects, is going to destroy the centers of clusters, so this is a no-go from the cluster perspective. (Though the more challenging objects are the blended galaxies that are not known to be blended: they have FLAGS == 0 even though HST imaging shows they are not single objects.)



I would still suggest avoiding the star/galaxy classification at magnitudes fainter than ~22, unless the science one has to do with the galaxy catalogue is not a statistical one. However, one can cut the catalogue even more a posteriori, and one can give people a suggestion on how to do it, especially at the faint end. A few comments:

  1. Eli, one could actually check where the objects classified by CLASS_STAR and SPREAD_MODEL lie in an i-band magnitude vs. FLUX_RADIUS_I plot (or FWHM, which can be obtained from FLUX_RADIUS and the average seeing). This plot will show the stellar locus, and one could assess how well stars and galaxies are classified by CLASS_STAR and SPREAD_MODEL (or SPREAD_MODEL+n*SPREADERR_MODEL). Star/galaxy selection via a magnitude vs. FLUX_RADIUS plot is common in the literature. For instance, it is used by the CFHTLS survey (an example can be found 'here'), and I've done this myself for star/galaxy separation on CFHTLS data. Maybe this plot could be an independent test to evaluate how well SPREAD_MODEL and CLASS_STAR perform?

Eli: This is a good point. I'll look at FLUX_RADIUS_I as well.

  2. From your bright and mid end plots, one can notice the agreement between CLASS_STAR and SPREAD_MODEL, but:
    a) One can't see the actual numbers from these plots. Points may overlap in the plot while the actual numbers are significantly different.
      A 2-D histogram could be useful to assess the agreement between CLASS_STAR and SPREAD_MODEL (that is, counting in magnitude and spread_model bins how many objects are also classified as galaxies by CLASS_STAR). Alternatively, one could take the magnitude distributions of objects selected by a SPREAD_MODEL cut and by a CLASS_STAR cut and then take the ratio of these two distributions, to check the actual agreement rate as a function of magnitude.
    b) Even after knowing better the agreement between CLASS_STAR and SPREAD_MODEL, one wouldn't know if they are actually performing well.
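
The ratio-of-distributions check suggested in the list above could be sketched roughly as follows. This assumes the two galaxy selections are already available as boolean arrays; the function name, the bin edges, and the six example objects are illustrative only.

```python
import numpy as np

def agreement_vs_magnitude(mag, gal_by_class_star, gal_by_spread_model, bin_edges):
    """Per-magnitude-bin counts, count ratio, and overlap of two galaxy selections."""
    mag = np.asarray(mag, dtype=float)
    a = np.asarray(gal_by_class_star, dtype=bool)
    b = np.asarray(gal_by_spread_model, dtype=bool)
    n_a, _ = np.histogram(mag[a], bins=bin_edges)
    n_b, _ = np.histogram(mag[b], bins=bin_edges)
    n_both, _ = np.histogram(mag[a & b], bins=bin_edges)
    n_either, _ = np.histogram(mag[a | b], bins=bin_edges)
    with np.errstate(divide="ignore", invalid="ignore"):
        # Ratio of the two magnitude distributions; NaN where undefined.
        ratio = np.where(n_b > 0, n_a / n_b, np.nan)
        # Fraction of objects picked by either cut that both cuts agree on.
        overlap = np.where(n_either > 0, n_both / n_either, np.nan)
    return n_a, n_b, ratio, overlap

# Invented example with three magnitude bins.
mag = [16.5, 16.7, 18.2, 18.4, 18.6, 20.1]
by_class_star   = [True, True, True, False, True, False]
by_spread_model = [True, False, True, True, True, False]
n_a, n_b, ratio, overlap = agreement_vs_magnitude(
    mag, by_class_star, by_spread_model, bin_edges=[16.0, 18.0, 20.0, 22.0])
```

The overlap fraction is included because a count ratio near 1 does not by itself mean the two cuts select the same objects, which is the concern about overlapping points raised above.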

Eli: True, but I don't trust CLASS_STAR; it's only a stopgap at the brightest end. I'm not going to sweat differences between CLASS_STAR and SPREAD_MODEL at fainter magnitudes, since these are guaranteed to be position/seeing dependent.

Anyway, maybe a way to go could be evaluating the actual agreement between CLASS_STAR and SPREAD_MODEL (or SPREAD_MODEL+n*SPREADERR_MODEL) as a function of magnitude (as described in point a) above) to decide where to put a magnitude cut above which not to perform a star/galaxy separation (this should be above 21, where the relative number counts differ by orders of magnitude).

So I would actually propose to couple CLASS_STAR and SPREAD_MODEL for the bright end (as proposed by Eli), use SPREAD_MODEL (or SPREAD_MODEL+n*SPREADERR_MODEL following Eli+Nacho suggestions) in the mid end out to a magnitude limit which could be assessed via point a) above, and no star/galaxy separation above the magnitude limit identified, as above this limit it won't matter anymore if we include stars as well. One caveat could be artifacts at the faint end, but if the latter have very peculiar SPREAD_MODEL values, these could be cut without affecting the star and galaxy number counts at the faint end.

Eli: At the faint end, the objects with very negative SPREAD_MODEL are usually junk. So it's good to cut these. Meanwhile, if you look at the plot above, if we move to n=2 or 3 *SPREADERR_MODEL, I think that a hard cut at the faint end is unnecessary since most objects (except the bad outliers) will be selected as being consistent with galaxies. All this with a smooth evolution that will vary properly with depth/psf/etc. (At least to first order).

Possible galaxy extraction:

(FLAGS_I <= 3) AND (other general cuts, like cut of crazy colours and so on) AND (

(MAG_AUTO_I < 17.0 AND ((|SPREAD_MODEL_I| + 2*SPREADERR_MODEL_I) > 0.003) AND CLASS_STAR_I < 0.3/0.95) or

(17.0 < MAG_AUTO_I < mag_limit_to_be_found AND ((|SPREAD_MODEL_I| + n*SPREADERR_MODEL_I) > 0.003)) or

(MAG_AUTO_I>mag_limit_to_be_found) )

One could cut more at the faint end a posteriori if need be and according to science needs.

What do you think?

Eli: Note that you don't want to do |SPREAD_MODEL_I| with the absolute values...very negative values are bad!

Diego: Yep, you're right!
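
For reference, the piecewise extraction proposed above, with the absolute value dropped as just agreed, might look like the sketch below. Several things here are assumptions: `mag_limit` and `nsig_mid` have no agreed values and must be supplied by the caller, the CLASS_STAR threshold defaults to 0.3 (0.95 was the other option mentioned), the "other general cuts" are left out, and the bin boundaries are assigned to the fainter branch so that no object falls between branches.

```python
import numpy as np

def piecewise_galaxy_selection(cat, mag_limit, nsig_mid,
                               class_star_max=0.3, spread_cut=0.003):
    """Three-regime galaxy mask; `cat` maps column names to 1-D arrays."""
    mag = np.asarray(cat["MAG_AUTO_I"], dtype=float)
    spread = np.asarray(cat["SPREAD_MODEL_I"], dtype=float)
    err = np.asarray(cat["SPREADERR_MODEL_I"], dtype=float)
    class_star = np.asarray(cat["CLASS_STAR_I"], dtype=float)
    flags = np.asarray(cat["FLAGS_I"])

    # Bright end: SPREAD_MODEL (signed, no absolute value) plus CLASS_STAR.
    bright = ((mag < 17.0) & ((spread + 2.0 * err) > spread_cut)
              & (class_star < class_star_max))
    # Mid range: SPREAD_MODEL + n*SPREADERR_MODEL only.
    mid = (mag >= 17.0) & (mag < mag_limit) & ((spread + nsig_mid * err) > spread_cut)
    # Faint end: no star/galaxy separation at all.
    faint = mag >= mag_limit
    return (flags <= 3) & (bright | mid | faint)

# Invented objects, one per interesting case.
toy = {
    "MAG_AUTO_I":        np.array([16.0, 16.0, 20.0, 20.0, 23.5, 20.0]),
    "SPREAD_MODEL_I":    np.array([0.005, 0.005, 0.001, 0.002, -0.02, 0.01]),
    "SPREADERR_MODEL_I": np.array([0.0001, 0.0001, 0.0005, 0.0005, 0.003, 0.0005]),
    "CLASS_STAR_I":      np.array([0.9, 0.1, 0.5, 0.5, 0.5, 0.1]),
    "FLAGS_I":           np.array([0, 0, 0, 0, 0, 4]),
}
mask = piecewise_galaxy_selection(toy, mag_limit=22.0, nsig_mid=3.0)
```

Note that the fifth toy object, a very negative SPREAD_MODEL detection beyond the magnitude limit, is kept by this version, so junk rejection at the faint end would have to be added as a separate cut.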

Emmanuel: just a small note. There are three main sources of dispersion in SPREAD_MODEL measurements:

  1. Photon + readout noise (which is actually what SPREADERR_MODEL measures); it is Gaussian to a fairly good approximation.
  2. PSF model errors; this includes interpolation inaccuracies, "frozen" turbulence residuals and the brighter-fatter effect.
  3. Confusion noise caused by overlapping sources.

I believe that component # 1 is pretty well understood and under control. Component # 2 becomes significant only for bright stars; the brighter-fatter effect can easily be modeled and should not be a concern in the long term. Confusion noise is heavily skewed in the regime of DES observations, and depends somewhat on the quality of the deblending (although tests conducted on SExtractor 3 so far haven't indicated much improvement).