
constructing validation test for DC2 dN/dmag #21

Closed

rmandelb opened this issue Oct 25, 2017 · 42 comments

@rmandelb
Contributor

Question for @janewman-pitt-edu @damonge and anybody else who is interested:

There is interest in making sure that DC2 has a galaxy population that is complete beyond the flux limit of LSST, to ensure that realistic levels of blending are included. We would like to do this validation test on proto-DC2, comparing dN/dmag against some realistic reference using DESCQA. The question is what reference sample we should use.

Currently, if we were to compare the DC2 population with, e.g., the COSMOS catalog, we could only test to i~25.2. The purpose of this issue is to solicit suggestions for catalogs we could use to make a deeper validation test. The extreme case would be the Hubble UDF, but that is so small that it should have significant cosmic variance. There are also HST surveys of intermediate depth and area (deeper than COSMOS but smaller in area; shallower than the UDF but larger in area). We could explore those?

The CatSim page has a comparison against data going to 27th magnitude: https://www.lsst.org/scientists/simulations/catsim but the plot caption does not say where the validation data comes from.

@rmandelb rmandelb added the CS label Oct 25, 2017
@egawiser

CANDELS offers public catalogs with significantly increased area vs. HUDF. I haven't plotted HUDF number counts, but since we're effectively integrating along the entire line-of-sight and (I think) just looking for a bias in DC2 catalog number counts versus data in broad bins of object brightness, we should be relatively robust to cosmic variance.

@janewman-pitt-edu

CANDELS volume is still tiny, and sample/cosmic variance is substantial. It's enough to affect number counts (and even more so redshift distributions) probably at the ~10% level. Maybe we're fine with that for this test. How deep does HSC go?

@janewman-pitt-edu

Actually, I think you're probably better off validating against the existing much larger-area but a bit shallower surveys (e.g. HSC + DES deep fields) rather than the tiny area HST data. You'd just truncate the comparison where the existing data is fully complete (e.g. around the 10 sigma limit of the photometry). It's probably OK to assume that the number counts in actuality continue to rise geometrically a mag or so fainter than they are observed to do so in those datasets.

@egawiser egawiser reopened this Oct 25, 2017
@egawiser

We should do whatever is easiest that offers the needed precision. I will guess however that if we were to require our simulated number counts beyond LSST depth (e.g., r~28) to match reality at anything close to the 10% level, it might take a decade for the DC2 catalogs to pass QA. The goal is "realistic levels of blending", so I would interpret that as wanting to be within a factor of 2 at such number counts and hoping to be off by only 50% (in either direction). Can anyone quantify the requirement more precisely?

@rmandelb
Contributor Author

rmandelb commented Oct 25, 2017

Well, for starters, the total blend rate should scale like the square of the overall number density, shouldn't it? (at least in the limit that clustering is not a big contributor to the blend rate, which is true for LSST according to @dkirkby ) So if the overall number density is wrong by say 10%, the blend rate would be wrong by 20%.

But that's not precisely what we care about, in the sense that we care about the rate of blends as a function of flux ratio, so you could imagine thinking of the blend rate in a 2d plane defined by brighter vs. fainter galaxy mag. In a given bin in that plane, the blend rate depends linearly on the number densities on each axis. It seems like this becomes a requirement on the number density as a function of mag, which was why I was proposing looking at dN/dmag in the first place. But it's complicated because for example we might care more about similar flux ratio blends within the limit of the gold sample, and so getting dN/dmag right for something 1.5 mag fainter than that is less important (that becomes noise on the outskirts of the gold sample galaxies, which we do want to include in the sims, but it probably has a smaller impact than blends with something at the same mag). I would imagine we want dN/dmag to 10% down to r~26, but beyond that our requirements are quite a bit less stringent. 30%? I'm guessing at this point.
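As a toy illustration of the scalings above (a minimal sketch with purely illustrative numbers; `n_bins` is not real data):

```python
import numpy as np

# Quadratic scaling in the unclustered limit: a 10% error in the overall
# number density n gives roughly a 20% error in the two-object blend rate.
n_err = 0.10
blend_rate_err = (1 + n_err)**2 - 1   # = 0.21, i.e. ~20%

# Per-bin version: in the 2d plane of (brighter mag, fainter mag), the blend
# rate in each bin pair scales as the product of the two number densities,
# up to a constant factor set by the blending cross-section.
n_bins = np.array([1.0, 2.3, 5.2, 11.9, 27.2])  # illustrative counts per mag bin
pairwise_rate = np.outer(n_bins, n_bins)        # relative blend rate per bin pair
```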

I believe it may have been one of @mdschneider @joezuntz @rmjarvis @esheldon who articulated the need to have a realistic blending rate in the DC2 sims for WL, so perhaps one of them can weigh in either on how realistic is good enough, or on a good validation dataset. (Michael, Joe, Mike, Erin - please read the first post in this thread for an explanation of what this is about)

@rmjarvis

I don't know that we need to match it perfectly. The goal of DC2 is to test the ability of our code to handle the kinds of blends that are in real data. Not to precisely match the level of blending that we will see in some year of LSST data. In other words, it is a validation sim, not a calibration sim.

So I'd say a factor of two or better is good enough for our purposes. If it were really 10% squared = 20%, I'd be ecstatic.

@wadawson

We showed in Dawson et al. (2016) that the number density of ambiguously blended galaxies grows roughly as the ~3rd power of the galaxy number density (not the square; see Figure 1). In Section 5.1 we note why this exponent is greater than 2:

"Thus, when n is far from the confusion limit, the index α from Equation (19) will tend to be >2. It can be greater than 2 in part due to clustering, but to a larger degree it is because dP is a detect blend function of the deblending threshold that is usually defined with respect to the pixel level background. The number of low S/N galaxies increases as the limiting magnitude increases because the number density is observed to scale as n ~ 100.31(m-20) (Tyson 1988; Beckwith et al. 2006), where m is the limiting apparent z850 magnitude of the survey. Because it is more difficult to detect blends of low S/N objects, we should expect larger blend fractions with increasing survey limiting magnitude. Therefore α should be an increasing function of survey limiting magnitude until it levels off near the confusion limit."

Historically, in earlier versions of the paper and presentations to DESC we noted the impact of ambiguous blending on LSS number counts as an important systematic.

@rmandelb
Contributor Author

@wadawson - yep, I was using an approximation that may not be correct for LSST, probably for multiple reasons.

> Historically, in earlier versions of the paper and presentations to DESC we noted the impact of ambiguous blending on LSS number counts as an important systematic.

Absolutely - that isn't really in question here; I think that message has been pretty well internalized by the group. The only issue is how accurately we need to represent the blend rate in DC2 given what the goal of DC2 is (which is not the same as asking how well we want to be able to represent it, e.g., for dedicated blending studies).

@janewman-pitt-edu

janewman-pitt-edu commented Oct 26, 2017

I think people are overcomplicating things here... in all the datasets I've seen, number counts rise geometrically with only a very very slow change in power law slope as a function of magnitude, until incompleteness sets in (see, e.g., Fig. 15 of https://academic.oup.com/mnras/article/423/3/1992/2459951/Galaxy-properties-from-the-ultraviolet-to-the-far ). This is true to at least i~29 in HST ACS data.

This means that one can analytically extend the magnitude distribution from a shallower (but wider, so low sample/cosmic variance) survey by a magnitude or two with much less than 10% uncertainty (I would suggest measuring the logarithmic derivative of dN/dm and its second derivative as well to see if the latter is significant).
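A minimal sketch of that measurement (assuming `mag` and `dN_dm` come from a wide, complete survey; the values below are illustrative):

```python
import numpy as np

mag = np.arange(20.0, 25.5, 0.5)          # magnitude bin centers, illustrative
dN_dm = 40.0 * 10**(0.36 * (mag - 25.0))  # illustrative power-law counts

logN = np.log10(dN_dm)
slope = np.gradient(logN, mag)       # d log10(dN/dm)/dm; constant for a power law
curvature = np.gradient(slope, mag)  # second derivative; ~0 if extrapolation is safe
```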

@rmandelb
Contributor Author

@janewman-pitt-edu - I agree with you about analytically extending something from a shallower survey, but I think you may have misunderstood the 10% (or whatever %) discussion. That wasn't a discussion of the error we'd allow on the validation data, which I agree we can easily know well. It was a discussion of the error we'd allow in DC2, i.e., if the number counts and therefore blend rates don't quite match those expected in reality, how far off are we OK with them being.

@janewman-pitt-edu

Ah, thanks for clearing that up! (Oh, and I definitely agree with @dkirkby that the majority of blends at LSST depths are from very different redshifts, though I'm not certain what the fraction is exactly).

@rmandelb
Contributor Author

rmandelb commented Nov 1, 2017

Update on this issue. As part of our SRD work, I used the publicly released HSC Deep survey layer to measure dN/dmag (including absolute normalization and slope) in i-band. I also did power-law fits to dlogN/dmag and to the cumulative counts as a function of limiting mag, so we can extrapolate, given that these power laws are generally pretty accurate across a wide range of magnitudes, as Jeff said.

What I got from HSC Deep was a best-fitting number density (in units of arcmin^-2) of
N(<ilim) = 43 * 10^(0.36 (ilim - 25))
where ilim is the i-band magnitude limit of the sample. So, to be clear, these are cumulative number densities, not differential. If you want differential counts, it's just a multiplicative factor to convert.
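In code, that fit and the cumulative-to-differential conversion look like this (a sketch: for N(<m) = A·10^(a(m−m0)) the differential counts are dN/dm = A·a·ln(10)·10^(a(m−m0)), i.e., the multiplicative factor is a·ln 10):

```python
import numpy as np

A, a, m0 = 43.0, 0.36, 25.0   # arcmin^-2; parameters of the fit quoted above

def n_cum(ilim):
    """Cumulative number density N(<ilim), arcmin^-2."""
    return A * 10**(a * (ilim - m0))

def dn_dmag(m):
    """Differential counts dN/dm, arcmin^-2 mag^-1."""
    return A * a * np.log(10) * 10**(a * (m - m0))
```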

Just to throw out numbers, it would be nice to agree with this at the 20% level at ilim=25.3 and 40% at ilim=27.

Would this work for a validation test?

@yymao
Member

yymao commented Nov 2, 2017

@rmandelb I think that's a great validation test that we can do. We can keep track of the status of these tests in #30.

@rmandelb
Contributor Author

rmandelb commented Nov 2, 2017

OK, so in the interest of not having too many open issues that duplicate each other, do you want to close this issue (now that we've decided what the test should be) and, in #30, link to the final comment with the power-law fit on this page?

@yymao
Member

yymao commented Nov 2, 2017

@rmandelb sure. I don't have the permission to close issues in this repo though.

@katrinheitmann
Contributor

katrinheitmann commented Nov 2, 2017 via email

@evevkovacs
Contributor

evevkovacs commented Nov 2, 2017 via email

@katrinheitmann katrinheitmann added this to the DC2 CS Freeze milestone Nov 8, 2017
@duncandc

duncandc commented Dec 4, 2017

@rmandelb would you mind sharing the code you used to calculate N(<i_lim) with the HSC data, or pointing me to a usable catalog so I can calculate it myself?

@rmandelb
Contributor Author

rmandelb commented Dec 4, 2017

The code I used to calculate it is in another DESC repo, so I'm happy to share, but here's a question: do you want the actual N(<i_lim)? Or a power-law fit to it that describes the data well? I would think the latter might be better because then you don't have to worry about what is the histogram binning, etc.?

For reference, the fit is N(<i_lim) = 42.9 [10^(0.359 (i_lim-25))] per arcmin^2, shown as the black dashed curve on this plot (compared with real data in the blue solid curve):
[plot: cumulative i-band counts from HSC Deep (blue solid curve) with the power-law fit N(<i_lim) = 42.9 × 10^(0.359 (i_lim-25)) arcmin^-2 (black dashed curve)]

The deficit at bright magnitudes is due to selection criteria that remove some bright objects (i=22 is very bright for HSC), and the deficit at faint magnitudes is due to incompleteness. Obviously, beyond i~25.5 there is some major extrapolation happening.

@duncandc

duncandc commented Dec 4, 2017

What I would like is some self-contained code that does the following (sketched below):
1.) downloads the necessary HSC data
2.) calculates the cumulative number density of galaxies in a set of bands
3.) interpolates the above calculation to get the cumulative number density on regularly sampled intervals
4.) fits a power law over some range
5.) defines an upper mag limit beyond which the power-law fit is used to extrapolate
6.) prints out a simple ASCII lookup table with mag and N(<mag) for a set of bands that I can use for the test.

In principle, the above is something that someone could provide me, but I would feel better if I could follow it through the steps, make sure it is reproducible, and use code that is as similar as possible to create more lookup tables from different surveys.
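A minimal sketch of steps 2-6, assuming step 1 has already produced a plain list of per-object i-band magnitudes (the file names, survey area, fit range, and extrapolation limit below are all illustrative assumptions):

```python
import numpy as np

mags = np.loadtxt("hsc_deep_imag.txt")     # hypothetical per-object magnitudes
area_arcmin2 = 27.0 * 3600.0               # assumed effective survey area

# Step 3: cumulative number density on a regular grid of magnitude limits.
grid = np.arange(20.0, 28.0 + 1e-6, 0.1)
n_cum = np.array([(mags < m).sum() for m in grid]) / area_arcmin2

# Step 4: power-law fit to log10 N(<m) over a range where counts look complete.
fit = (grid >= 22.0) & (grid <= 25.0)
slope, intercept = np.polyfit(grid[fit], np.log10(n_cum[fit]), 1)

# Step 5: beyond the assumed completeness limit, replace the data with the fit.
extrap = grid > 25.0
n_out = np.where(extrap, 10**(slope * grid + intercept), n_cum)

# Step 6: write the lookup table.
np.savetxt("hsc_i_ncum.txt", np.column_stack([grid, n_out]),
           header="mag  N(<mag) [arcmin^-2]")
```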

@rmandelb
Contributor Author

rmandelb commented Dec 4, 2017

OK, didn't realize you wanted to download the data. Just a few comments:

  • Your item (1) is going to depend a lot on the survey (what cuts/flags to use, best photometry to use, etc.)
  • In practice, I recommend going from the differential rather than cumulative counts. The reason for this is that at the bright end, you will often miss some objects due to cuts on e.g. interpolated pixels or CCD edge cuts or whatever (big bright objects cover more pixels), and if you use the differential counts then you can pick the mag range that is less affected. Then you can fit a power law, and then convert the power-law to cumulative counts. Do you see what I mean?
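A minimal sketch of this differential-fit route, with illustrative bin values (not the actual Requirements-repo script; if dN/dm = B·10^(b·m), integrating gives N(<m) = B·10^(b·m)/(b·ln 10)):

```python
import numpy as np

mag_mid = np.arange(22.25, 25.0, 0.5)             # bin centers, illustrative
dn = np.array([4.1, 6.2, 9.4, 14.2, 21.5, 32.6])  # dN/dm values, illustrative

# Fit log10(dN/dm) = b*m + log10(B) over bins unaffected by bright-end cuts.
b, logB = np.polyfit(mag_mid, np.log10(dn), 1)

def n_cum(m):
    """Cumulative counts implied by the differential power-law fit."""
    return 10**logB * 10**(b * m) / (b * np.log(10))
```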

I'm going to separately e-mail you an SQL script for downloading the HSC data, and a python script that does the process I described above for HSC Deep i-band data. The latter is already in the Requirements repo which I believe you don't have access to. I'm happy for comments on any of it, or for the code to be put into this repo as needed.

@duncandc

duncandc commented Dec 4, 2017

My first-pass implementation idea is to simply use the cumulative number density at a single magnitude limit as the 'test', rather than the full apparent magnitude function in bins. I think this gets at the basic quantity of interest. It is a simple extension to test at a few different magnitude limits if the shape of the apparent magnitude function is important.

@evevkovacs
Contributor

@duncandc How about using the fit for a start? Ideally, there would be papers which do the fitting and to which you could refer, but I understand that these may not exist yet for LSST-like depths. But initially, if we just want to get an idea of how the catalogs look, using Rachel's fit would be quickest. It is not obvious to me that the WGs will need fancier cuts or options, so it might be a good idea to get feedback from them before putting a lot of work into a more general structure.

@duncandc

duncandc commented Dec 4, 2017

That's the path I am taking. The test itself is exceedingly simple; I am also interested in the data.

@rmandelb
Contributor Author

rmandelb commented Dec 4, 2017

We definitely care about getting the shape of the number counts roughly right.

I just sent you some stuff by e-mail so you can download the data with the same SQL query and process it with my script (or modify the script as needed). I should emphasize that I only did the i-band cumulative counts, so if we want to test >1 band, we need to at least rerun the script for r-band (which I can do myself if you want, because I did download the r-band magnitudes) and possibly others (which would involve re-downloading the data with more information).

@yymao
Member

yymao commented Dec 4, 2017

@duncandc are you thinking of generating the validation data on the fly every time the test is run? We'd need to consider computational resources.

@duncandc

duncandc commented Dec 4, 2017

I am not considering that. I am putting on my 'creating validation data' hat.

@janewman-pitt-edu

I agree we want to get the shape of the number counts right. However, we also want to be sure to do that test only where counts are highly complete (i.e., at least 0.5-1 mag brighter than where the counts turn over, and probably avoiding the bright end to minimize the impact from stars).

Best,

Jeff

@evevkovacs
Contributor

@duncandc @janewman-pitt-edu @rmandelb @yymao Duncan has worked on a test and example plots can be seen here. There are some normalization issues, but even so, we can now start to think about specifying validation criteria, such as defining the validation region to be used and the required level of agreement with the data.

@rmandelb
Contributor Author

@evevkovacs @duncandc - thanks for sharing these plots. Apologies if I am missing where to find this information, but what is each curve? (i.e., which is validation dataset vs. the simulation we are validating, and why does one plot have 3 curves but the other plots have 2 curves?)

Is there anything I can help with to try to track down normalization issues? For example, Duncan, if this comes from your own recalculation based on downloading HSC data, do you want to compare with the normalizations that I have calculated to see if we agree about the data?

Just to throw out an idea for validation criteria: I think we'd like to sample a few i-band magnitude values within the range where HSC is clearly complete. Let's say i=24, 24.5, 25. We'd like N(<i) in the data and simulations to agree to within ~20% at those values of i-band magnitude. It would also be nice to take the power-law extrapolation of the HSC density for 22<i<25, and require N(<i=27) in the sims and extrapolated power-law from data to agree to within ~40%.
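Those criteria could be encoded roughly like this (a sketch; `n_sim`, `n_hsc`, and `n_extrap` are assumed callables returning cumulative densities in arcmin^-2):

```python
def passes_validation(n_sim, n_hsc, n_extrap):
    """Check the proposed criteria: ~20% agreement with HSC at i = 24, 24.5, 25,
    and ~40% agreement with the power-law extrapolation at i = 27."""
    for i in (24.0, 24.5, 25.0):
        if abs(n_sim(i) / n_hsc(i) - 1.0) > 0.20:
            return False
    return abs(n_sim(27.0) / n_extrap(27.0) - 1.0) <= 0.40
```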

@yymao
Member

yymao commented Dec 12, 2017

@rmandelb the one with 2 curves + data points is the summary plot (i.e., overplotting all catalogs), and the other plots are for individual catalogs (i.e., without overplotting the other catalogs) --- the contents are the same, though.

Maybe you can help reviewing this notebook @duncandc put together for getting the HSC data? You can also find the corresponding PR at LSSTDESC/descqa#46

@duncandc

As for the normalization, I will check it against @rmandelb 's previous calculation.

Is there a better way to estimate "error bars" on N(<m) than back-of-the-envelope estimates? Asking for O(10%) agreement seems arbitrary to me.

@rmandelb
Contributor Author

I'm happy to review the notebook and PR. Some cosmetic updates that would be helpful for people who look at this test:

  • the plot should say which is validation data and which is the sim
  • the summary plot should say which sim is which
  • should have an axis label for the horizontal axis that says what band is being shown
  • it would be helpful to display number densities in number per square arcmin, rather than number per square degree (I know it's a minor point, but many observers have a rough rule of thumb for expected density per square arcmin in imaging data, so looking at the plot in those units without having to divide by 3600 makes it easier to do sanity checks).
  • having the vertical axis explicitly say this is a number density rather than a raw number -- e.g., by noting the units -- would be helpful.

I will try to review the notebook and PR in the next day or so.

@yymao
Member

yymao commented Dec 12, 2017

@rmandelb thanks for the suggestions. This validation test, as you can tell, is a sprint product and is still under review on a different PR LSSTDESC/descqa#47, so your suggestions are very useful and I believe @duncandc will implement them :)

@rmandelb
Contributor Author

@duncandc - thanks for checking against my previous calculation, that will be helpful.

> Is there a better way to estimate "error bars" on N(<m) than back-of-the-envelope estimates?

The statistical uncertainty on N(<m) due to number counts alone is very small given the area of the HSC survey and of the sims. The systematic uncertainty due to, e.g., whether we've defined the samples in totally consistent ways in data and sims is larger and hard to quantify, but roughly speaking it can give order 10% uncertainties (this is pretty much an empirical statement based on comparing sims and data many times and seeing how big the changes from that kind of thing can be). Maybe somebody else will have a better idea than I do...

@evevkovacs
Contributor

evevkovacs commented Dec 12, 2017

@duncandc @rmandelb @yymao So to summarize a first pass attempt at validation criteria, we have:

1) ~~Add a 10% systematic uncertainty to the HSC data points. (I would guess these are correlated, but we can ignore this for a first pass)~~
2) Check that the catalog data agree within 20% ~~(so that's 20% + 10% added in quadrature)~~ with the HSC data in the validation region 24 < i < 25
3) Check that the catalog data agree within 40% ~~(i.e. 40% + 10% in quadrature)~~ with the extrapolation of the power-law fit to the HSC density at i = 27

Items ~~1) and 2)~~ 2) and 3) are normalization checks. Initially just check if the catalog falls in the required range; we can come up with something fancier if need be.

Plot enhancements in addition to those mentioned by Rachel are:

  1. Show the HSC power-law fit on the plot (and add to legend)
  2. Show the validation regions as shaded vertical bands. Actually, I think we opted to shade the non-validation region for DESCQA-v1, but we'll have to see how it looks.

Does anyone have any other suggestions or improvements?

@rmandelb
Contributor Author

Not exactly. I wasn't proposing to do (1) in addition to the others. Rather, I was saying that (2) and (3) are meant to incorporate that 10% uncertainty.

@evevkovacs
Contributor

@rmandelb Thanks very much. I adjusted the comment above and struck through the incorrect steps.

@rmandelb
Contributor Author

@yymao @duncandc - I wanted to check something: is it OK if I wait to review the PR until after the changes I already suggested get implemented? I would prefer to just review it once, if that works for you. Ping me once the changes are made and it's ready for review?

@duncandc

@rmandelb, sure thing. I will ping you when I have made those changes.

@yymao
Member

yymao commented Dec 14, 2017

@rmandelb yes --- and JFYI, we (the DESCQA team) encourage early PRs so people can see the progress. Thus, it is totally OK to defer review until the PR submitter says it's ready for review! In the meantime, leaving comments and suggestions in the PR (i.e., not here) would be very helpful :)

@katrinheitmann
Contributor

Closing this issue now since it is taken care of in the descqa repo: LSSTDESC/descqa#7
