Implement regression tests #174
It sounds like there will be a way to track this with the SQuaRE verification framework. Roughly, if we develop a set of metrics (avg runtime, total iterations, max iterations, final color residuals, etc.) and run the regression tests in scarlet to calculate those for a given PR in Travis, there should be a way to export that data to SQuaRE, track how those values change over time, and notice a regression on the test images before the code is merged to master. So I propose that we use the remainder of this issue to discuss which metrics we'd like to track and to generate a set of images that we will use to calculate those metrics. I'll try to propose an initial set of images and a suggested set of metrics early next week, but feel free to chime in if you have other ideas before then.
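As a rough sketch of what one of those per-blend metric records might look like (the `fit_blend` callable, the metric names, and the export format are assumptions for illustration, not an existing scarlet or SQuaRE API):

```python
import json
import time

import numpy as np


def measure_blend(fit_blend, images):
    """Fit one test blend and collect the quantities we might track.

    `fit_blend` stands in for whatever function runs scarlet with default
    parameters and returns the model image and the number of iterations.
    """
    start = time.time()
    model, n_iter = fit_blend(images)
    residual = images - model
    return {
        "runtime_s": time.time() - start,
        "iterations": int(n_iter),
        "max_abs_residual": float(np.max(np.abs(residual))),
        "sum_abs_residual": float(np.sum(np.abs(residual))),
    }


def export_metrics(records, path="metrics.json"):
    """Dump all records to a file that a separate job could push to SQuaRE."""
    with open(path, "w") as f:
        json.dump(records, f, indent=2)
```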
So, to clarify, we run the regression tests through Travis and only report the results to SQuaRE? I'm worried about Travis timeouts.
I'm not too worried about Travis timeouts. Per their docs, a job is terminated if it runs longer than 50 minutes or produces no log output for 10 minutes.
Since most ground-based blends run in 1-10 s, with larger blends taking 1-2 minutes, we should be able to run a few dozen test blends and still fit in the 50 minute window; we'll just have to emit log output after every test so that we don't hit the 10 minute no-output timeout. The tricky part will be multi-resolution testing. Those blends take much longer to run, so if we develop a set of multi-resolution test images we may have to find some other way to perform regression testing, or just pick 1 or 2 examples that we feel are representative of multi-resolution blends in general.
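A minimal sketch of how the test runner could keep the log active so Travis never sees 10 quiet minutes (the blend list and the `run_blend` fit-and-measure helper are hypothetical):

```python
import time


def run_regression_suite(blends, run_blend):
    """Run every test blend, printing one line per blend so Travis keeps seeing output."""
    results = {}
    for name, images in blends:
        start = time.time()
        metrics = run_blend(images)  # hypothetical fit-and-measure call
        elapsed = time.time() - start
        # flush=True guarantees the line reaches the Travis log immediately
        print(f"[regression] {name}: {elapsed:.1f}s, "
              f"{metrics['iterations']} iterations", flush=True)
        results[name] = metrics
    return results
```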
Do you want to trigger this for every branch or only when merging to master? I can see waiting for 1 h until Travis comes back as being quite annoying.
I want to do it before we merge to master, so that we catch breaking changes before they happen. But I don't think this will take anywhere near 1 h to run; for example, I'm looking for test blends now and we can run the 100 best two-source blends in under 2 minutes. So I figure we'll use those, another 100 or so blends of larger size, and a dozen or so select blends that we know are challenging from visual inspection (I'll visually inspect all of them before including them in the test repo). I hope to have the whole thing run in under 20 minutes, maybe even 10 (assuming that the Travis servers have speeds comparable to lsst-dev). That doesn't seem like too long to wait after pushing a commit to a PR before merging it, compared to the benefit that we get from it.
Fine if it's <~ 10 minutes. Another concern: we already have a minimal regression test on the logL in the quickstart guide, and I understand that you want to expand the test cases. We routinely see changes in the logL, and we usually understand which change caused the different logL. I don't want to be in a situation where we have to update multiple tests every time we see such a change.
So, to clarify your previous comment (you'd like to see the change before it is merged into master): you want a setup where a regression fails a test so that we cannot merge, as opposed to simply running the regression tests and reporting the new master values to SQuaRE. I'm still not a fan of striving for test completeness on our end.
I think that I should clarify what I mean by regression tests. This won't be as hard-coded as checking the value of a logL and failing the test if it is under a certain value, and these won't be tests that "fail." Even the DM stack doesn't go as far as checking those things. My current plan (open to suggestions) is as follows:
For the blends in all three datasets:
The test script will update a data file kept in a new repo, something like
For the special blends in group 3:
So the idea is to give us a way to track how scarlet performs over time on a sample of objects, and it will be the decision of the code reviewers whether or not any loss of accuracy or performance is acceptable. The only thing that will need to change once these tests have been created is the code that executes scarlet with the default parameters, if that API changes at all.
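To make the "reviewer decides" part concrete, here is a small sketch of how a PR's tracked values might be compared against the last recorded master values; the data layout and the 5% highlight threshold are assumptions, not anything that exists in scarlet today:

```python
def compare_to_master(pr_metrics, master_metrics, rel_tol=0.05):
    """Report the fractional change of each tracked quantity relative to master.

    Nothing here fails a test; the report is only meant to inform code review.
    """
    report = {}
    for key, master_value in master_metrics.items():
        new_value = pr_metrics.get(key)
        if new_value is None or master_value == 0:
            continue
        change = (new_value - master_value) / abs(master_value)
        report[key] = {
            "master": master_value,
            "pr": new_value,
            "fractional_change": change,
            "flag": abs(change) > rel_tol,  # highlight for the reviewer, don't fail
        }
    return report
```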
That would be good, yes. It sounds like this wouldn't need changes on our side beyond modifying the Travis script.
I'm not sure of the best way to do this, but PR #173 brought up that we sometimes implement changes that can affect the performance of an algorithm, and right now the only way we test for that is to have me run on the HSC test data after the code has already been merged. The Rubin Observatory DM pipeline does have a method for tracking quantities like the ones I describe in this issue, but it too only does so after the code has been merged. So we won't learn about regressions until master has been updated, which is not what I would like.
Ideally we should have some subset of images where we track things like the maximum absolute value of a residual, the sum of the absolute value of the residual, the loss of the initial model, the total number of iterations, the final photometry, the final colors, etc. We should choose some subset of the HSC images with fake sources injected to run our tests on. The test blends should cover a wide variety of cases: single sources, two-source blends (the majority of the cases for HSC/Rubin Observatory), and multi-source blends with different degrees of blending. Ideally it should take < 10 minutes to run the entire dataset, which, at the current runtime, means about 1000 total sources in all of the blends combined.
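As an illustration (not scarlet's actual API), most of those quantities could be computed directly from the observed images and the final model; the array shapes and the simple flux-ratio stand-in for colors below are assumptions:

```python
import numpy as np


def blend_metrics(images, model, n_iter):
    """Compute the residual and photometry quantities we might track for one blend.

    `images` and `model` are assumed to be (bands, height, width) arrays; the
    per-source photometry/colors would need the individual source models, which
    are omitted here for brevity.
    """
    residual = images - model
    flux = model.sum(axis=(1, 2))   # per-band flux of the full model ("photometry")
    colors = flux / flux.sum()      # a simple stand-in for the final colors
    return {
        "max_abs_residual": float(np.max(np.abs(residual))),
        "sum_abs_residual": float(np.sum(np.abs(residual))),
        "iterations": int(n_iter),
        "flux": flux.tolist(),
        "colors": colors.tolist(),
    }
```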
I'm a bit fuzzy about how to do the next part, but here's what I would like (in a perfect world), just to open up the discussion: each time a PR is made, a script runs scarlet on the set of test blends and generates the quantities that we want to track. Those quantities are added to a data file that keeps track of how each of those quantities has changed with each new PR that is merged, so that we can verify that a PR isn't breaking something.
The tricky part to me is how to do this before the code is merged to master. My first thought (and one that I admit is not very good) is to make a new `scarlet_verification` repo that contains only a csv file to keep track of all the metrics and a docs directory with a single notebook that displays a plot for each tracked quantity over time. We update our Travis script so that each time someone updates a PR in `scarlet` it checks the csv file for an entry with that PR number. If it already exists then it overwrites the entry, otherwise it creates a new entry in the file with the PR number and the measurements for each quantity, then builds the docs for `scarlet_verification`. That way the `scarlet_verification` repo will always give a snapshot of the current values of each quantity, displaying them in plots in the docs, for all of the PRs (and master) in `scarlet`.

Alternatively there might be a way within Rubin Observatory DM to trigger its verification framework with a scarlet PR, in which case we'll get a much better set of tools that can (hopefully) be kept up to date in real time as well. So I'll check on this from my end.
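A sketch of the csv bookkeeping that Travis job could do, assuming one row per (PR, metric) pair; the file name and columns are made up, and `TRAVIS_PULL_REQUEST` is the environment variable Travis sets to the PR number (or `false` for branch builds):

```python
import csv
import os
from pathlib import Path

CSV_PATH = Path("metrics.csv")  # hypothetical single csv file in scarlet_verification
FIELDS = ["pr", "metric", "value"]


def update_verification_csv(measurements):
    """Overwrite this PR's rows if they already exist, otherwise append new ones."""
    pr = os.environ.get("TRAVIS_PULL_REQUEST", "master")
    rows = []
    if CSV_PATH.exists():
        with CSV_PATH.open() as f:
            # Drop any previous rows for this PR so a rerun overwrites its entry.
            rows = [r for r in csv.DictReader(f) if r["pr"] != pr]
    rows += [{"pr": pr, "metric": k, "value": str(v)} for k, v in measurements.items()]
    with CSV_PATH.open("w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(rows)
```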