Implement regression tests #174
It sounds like there will be a way to track this with the SQuaRE verification framework. Roughly, if we develop a set of metrics (avg runtime, total iterations, max iterations, final color residuals, etc.) and run the regression tests in scarlet to calculate those for a given PR in Travis, there should be a way to export that data to SQuaRE, track how those values change over time, and notice a regression on the test images before the code is merged to master. So I propose that we use the remainder of this issue to discuss which metrics we'd like to track and to generate a set of images that we will use to calculate those metrics. I'll try to propose an initial set of images and a suggested set of metrics early next week, but feel free to chime in if you have other ideas before then.
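As a rough sketch of what one of those per-blend metric records might look like (the `fit_blend` callable, the metric names, and the export format are assumptions for illustration, not an existing scarlet or SQuaRE API):

```python
import json
import time

import numpy as np


def measure_blend(fit_blend, images):
    """Fit one test blend and collect the quantities we might track.

    `fit_blend` stands in for whatever function runs scarlet with default
    parameters and returns the model image and the number of iterations.
    """
    start = time.time()
    model, n_iter = fit_blend(images)
    residual = images - model
    return {
        "runtime_s": time.time() - start,
        "iterations": int(n_iter),
        "max_abs_residual": float(np.max(np.abs(residual))),
        "sum_abs_residual": float(np.sum(np.abs(residual))),
    }


def export_metrics(records, path="metrics.json"):
    """Dump all records to a file that a separate job could push to SQuaRE."""
    with open(path, "w") as f:
        json.dump(records, f, indent=2)
```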
So, to clarify, we run the regression tests through Travis and only report the results to SQuaRE? I'm worried about Travis timeouts.
I'm not too worried about Travis timeouts. Per their docs, a job is terminated if it runs longer than 50 minutes or produces no log output for 10 minutes.
Since most ground-based blends run in 1-10 s, with larger blends taking 1-2 minutes, we should be able to run a few dozen test blends and still fit in the 50 minute window; we'll just have to emit log output after every test so that we don't hit the 10 minute no-output timeout. The tricky part will be multi-resolution testing. Those blends take much longer to run, so if we develop a set of multi-resolution test images we may have to find some other way to perform regression testing, or just pick 1 or 2 examples that we feel are representative of multi-resolution blends in general.
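A minimal sketch of how the test runner could keep the log active so Travis never sees 10 quiet minutes (the blend list and the `run_blend` fit-and-measure helper are hypothetical):

```python
import time


def run_regression_suite(blends, run_blend):
    """Run every test blend, printing one line per blend so Travis keeps seeing output."""
    results = {}
    for name, images in blends:
        start = time.time()
        metrics = run_blend(images)  # hypothetical fit-and-measure call
        elapsed = time.time() - start
        # flush=True guarantees the line reaches the Travis log immediately
        print(f"[regression] {name}: {elapsed:.1f}s, "
              f"{metrics['iterations']} iterations", flush=True)
        results[name] = metrics
    return results
```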
Do you want to trigger this for every branch or only when merging to master? I can see waiting for 1 h until Travis comes back as being quite annoying.
I want to do it before we merge to master, so that we catch breaking changes before they happen. But I don't think this will take anywhere near 1 h to run; for example, I'm looking for test blends now and we can run the 100 best two-source blends in under 2 minutes. So I figure we'll use those, another 100 or so blends of larger size, and a dozen or so select blends that we know are challenging from visual inspection (I'll visually inspect all of them before including them in the test repo). I hope to have the whole thing run in under 20 minutes, maybe even 10 (assuming that the Travis servers have speeds comparable to lsst-dev). That doesn't seem like too long to wait after pushing a commit to a PR before merging it, compared to the benefit that we get from it.
Fine if it's <~ 10 minutes. Another concern: we already have a minimal regression test on the logL in the quickstart guide, and I understand that you want to expand the test cases. We routinely see changes in the logL, and we usually understand which change caused the different logL. I don't want to be in a situation where we have to update multiple tests every time we see such a change.
So, to clarify your previous comment (you'd like to see the change before it is merged into master): you want a setup where a regression fails a test so that we cannot merge, as opposed to simply running the regression tests and reporting the new master values to SQuaRE. I'm still not a fan of striving for test completeness on our end.
I think that I should clarify what I mean by regression tests. This won't be as hard-coded as checking the value of a logL and failing the test if it is under a certain value, and these won't be tests that "fail." Even the DM stack doesn't go as far as checking those things. My current plan (open to suggestions) is as follows:
For the blends in all three datasets:
The test script will update a data file kept in a new repo, something like
For the special blends in group 3:
So the idea is to give us a way to track how scarlet performs over time on a sample of objects, and it will be the decision of the code reviewers whether or not any loss of accuracy or performance is acceptable. The only thing that will need to change once these tests have been created is the code that executes scarlet with the default parameters, if that API changes at all.
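To make the "reviewer decides" part concrete, here is a small sketch of how a PR's tracked values might be compared against the last recorded master values; the data layout and the 5% highlight threshold are assumptions, not anything that exists in scarlet today:

```python
def compare_to_master(pr_metrics, master_metrics, rel_tol=0.05):
    """Report the fractional change of each tracked quantity relative to master.

    Nothing here fails a test; the report is only meant to inform code review.
    """
    report = {}
    for key, master_value in master_metrics.items():
        new_value = pr_metrics.get(key)
        if new_value is None or master_value == 0:
            continue
        change = (new_value - master_value) / abs(master_value)
        report[key] = {
            "master": master_value,
            "pr": new_value,
            "fractional_change": change,
            "flag": abs(change) > rel_tol,  # highlight for the reviewer, don't fail
        }
    return report
```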
That would be good, yes. It sounds like this wouldn't need changes on our side beyond modifying the Travis script.
I'm not sure of the best way to do this, but PR #173 brought up that we sometimes implement changes that can affect the performance of an algorithm, and right now the only way we test for that is to have me run on the HSC test data after the code has already been merged. The Rubin Observatory DM pipeline does have a method for tracking quantities like the ones I describe in this issue, but it too only does so after the code has been merged. So we won't learn about regressions until master has been updated, which is not what I would like.
Ideally we should have some subset of images where we track things like the maximum absolute value of a residual, the sum of the absolute value of the residual, the loss of the initial model, the total number of iterations, the final photometry, the final colors, etc. We should choose some subset of the HSC images with fake sources injected to run our tests on. The test blends should cover a wide variety of cases: single sources, two-source blends (the majority of the cases for HSC/Rubin Observatory), and multi-source blends with different degrees of blending. Ideally it should take < 10 minutes to run the entire dataset, which, at the current runtime, means about 1000 total sources in all of the blends combined.
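As an illustration (not scarlet's actual API), most of those quantities could be computed directly from the observed images and the final model; the array shapes and the simple flux-ratio stand-in for colors below are assumptions:

```python
import numpy as np


def blend_metrics(images, model, n_iter):
    """Compute the residual and photometry quantities we might track for one blend.

    `images` and `model` are assumed to be (bands, height, width) arrays; the
    per-source photometry/colors would need the individual source models, which
    are omitted here for brevity.
    """
    residual = images - model
    flux = model.sum(axis=(1, 2))   # per-band flux of the full model ("photometry")
    colors = flux / flux.sum()      # a simple stand-in for the final colors
    return {
        "max_abs_residual": float(np.max(np.abs(residual))),
        "sum_abs_residual": float(np.sum(np.abs(residual))),
        "iterations": int(n_iter),
        "flux": flux.tolist(),
        "colors": colors.tolist(),
    }
```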
I'm a bit fuzzy about how to do the next part, but here's what I would like (in a perfect world), just to open up the discussion: each time a PR is made, a script runs scarlet on the set of test blends and generates the quantities that we want to track. Those quantities are added to a data file that keeps track of how each of those quantities has changed with each new PR that is merged, so that we can verify that a PR isn't breaking something.
The tricky part to me is how to do this before the code is merged to master. My first thought (and one that I admit is not very good) is to make a new `scarlet_verification` repo that contains only a csv file to keep track of all the metrics and a docs directory with a single notebook that displays a plot for each tracked quantity over time. We update our Travis script so that each time someone updates a PR in `scarlet` it checks the csv file for an entry with that PR number. If it already exists then it overwrites the entry, otherwise it creates a new entry in the file with the PR number and the measurements for each quantity, then builds the docs for `scarlet_verification`. That way the `scarlet_verification` repo will always give a snapshot of the current values of each quantity, displaying them in plots in the docs, for all of the PRs (and master) in `scarlet`.

Alternatively there might be a way within Rubin Observatory DM to trigger its verification framework with a scarlet PR, in which case we'll get a much better set of tools that can (hopefully) be kept up to date in real time as well. So I'll check on this from my end.
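A sketch of the csv bookkeeping that Travis job could do, assuming one row per (PR, metric) pair; the file name and columns are made up, and `TRAVIS_PULL_REQUEST` is the environment variable Travis sets to the PR number (or `false` for branch builds):

```python
import csv
import os
from pathlib import Path

CSV_PATH = Path("metrics.csv")  # hypothetical single csv file in scarlet_verification
FIELDS = ["pr", "metric", "value"]


def update_verification_csv(measurements):
    """Overwrite this PR's rows if they already exist, otherwise append new ones."""
    pr = os.environ.get("TRAVIS_PULL_REQUEST", "master")
    rows = []
    if CSV_PATH.exists():
        with CSV_PATH.open() as f:
            # Drop any previous rows for this PR so a rerun overwrites its entry.
            rows = [r for r in csv.DictReader(f) if r["pr"] != pr]
    rows += [{"pr": pr, "metric": k, "value": str(v)} for k, v in measurements.items()]
    with CSV_PATH.open("w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(rows)
```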