[GSK-1279] More rigorous evaluation of significance of performance metrics #1162

mattbit · 2023-06-09T09:42:58Z

Following the feedback by user KD_A on reddit. They recommend more sound handling of statistical significance to prevent selection bias, in particular using a Benjamini-Hochberg procedure to control the false discovery rate.

The problem is that we currently test several data slice candidates + metric without accounting for selection bias → this can lead to a high number of false positive detections.

To do

Add simple stat tests to the current implementation and measure the significance on the test models already in pytest fixtures → do we have detections with high p-values?
If we do, check if we can set a FPR parameter in PerformanceBiasDetector and filter the detections based on their p-value with Benjamini-Hochberg procedure.

_{From SyncLinear.com | GSK-1279}

The text was updated successfully, but these errors were encountered:

mattbit · 2023-06-23T13:34:31Z

This was mostly adressed in #1193, althought the Benjamini–Hochberg procedure is not enabled by default (because statistical tests on metrics like balanced accuracy pose problems).

mattbit · 2023-10-18T17:09:02Z

Not completed yet

kddubey · 2024-04-30T06:27:33Z

Hello,

It's KD_A from Reddit. I purged my account recently, so the linked Reddit comment is no longer available. Posting it and the next reply here for posterity:

First reply

Thanks for the response.

I realized I misphrased the problem as multiple testing. It's more accurate to categorize it as selection bias: if 100s of slice+metric combinations are examined, then the observed worst n drops from the global average (where n is kind of small) are likely overestimates. The degree of overestimation gets worse as the rank of the drop gets closer to 1. See the intro of this paper (which also contains a bias-corrected estimator):

Efron, Bradley. "Tweedie’s formula and selection bias." Journal of the American Statistical Association 106.496 (2011): 1602-1614.

Given this fact, my main concern as a user would be how much I should trust the alerts. Have Giskard's alerts and estimates been empirically evaluated? For example, for alerts, what's the probability that a drop is practically significant/worrisome given that Giskard alerted on it? One way to answer this question is to split off another large test set, and evaluate Giskard's alerts (from an independent test set) on it.

the statistical significance is always pretty high

2 potential concerns:

In making this determination, were p-values examined before or after correcting for multiple comparisons? Correction methods can greatly increase p-values; they can turn many significant findings into insignificant ones. So it'd be important to make this determination after correction.
When running these tests, was the null value the global average, and the alternative hypothesis that the drop is less than the global average? This may not be the right test to run if the user only cares about slices where the drop is "practically significantly" worse. For example, for a global accuracy of 0.78, it's reasonable that a user only cares about a drop which is at least 0.28 b/c that's worse than 0.5 accuracy. Testing for slice accuracy < 0.5 will result in much higher p-values than testing for slice accuracy < 0.78.

I'm not advocating for displaying hypothesis test results to users. But I do think that running good testing procedures in the background will help in filtering out false alerts.

When I started working on this, I thought measuring significance (and thus handling multiple comparison) would be a major concern and started looking in things like alpha spending/alpha investing to control false positives.

In case you end up going down this route again, the Benjamini-Hochberg procedure is a super easy and fast way to control the false discovery/alert rate. It seems more applicable to Giskard than sequential correction procedures.

Second reply

If you have better recommendations on how to improve this while keeping it simple, I’m definitely interested.

A test for relative difference in (mean) score could work. Assuming higher scores are better:

H0: (complement score - slice score)/(complement score) = 1/5

H1: (complement score - slice score)/(complement score) > 1/5

The null value, 1/5, was chosen assuming that the user only cares about differences where the model performs 80% as well (or worse) on the selected slice as it does on the complement. Feel free to decrease it to e.g., 1/10, b/c there's some tolerance for false positives.

Avoid worrying about analytically computing the distribution of the test statistic by running a permutation test. All you have to do is supply a function which computes the relative difference in means as the statistic to scipy.stats.permutation_test. Here's an example I just wrote for the accuracy metric.

Everything else you mentioned makes sense. Thank you for the discussion!

mattbit self-assigned this Jun 9, 2023

mattbit added the scan Created by Linear-GitHub Sync label Jun 9, 2023

luca-martial closed this as completed Oct 18, 2023

mattbit reopened this Oct 18, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[GSK-1279] More rigorous evaluation of significance of performance metrics #1162

[GSK-1279] More rigorous evaluation of significance of performance metrics #1162

mattbit commented Jun 9, 2023 •

edited

Loading

mattbit commented Jun 23, 2023

mattbit commented Oct 18, 2023

kddubey commented Apr 30, 2024

[GSK-1279] More rigorous evaluation of significance of performance metrics #1162

[GSK-1279] More rigorous evaluation of significance of performance metrics #1162

Comments

mattbit commented Jun 9, 2023 • edited Loading

To do

mattbit commented Jun 23, 2023

mattbit commented Oct 18, 2023

kddubey commented Apr 30, 2024

mattbit commented Jun 9, 2023 •

edited

Loading