Reference Drift Metrics #426

emrynHofmannElephant · 2024-10-04T07:36:41Z

When calculating univariate drift, you "fit" the drift calculator on the reference data. How are the drift metrics for the chunks of the reference data then calculated? Are they compared to the overall distribution of the reference data?

jakubnml · 2024-10-04T07:59:38Z

Yes, that's how it is done currently, and we are aware it is not the optimal way. Good job on spotting that though 👏

So the correct way is: when calculating a drift metric for a chunk that is a subset of the reference data, the observations that belong to that chunk should be "removed" from the reference data used for the comparison, just like in cross-validation. Otherwise some of the drift metrics come out lower than they really should, because one dataset (the reference chunk) is a subset of the other (the whole reference). As a result, in an extreme situation, one may have perfectly iid data, yet the drift metrics on the reference chunks will be lower than on the monitored (analysis) data, even though with iid data they shouldn't be.
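To make the bias concrete, here is a minimal, library-independent sketch (it does not use NannyML; `ks_statistic` is a hand-written helper for illustration only). It compares each chunk of an iid sample against the whole reference and against the reference with that chunk removed. For the Kolmogorov-Smirnov statistic specifically, the reference ECDF is a weighted mix of the chunk's ECDF and the rest, so the in-sample value works out to exactly (1 - m/n) times the held-out value, where m is the chunk size and n the reference size. Other drift metrics are biased too, but not necessarily by such a clean factor.

```python
import random
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic: largest gap between the two ECDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    # The ECDF gap only changes at observed values, so checking those is enough.
    return max(
        abs(bisect_right(a, x) / len(a) - bisect_right(b, x) / len(b))
        for x in a + b
    )


rng = random.Random(0)
reference = [rng.gauss(0.0, 1.0) for _ in range(5000)]  # iid by construction
chunk_size = 500

in_sample, held_out = [], []
for start in range(0, len(reference), chunk_size):
    chunk = reference[start:start + chunk_size]
    rest = reference[:start] + reference[start + chunk_size:]
    in_sample.append(ks_statistic(chunk, reference))  # chunk is part of the baseline
    held_out.append(ks_statistic(chunk, rest))        # chunk removed from the baseline

for d_in, d_out in zip(in_sample, held_out):
    print(f"in-sample={d_in:.4f}  held-out={d_out:.4f}  ratio={d_in / d_out:.3f}")
```

With 10 equal chunks the ratio is 0.9 for every chunk, so the smaller the number of reference chunks, the stronger the downward bias.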

We plan to fix this, either by enforcing the new, correct way or by making it the default while keeping the old way as an option, since it may sometimes be beneficial because of its lower computational cost. I can't say exactly when, because our current focus is on research related to performance estimation methods.

Until we fix it, if you really want, you can hack it on your own by fitting the calculator multiple times, each time on a subset of the reference data that does not contain the reference chunk of interest.
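The loop for that workaround could look roughly like the sketch below. It is deliberately generic: `leave_chunk_out`, `reference_chunk_drift`, `fit_mean` and `score_mean_shift` are all hypothetical names, and the toy mean-shift metric only stands in for a real drift method. In practice `fit` would be replaced by fitting a drift calculator on the reduced reference and `score` by calculating drift on the held-out chunk.

```python
from statistics import mean


def leave_chunk_out(rows, chunk_size):
    """Yield (chunk, rest) pairs where rest is the reference without that chunk."""
    for start in range(0, len(rows), chunk_size):
        yield rows[start:start + chunk_size], rows[:start] + rows[start + chunk_size:]


def reference_chunk_drift(rows, chunk_size, fit, score):
    """Refit on each reduced reference, then score only the held-out chunk."""
    return [score(fit(rest), chunk) for chunk, rest in leave_chunk_out(rows, chunk_size)]


# Toy stand-ins (hypothetical): "fitting" stores the baseline mean,
# "scoring" is the absolute mean shift of the chunk against that baseline.
def fit_mean(baseline):
    return mean(baseline)


def score_mean_shift(fitted_mean, chunk):
    return abs(mean(chunk) - fitted_mean)


reference = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
drift = reference_chunk_drift(reference, 2, fit_mean, score_mean_shift)
print(drift)  # [3.0, 0.0, 3.0]
```

Note the cost: the calculator is refit once per reference chunk, which is the extra computation the current in-sample approach avoids.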

nnansters added the enhancement New feature or request label Oct 4, 2024
