Drift Detection
viewser is designed to be used as a standalone client or as a component in a machine-learning pipeline. A common issue in such pipelines is drift or anomalies in the input data, caused for example by an unannounced change in the API of one of the raw data sources, which leads to erroneous data being imported into the underlying database.
viewser contains machinery to help users detect these anomalies by performing simple analysis on the data retrieved from the service.
Two broad types of anomaly can be monitored: global anomalies and recent-data anomalies.
Global anomaly detectors examine the whole dataset, whatever its dimensions (thought of in terms of time_units x space_units x features). The available anomaly detectors are
-
global_missingness: simply reports if the total fraction of missing (i.e. NaN) values across the whole dataset exceeds a threshold. Threshold should be a small number between 0 and 1, e.g. 0.05.
-
global_zeros: reports if the total fraction of zero values across the whole dataset exceeds a threshold. Threshold should be a small number between 0 and 1, e.g. 0.05.
-
time_missingness: reports if the fraction of missingness across any (space_units x features) slices exceeds a threshold. Threshold should be a small number between 0 and 1, e.g. 0.05.
-
space_missingness: reports if the fraction of missingness across any (time_units x features) slices exceeds a threshold. Threshold should be a small number between 0 and 1, e.g. 0.05.
-
feature_missingness: reports if the fraction of missingness for any feature (over all time and space units) exceeds a threshold. Threshold should be a small number between 0 and 1, e.g. 0.05.
-
time_zeros: reports if the fraction of zeros across any (space_units x features) slices exceeds a threshold. Threshold should be a number between 0 and 1 and close to 1, e.g. 0.95.
-
space_zeros: reports if the fraction of zeros across any (time_units x features) slices exceeds a threshold. Threshold should be a number between 0 and 1 close to 1, e.g. 0.95.
-
feature_zeros: reports if the fraction of zeros for any feature (over all time and space units) exceeds a threshold. Threshold should be a number between 0 and 1 close to 1, e.g. 0.95.
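As a rough illustration of what the missingness checks above compute, here is a minimal pure-Python sketch. It is not viewser's implementation: the nested-list layout (time x space x features) and the helper names `is_missing`, `global_missingness` and `feature_missingness` are assumptions made for this example only.

```python
import math

def is_missing(value):
    """Treat None and NaN as missing (assumption for this sketch)."""
    return value is None or (isinstance(value, float) and math.isnan(value))

def global_missingness(cube):
    """Fraction of missing values over the whole time x space x feature cube."""
    values = [v for time_slice in cube for row in time_slice for v in row]
    return sum(is_missing(v) for v in values) / len(values)

def feature_missingness(cube):
    """Fraction of missing values per feature, over all time and space units."""
    n_features = len(cube[0][0])
    fractions = []
    for f in range(n_features):
        column = [row[f] for time_slice in cube for row in time_slice]
        fractions.append(sum(is_missing(v) for v in column) / len(column))
    return fractions

nan = float("nan")
# Toy data: 2 time units x 2 space units x 2 features
cube = [
    [[1.0, nan], [2.0, 0.0]],
    [[3.0, nan], [4.0, 5.0]],
]
threshold = 0.05

# A detector "fires" when the fraction exceeds the threshold
print(global_missingness(cube) > threshold)
print([frac > threshold for frac in feature_missingness(cube)])
```

The zeros detectors follow the same pattern with a test for `value == 0` in place of `is_missing`, and the time and space variants slice along the other two axes.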
Recent-data anomaly detectors split the dataset into three partitions, defined by two integers n and m. If the most recent time unit in the dataset is k, the test partition consists of the most recent n time units, i.e. k-n+1 to k inclusive (usually n is 1, so the test partition is simply the most recent time unit k), and the standard partition consists of the time units from k-m-n to k-n. The time units before k-m-n are discarded. The available anomaly detectors are
-
delta_completeness: reports, for each feature, if the ratio of missingness fractions in the test and standard partitions is greater than a threshold. Threshold should be a number between 0 and 1, e.g. 0.25.
-
delta_zeroes: reports, for each feature, if the ratio of the fraction of zeros in the test and standard partitions is greater than a threshold. Threshold should be a number between 0 and 1, e.g. 0.25.
-
extreme_values: reports, for each feature, if the most extreme value in the test partition is more than (threshold) standard deviations from the mean of the data in the test partition. Threshold should be a number in the range 2-7, e.g. 5.
-
ks_drift: for each feature, performs a two-sample Kolmogorov-Smirnov test (https://en.wikipedia.org/wiki/Kolmogorov–Smirnov_test#Two-sample_Kolmogorov–Smirnov_test) between the data in the test and standard partitions and reports if (1/the returned p-value) exceeds a threshold. Threshold should be a large number, e.g. 100.
-
ecod_drift: for all features simultaneously, reports if the fraction of data-points considered outliers in the test partition exceeds that in the standard partition by more than a threshold, according to an ECOD model (https://pyod.readthedocs.io/en/latest/_modules/pyod/models/ecod.html#ECOD) trained on the standard partition. Threshold should be a number between 0 and 1, e.g. 0.25.
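The partitioning and the ratio-style checks can be sketched as follows. This is only an illustration under stated assumptions: the helpers `partition_time_units`, `zero_fraction` and `fraction_ratio` are hypothetical, the standard partition is taken as k-m-n to k-n with both endpoints included, and the handling of a zero denominator is a choice made here, not something the text specifies.

```python
def partition_time_units(k, n, m):
    """Split time units into test and standard partitions.

    k: most recent time unit, n: test partition length,
    m: standard partition length parameter.
    Assumption: both endpoints of each range are inclusive.
    """
    test = list(range(k - n + 1, k + 1))
    standard = list(range(k - m - n, k - n + 1))
    return test, standard

def zero_fraction(values):
    """Fraction of values that are exactly zero."""
    return sum(v == 0 for v in values) / len(values)

def fraction_ratio(test_values, standard_values):
    """Ratio of the zero fraction in the test partition to that in the standard one."""
    test_frac = zero_fraction(test_values)
    std_frac = zero_fraction(standard_values)
    if std_frac == 0:
        # Assumption: treat any zeros appearing from nothing as an unbounded ratio
        return float("inf") if test_frac > 0 else 0.0
    return test_frac / std_frac

test_units, standard_units = partition_time_units(k=100, n=1, m=30)
print(test_units, standard_units[0], standard_units[-1])

# Toy values for one feature in each partition
ratio = fraction_ratio([0, 0, 1, 0], [0, 1, 1, 1, 0, 1, 1, 1])
print(ratio)
```

A detector built on this would compare `ratio` with its configured threshold; the exact comparison used inside viewser may differ.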
The drift detection machinery is run using an alternative to the data=my_queryset.fetch() method, namely
data,alerts=my_queryset.fetch_with_drift_detection(start_date=m,end_date=n,drift_config_dict=drift_config_dict)
The start_date and end_date parameters must be specified; in particular, the recent-data anomaly detectors rely on knowing which is the most recent time-unit containing data.
The drift_config_dict must also be specified. It is the means by which users configure which drift-detectors to use, what thresholds should be set, and how long the test and standard partitions used to assess recent-data anomalies should be.
The full form of this dictionary is
drift_config_dict={
    'global_missingness': {'threshold': 0.01},
    'time_missingness': {'threshold': 0.05},
    'space_missingness': {'threshold': 0.05},
    'feature_missingness': {'threshold': 0.05},
    'global_zeros': {'threshold': 0.99},
    'time_zeros': {'threshold': 0.95},
    'space_zeros': {'threshold': 0.98},
    'feature_zeros': {'threshold': 0.95},
    'delta_completeness': {'threshold': 0.05},
    'delta_zeroes': {'threshold': 0.10},
    'ks_drift': {'threshold': 50},
    'extreme_values': {'threshold': 3.0},
    'standard_partition_length': 30,
    'test_partition_length': 1
}
Unwanted detection functions can simply be deleted from the dictionary and they will not be run.
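Rather than editing the dictionary by hand, the selection can also be done programmatically. The sketch below uses a shortened copy of the dictionary and a hypothetical helper, `keep_detectors`; keeping the two partition-length keys in every case is an assumption made here, not a documented requirement.

```python
# Keys that set partition lengths rather than naming a detector.
# Assumption: they are always kept, whichever detectors are selected.
PARTITION_KEYS = {"standard_partition_length", "test_partition_length"}

# Shortened copy of the configuration, for illustration
full_config = {
    "global_missingness": {"threshold": 0.01},
    "feature_missingness": {"threshold": 0.05},
    "delta_completeness": {"threshold": 0.05},
    "ks_drift": {"threshold": 50},
    "standard_partition_length": 30,
    "test_partition_length": 1,
}

def keep_detectors(config, wanted):
    """Return a new dict with only the wanted detectors (hypothetical helper)."""
    return {
        key: value
        for key, value in config.items()
        if key in wanted or key in PARTITION_KEYS
    }

drift_config_dict = keep_detectors(full_config, {"global_missingness", "ks_drift"})
print(sorted(drift_config_dict))
```

Building a new dictionary leaves `full_config` untouched, so the same base configuration can be reused for several querysets.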
The data and alerts are fetched by executing
data,alerts=my_queryset.fetch_with_drift_detection(start_date=m,end_date=n,drift_config_dict=drift_config_dict)
To access the alerts, simply do
for alert in alerts:
    print(alert)
Output will typically look something like (depending on which functions have been selected)
dataset zero passed
time-unit zero passed
feature zero passed
feature delta_completeness passed
[Input alarm: feature delta_zeroes; offender: ln_ged_sb, threshold: 0.01 Severity: 5 Timestamp: 2024-05-02 11:48:31
]
feature KS drift passed
feature extreme values passed
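In a pipeline it may be convenient to separate checks that passed from those that raised an alarm. The sketch below works purely on the printed form of each alert, assuming that `str(alert)` ends in "passed" for a passing check as in the output above; the helper `split_alerts` and the stand-in strings are hypothetical, and the real alert objects may expose a more direct way to do this.

```python
def split_alerts(alerts):
    """Separate alerts into passed checks and raised alarms by their text."""
    passed, raised = [], []
    for alert in alerts:
        text = str(alert).strip()
        if text.endswith("passed"):
            passed.append(text)
        else:
            raised.append(text)
    return passed, raised

# Stand-in strings copied from the example output, not real alert objects
example_alerts = [
    "dataset zero passed",
    "feature delta_completeness passed",
    "[Input alarm: feature delta_zeroes; offender: ln_ged_sb, "
    "threshold: 0.01 Severity: 5 Timestamp: 2024-05-02 11:48:31\n]",
]

passed, raised = split_alerts(example_alerts)
print(len(passed), "checks passed")
for alarm in raised:
    print("ALARM:", alarm)
```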