This repository contains the code to reproduce all evaluations in the paper "Intrinsic Self-Supervision for Data Quality Audits". It builds on the SelfClean package by including evaluation scenarios and competing methods.
Context-aware self-supervised learning combined with distance-based indicators is highly effective at identifying data quality issues in computer-vision datasets.
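As a rough illustration of this workflow, the sketch below shows how the underlying SelfClean package can be applied to a plain image folder. It is a minimal sketch, not one of the reproduction scripts; the entry point `run_on_image_folder`, its parameters, and the issue-type string `"near_duplicates"` follow the package's documented high-level interface but should be treated as assumptions that may differ between versions.

```python
# Minimal sketch of a SelfClean-based data quality audit on an image folder.
# The method and argument names below follow the package's documented
# high-level interface and are assumptions, not part of this repository.
from selfclean import SelfClean

selfclean = SelfClean()

# Pre-trains a self-supervised encoder on the dataset itself and ranks
# samples with distance-based indicators for the three issue types
# (off-topic samples, near duplicates, label errors).
issues = selfclean.run_on_image_folder(
    input_path="path/to/image_folder",
)

# Inspect the highest-ranked candidates for one issue type.
near_duplicates = issues.get_issues("near_duplicates")
print(near_duplicates)
```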
Run `make` for a list of possible targets.
Run this command for installation:
make install
To reproduce our experiments, we list below the detailed commands needed to replicate each experiment. Note that our experiments were run on a DGX Workstation 1; if less computational power is available, the configuration files will need to be adapted accordingly.
Comparison on data quality issues (i.e. Table 1 and Table 10):
python -m src.evaluate_synthetic --config configs/evaluation.yaml
Influence of contamination (i.e. Figure 3):
python -m src.evaluate_mixed_contamination --config configs/evaluation.yaml
Comparison with metadata (i.e. Table 11):
python -m src.evaluate_metadata --config configs/metadata_comparison.yaml
Comparison with human annotators (i.e. Table 15 and Figure 9):
python -m src.evaluate_human_annotators
Influence of contamination (i.e. Table 2, 6, 7, 8, 9 and Figure 4, 5):
python -m src.evaluate_mixed_contamination --config configs/evaluation.yaml
Note: This requires changes to `configs/evaluation.yaml`, namely `pretraining_type` and `model_config`.
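If several variants are needed, the configuration can also be adjusted programmatically before rerunning the command. The snippet below is a small sketch assuming PyYAML is installed and that `pretraining_type` and `model_config` are top-level keys of `configs/evaluation.yaml`; the values shown are placeholders, not the settings used in the paper.

```python
# Sketch: create a config variant with a different pretraining type / model.
# Assumes `pretraining_type` and `model_config` are top-level keys of
# configs/evaluation.yaml; the values below are placeholders.
from pathlib import Path

import yaml

config = yaml.safe_load(Path("configs/evaluation.yaml").read_text())
config["pretraining_type"] = "imagenet"  # placeholder value
config["model_config"] = "vit_tiny"      # placeholder value

Path("configs/evaluation_variant.yaml").write_text(yaml.safe_dump(config))
# Rerun with: python -m src.evaluate_mixed_contamination --config configs/evaluation_variant.yaml
```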
Influence of dataset cleaning (i.e. Table 3, 13):
python -m src.cleaning_performance --config configs/cleaning_performance.yaml
Code and test conventions:

- `black` for code style
- `isort` for import sorting
- docstring style: `sphinx`
- `pytest` for running tests
To set up your dev environment, run:
pip install -r requirements.txt
# Install pre-commit hook
pre-commit install
To run all the linters on all files:
pre-commit run --all-files