Modern data cleaning approaches will be presented, explained, and critically reviewed, with a focus on emerging tools for image dataset curation. Automatic detection of data quality issues in ever-growing data collections will be motivated by reviewing contamination in popular benchmarks and by assessing its impact on the training and evaluation of machine learning models. Data cleaning will be shown to be complementary to learning with noise, although it is less widely known. Particular attention will be paid to near-duplicate images, which can lead to train-evaluation data leaks; irrelevant samples, which are invalid within their context; and label errors, which corrupt the learning signal. The major repositories containing resources for data cleaning will be presented with their strengths and weaknesses and used in guided examples, and participants will be encouraged to clean their own datasets in the closing part of the tutorial.
There are several ways to install the libraries needed for this tutorial, depending on your preferences:

- If you use Docker, you can start a Jupyter notebook server with make by running:

  ```bash
  make start_jupyter
  ```

- If you use venvs or want to install everything locally, you can install the requirements with pip:

  ```bash
  pip install -r requirements.txt
  ```

  and then start your Jupyter notebook.

- If you do not want to install anything locally, you can run everything on Google Colab by clicking the button below; remember to change the runtime to GPU.
NOTE: We recommend using Google Colab to run the tutorial. We also provide setups for a virtual environment and for Docker, but we cannot guarantee that these will work on your machine. The local options may be preferable if you do not want to upload your datasets; however, depending on your hardware and internet connection, you may face longer install times, larger disk space requirements, or slower computations.
In the first tutorial, we will see how difficult it can be to clean image datasets the traditional way, i.e. manually. In the subsequent tutorials, we will then examine how much easier this task becomes when relying on data-centric cleaning frameworks.
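To see why manual cleaning scales so badly, consider exhaustive near-duplicate review: every pair of images must be compared, so the effort grows quadratically with dataset size. Below is a minimal back-of-the-envelope sketch; the one-second-per-pair review speed is an illustrative assumption, and 7,349 is the commonly reported size of the Oxford-IIIT Pet dataset:

```python
# Back-of-the-envelope estimate of exhaustive manual near-duplicate review.
# Assumptions: ~7,349 images (commonly reported size of Oxford-IIIT Pet)
# and ~1 second to eyeball each image pair.

n_images = 7_349
seconds_per_pair = 1.0

n_pairs = n_images * (n_images - 1) // 2   # all unordered image pairs
hours = n_pairs * seconds_per_pair / 3600

print(f"{n_pairs:,} pairs -> ~{hours:,.0f} hours of review")
# 27,000,226 pairs -> ~7,500 hours, i.e. years of full-time work
```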
| # | Notebook | Datasets |
|----|----------|----------|
| 00 | **Traditional (manual) data cleaning:** Showcases how manual data cleaning is typically done and calculates the effort required for exhaustive annotation. | 🗃️ Oxford-IIIT Pet, Imagenette, your own |
| 01 | **FastDup:** Learn how to analyze and clean datasets using FastDup, the preferred solution for very large data collections (see the first sketch below the table). | 🗃️ Oxford-IIIT Pet, Imagenette, your own |
| 02 | **CleanLab:** Learn how to analyze and clean datasets using CleanLab (Datalab), the preferred solution for reliable results (see the second sketch below the table). | 🗃️ Oxford-IIIT Pet, Imagenette, your own |
| 03 | **SelfClean:** Learn how to analyze and clean datasets using SelfClean, the preferred solution for small to medium datasets with an emphasis on the highest data quality (see the third sketch below the table). | 🗃️ Oxford-IIIT Pet, Imagenette, your own |
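For the FastDup notebook, the typical flow looks roughly like the following. This is a minimal sketch based on fastdup's v1 API; the directory paths are placeholders, and method names should be checked against the version pinned in `requirements.txt`:

```python
import fastdup

# Point fastdup at a folder of images; analysis artifacts are written
# to work_dir. "images" and "fastdup_work" are placeholder paths.
fd = fastdup.create(work_dir="fastdup_work", input_dir="images")
fd.run()  # extracts embeddings and computes similarities and statistics

# Most similar image pairs (near-duplicate candidates)
similarity_df = fd.similarity()
print(similarity_df.head())

# Outliers (candidates for irrelevant samples)
outliers_df = fd.outliers()
print(outliers_df.head())
```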
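For the CleanLab notebook, Datalab audits a dataset given out-of-sample predicted probabilities and/or feature embeddings from any model. The sketch below uses tiny random stand-ins so it runs end-to-end; with real data you would plug in image embeddings and cross-validated predictions from your own model:

```python
import numpy as np
from cleanlab import Datalab

# Synthetic stand-ins for demonstration only.
rng = np.random.default_rng(0)
features = rng.normal(size=(100, 8))               # stand-in embeddings
labels = rng.integers(0, 3, size=100)              # stand-in class labels
pred_probs = rng.dirichlet(np.ones(3), size=100)   # stand-in out-of-sample probs

# Datalab accepts a dict, DataFrame, or Hugging Face-style dataset.
lab = Datalab(data={"label": labels}, label_name="label")
lab.find_issues(pred_probs=pred_probs, features=features)

lab.report()                                # summary of detected issue types
label_issues = lab.get_issues("label")      # per-sample label-error scores
near_duplicates = lab.get_issues("near_duplicate")
```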
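For the SelfClean notebook, the library wraps self-supervised pre-training and issue ranking behind a single entry point. The sketch below follows the usage shown in the SelfClean README as we recall it; treat the class and method names (`SelfClean`, `run_on_image_folder`, `get_issues`) as assumptions to verify against the installed version:

```python
from selfclean import SelfClean  # names assumed from the project README

selfclean = SelfClean()

# Runs self-supervised pre-training on the image folder and ranks every
# sample for near duplicates, irrelevant samples, and label errors.
issues = selfclean.run_on_image_folder(
    input_path="images",  # placeholder path to your image folder
)

# Ranked near-duplicate candidates as a DataFrame (argument names assumed)
df_near_duplicates = issues.get_issues("near_duplicates", return_as_df=True)
print(df_near_duplicates.head())
```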
For more detailed tips during the hands-on session, consult our dedicated page.