Notebook with the necessary instructions and code to clean your computer vision dataset

drewm23/dataset-cleaner

 
 

dataset-cleaner

Notebook with instructions and code to clean your dataset using interactive widgets in Jupyter Notebook.

This project consists of a dataset cleaner in Jupyter Notebook with the following functionalities:

  • Image Relabeler and Deleter: suggests mislabeled images and gives the user the option to relabel them (by Zach Caceres with help from Jason Patnick) or directly delete them from the dataset.

Let's see an example in CIFAR. This doesn't look like a truck, does it? Relabel/delete away!
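How the notebook picks its suggestions is not described here. One common way to surface likely mislabeled images is to rank samples by the loss a trained classifier assigns to their current label. The sketch below assumes you already have per-image class probabilities; `suggest_mislabeled`, `probs`, and `labels` are hypothetical names used only for illustration.

```python
import math


def suggest_mislabeled(probs, labels, k=3):
    """Return indices of the k samples with the highest cross-entropy loss.

    probs  -- list of per-class probability lists, one per image (assumed input)
    labels -- list of integer class labels currently assigned to each image
    """
    eps = 1e-12  # avoid log(0) when the model gives a label zero probability
    losses = [-math.log(max(p[y], eps)) for p, y in zip(probs, labels)]
    # Highest loss first: the model disagrees most with these labels.
    ranked = sorted(range(len(losses)), key=lambda i: losses[i], reverse=True)
    return ranked[:k]


# Toy example: 4 images, 3 classes.
probs = [
    [0.90, 0.05, 0.05],
    [0.10, 0.80, 0.10],
    [0.02, 0.03, 0.95],  # labeled as class 0, but the model favors class 2
    [0.30, 0.40, 0.30],
]
labels = [0, 1, 0, 1]

candidates = suggest_mislabeled(probs, labels, k=2)
print(candidates)
```

A high loss only flags a candidate. The image may simply be a hard example, so the final relabel or delete decision still belongs to the person reviewing it.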

  • Image Duplicate Deleter: suggests potential duplicates in the dataset and gives the user the option to delete them (by Francisco Ingham)

Let's see a few examples in CIFAR. Found a duplicate? Delete away!

Just be careful that the images you are deleting are actual duplicates and not augmented images like this one:

How different can two images be and still count as duplicates? That's a good question, and the decision is yours. If one image is augmented many times and the others are not, the model will weigh that image disproportionately against the others, which is not what you want (you don't want the model to favor one type of frog because it saw 5 images of it vs. 1 image of each of the other types). However, augmentation is generally a good thing, since it enriches the dataset and reduces overfitting. Conclusion: if you can have augmentation without excessively augmenting one image over the others, keep it! Otherwise, delete some of the oversampled examples.
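To make the "how much difference" question concrete, here is a minimal sketch that flags pairs of images whose mean absolute pixel difference falls below a threshold you choose. It is not the notebook's actual detection method. The function names, the flat-list image representation, and the default threshold are assumptions for illustration.

```python
from itertools import combinations


def mean_abs_diff(a, b):
    """Mean absolute difference between two equally sized flat pixel lists."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def find_potential_duplicates(images, threshold=5.0):
    """Return (i, j, distance) for every pair at or below the threshold."""
    pairs = []
    for i, j in combinations(range(len(images)), 2):
        if len(images[i]) != len(images[j]):
            continue  # different sizes: skip rather than compare
        d = mean_abs_diff(images[i], images[j])
        if d <= threshold:
            pairs.append((i, j, d))
    return pairs


# Toy "images" as flat lists of pixel intensities (0-255).
images = [
    [10, 200, 30, 40],  # original
    [11, 199, 30, 41],  # near copy of the original
    [40, 30, 200, 10],  # the original reversed, standing in for a flip
    [250, 250, 0, 0],   # unrelated image
]

pairs = find_potential_duplicates(images, threshold=5.0)
print(pairs)
```

With a tight threshold, only the near copy is flagged and the flipped version is left alone. Raising the threshold flags more pairs, including some that are unrelated images, so review each suggestion before deleting.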
