This repository has been archived by the owner on Jan 7, 2025. It is now read-only.

Standard Datasets

Overview

DIGITS can download some standard datasets and store them locally in the format that it expects (see Image Folder Format for a detailed explanation). Once these folders are created, you can use them to create your datasets in DIGITS.

HTML Form

$ python -m digits.download_data -h
usage: __main__.py [-h] [-c] dataset output_dir

Download-Data tool - DIGITS

positional arguments:
  dataset      mnist/cifar10/cifar100
  output_dir   The output directory for the data

optional arguments:
  -h, --help   show this help message and exit
  -c, --clean  Clean out the directory first (if necessary)
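
If you want to script the download, the call can be wrapped in a few lines of Python. The following is a minimal sketch, not part of DIGITS: the helper names `build_download_command` and `download` are hypothetical, and it assumes that `python` on your `PATH` is the interpreter where DIGITS is installed. It uses only the arguments shown in the help text above.

```python
import subprocess

# Dataset names taken from the help text of digits.download_data.
KNOWN_DATASETS = ("mnist", "cifar10", "cifar100")


def build_download_command(dataset, output_dir, clean=False, python="python"):
    """Return the argument list for `python -m digits.download_data`.

    Hypothetical helper: it only assembles the command, it does not run it.
    """
    if dataset not in KNOWN_DATASETS:
        raise ValueError(f"unknown dataset: {dataset!r}")
    cmd = [python, "-m", "digits.download_data"]
    if clean:
        cmd.append("--clean")  # long form of -c: clean out the directory first
    cmd += [dataset, output_dir]
    return cmd


def download(dataset, output_dir, clean=False):
    """Run the download tool; this requires DIGITS to be importable."""
    subprocess.run(build_download_command(dataset, output_dir, clean), check=True)
```

Keeping command construction separate from execution makes it possible to inspect or log the command before anything is downloaded.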

MNIST

Yann LeCun provides a dataset of 28x28 grayscale images of handwritten digits. You can read all about it here: http://yann.lecun.com/exdb/mnist/

The MNIST database of handwritten digits, available from this page, has a training set of 60,000 examples, and a test set of 10,000 examples. It is a subset of a larger set available from NIST. The digits have been size-normalized and centered in a fixed-size image.

It is a good database for people who want to try learning techniques and pattern recognition methods on real-world data while spending minimal efforts on preprocessing and formatting.

Run this:

$ python -m digits.download_data mnist ~/mnist

And these folders and files will be created for you (images and temporary files omitted):

mnist/
├── train/
│   ├── 0/
│   ├── 1/
│   ├── 2/
│   ├── 3/
│   ├── 4/
│   ├── 5/
│   ├── 6/
│   ├── 7/
│   ├── 8/
│   ├── 9/
│   ├── labels.txt
│   └── train.txt
└── test/
    ├── 0/
    ├── ...
    ├── 9/
    ├── labels.txt
    └── test.txt

Then, you can use ~/mnist/train for your training images and ~/mnist/test for your validation or test images.
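
To sanity-check the result before creating a dataset, you can count the files in each class folder. This is a rough sketch that relies only on the folder layout shown above; `count_images_per_class` is a hypothetical helper, and it treats every regular file inside a class folder as an image.

```python
from pathlib import Path


def count_images_per_class(split_dir):
    """Map each class folder name to the number of files it contains."""
    counts = {}
    for class_dir in sorted(Path(split_dir).expanduser().iterdir()):
        # Only class folders are counted; labels.txt and train.txt/test.txt are skipped.
        if class_dir.is_dir():
            counts[class_dir.name] = sum(1 for p in class_dir.iterdir() if p.is_file())
    return counts
```

For example, `count_images_per_class("~/mnist/train")` should return one entry for each of the ten digit folders.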

CIFAR

Alex Krizhevsky provides two datasets of 32x32 color images. You can read all about them here: http://www.cs.toronto.edu/~kriz/cifar.html

CIFAR10

The CIFAR-10 dataset consists of 60000 32x32 colour images in 10 classes, with 6000 images per class. There are 50000 training images and 10000 test images.

Run this:

$ python -m digits.download_data cifar10 ~/cifar10

And these folders and files will be created for you (images and temporary files omitted):

cifar10/
├── train/
│   ├── airplane/
│   ├── automobile/
│   ├── bird/
│   ├── cat/
│   ├── deer/
│   ├── dog/
│   ├── frog/
│   ├── horse/
│   ├── ship/
│   ├── truck/
│   ├── labels.txt
│   └── train.txt
└── test/
    ├── airplane/
    ├── ...
    ├── truck/
    ├── labels.txt
    └── test.txt

Then, you can use ~/cifar10/train for your training images and ~/cifar10/test for your validation or test images.
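
If a download was interrupted, some class folders may be missing. The sketch below checks a split directory against the ten class names listed in the tree above; `missing_class_dirs` is a hypothetical helper and is not part of DIGITS.

```python
from pathlib import Path

# Class folder names as listed in the cifar10 directory tree.
CIFAR10_CLASSES = [
    "airplane", "automobile", "bird", "cat", "deer",
    "dog", "frog", "horse", "ship", "truck",
]


def missing_class_dirs(split_dir, expected=CIFAR10_CLASSES):
    """Return the expected class folders that are absent from split_dir."""
    split_dir = Path(split_dir).expanduser()
    return [name for name in expected if not (split_dir / name).is_dir()]
```

An empty list from `missing_class_dirs("~/cifar10/train")` means all ten class folders are present.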

CIFAR100

This dataset is just like the CIFAR-10, except it has 100 classes containing 600 images each. There are 500 training images and 100 testing images per class. The 100 classes in the CIFAR-100 are grouped into 20 superclasses. Each image comes with a "fine" label (the class to which it belongs) and a "coarse" label (the superclass to which it belongs).

Run this:

$ python -m digits.download_data cifar100 ~/cifar100

And these folders and files will be created for you (images and temporary files omitted):

cifar100/
├── coarse/
│   ├── train/
│   │   └── ...
│   ├── test/
│   │   └── ...
│   ├── labels.txt
│   ├── test.txt
│   └── train.txt
└── fine/
    ├── train/
    │   └── ...
    ├── test/
    │   └── ...
    ├── labels.txt
    ├── test.txt
    └── train.txt

If you want to use the coarse dataset (20 classes), use ~/cifar100/coarse/train and ~/cifar100/coarse/test.

If you want to use the fine dataset (100 classes), use ~/cifar100/fine/train and ~/cifar100/fine/test.
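
The choice between the two label sets can be expressed as a small path helper. This is only an illustrative sketch: `cifar100_dirs` is a hypothetical name, and the class counts come from the CIFAR-100 description above (20 superclasses, 100 classes).

```python
from pathlib import Path

# Number of classes per label granularity, from the CIFAR-100 description.
CIFAR100_CLASS_COUNTS = {"coarse": 20, "fine": 100}


def cifar100_dirs(root, granularity):
    """Return (train_dir, test_dir) for the chosen label granularity."""
    if granularity not in CIFAR100_CLASS_COUNTS:
        raise ValueError("granularity must be 'coarse' or 'fine'")
    base = Path(root).expanduser() / granularity
    return base / "train", base / "test"
```

For example, `cifar100_dirs("~/cifar100", "fine")` returns the two folders to use for the 100-class dataset.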