
Simpler dataset handling
baldassarreFe committed May 16, 2020
1 parent f7f4b9a commit 1c2df17
Showing 8 changed files with 175 additions and 214 deletions.
70 changes: 32 additions & 38 deletions DATASET.md
@@ -15,7 +15,7 @@
continuous [TFRecords](https://www.tensorflow.org/programmers_guide/datasets).
All the data preparation steps are independent and persisted on disk; the default (and recommended) folder structure is:

```
~/imagenet
./data
├── fall11_urls.txt
├── imagenet1000_clsid_to_human.pkl
├── inception_resnet_v2_2016_08_30.ckpt
@@ -24,68 +24,61 @@
└── tfrecords
```

### Imagenet labels
### Imagenet images
To download the images from ImageNet, we provide a script that takes as input a file containing image URLs (one per line).
You are not restricted to images from ImageNet: you can pass any list of URLs to the script and it will take care of the download.
See [unsplash.txt](data/unsplash.txt) for an example.
Also, if you already have a collection of images for training, place them in `data/original` and skip to the next step.

Get ImageNet human-readable labels.
> **Note:**
> There used to be a [file](http://www.image-net.org/imagenet_data/urls/imagenet_fall11_urls.tgz)
> containing the image URLs for ImageNet 2011 available without registration on the
> [official website](http://image-net.org/download-imageurls).
> Since the link appears to be down, you may want to use this
> [non-official file](http://github.com/akando42/1stPyTorch/blob/master/fall11_urls.txt) instead.
```bash
$ wget https://gist.githubusercontent.com/yrevar/6135f1bd8dcf2e0cc683/raw/d133d61a09d7e5a3b36b8c111a8dd5c4b5d560ee/imagenet1000_clsid_to_human.pkl
wget -O 'data/imagenet_fall11_urls.txt' 'https://github.com/akando42/1stPyTorch/raw/master/fall11_urls.txt'
python -m koalarization.dataset.download 'data/imagenet_fall11_urls.txt' 'data/original'
```
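For reference, the download step boils down to fetching every URL in the list and saving the response bodies as image files. A minimal sketch under that assumption (the function names and the `img_0000000.jpeg` naming scheme are illustrative, not the actual `koalarization.dataset.download` implementation):

```python
from pathlib import Path
from urllib.request import urlopen


def read_url_list(path):
    """Return the non-empty, stripped URLs from a file with one URL per line."""
    lines = Path(path).read_text().splitlines()
    return [ln.strip() for ln in lines if ln.strip()]


def download_images(url_file, out_dir):
    """Download every URL in the list, silently skipping dead links."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for i, url in enumerate(read_url_list(url_file)):
        try:
            data = urlopen(url, timeout=10).read()
        except OSError:
            continue  # dead links are common in old URL lists
        (out / "img_{:07d}.jpeg".format(i)).write_bytes(data)
```

Expect a fair number of skipped entries: many URLs in the fall 2011 list no longer resolve.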

### Getting the images from Imagenet
To download the ImageNet dataset, we provide a script which requires an input `txt` file containing the URLs to the images.
The download script also accepts a URL as `source`, but downloading the URL file separately
and passing it as `path/to/urls.txt` is highly recommended.
Use `-h` to see all available options.

> **Note**: Previously, there was a file containing the URLs to all images from ImageNet 2011 dataset on the [official
> website](http://www.image-net.org/imagenet_data/urls/imagenet_fall11_urls.tgz), but it is currently down. You may want to use this [non-official
> file](http://github.com/akando42/1stPyTorch/blob/master/fall11_urls.txt) instead.

```bash
$ python -m koalarization.dataset.download <args>
```

Passing `-s path/to/fall11_urls.txt` is **highly recommended** over passing a url.

Use `-h` to see the available options

### Resizing the images for the model
To be able to train in batches, we resize all images (in particular, we use shape _299 x 299_). Use the following script to achieve this:
### Resizing for training
To be able to train in batches, we resize all images to `299x299`.
Use the following script to achieve this:

```bash
$ python -m koalarization.dataset.resize <args>
python -m koalarization.dataset.resize 'data/original' 'data/resized'
```
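The resize step itself is straightforward. A minimal sketch using Pillow (an assumption for illustration: the real `koalarization.dataset.resize` script may handle formats, naming, and aspect ratios differently):

```python
from pathlib import Path

from PIL import Image


def resize_folder(src_dir, dst_dir, size=(299, 299)):
    """Resize every readable image in src_dir to `size` and save it in dst_dir."""
    dst = Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    for path in sorted(Path(src_dir).iterdir()):
        try:
            with Image.open(path) as img:
                # Convert to RGB so grayscale/palette inputs become 3-channel jpegs
                img.convert("RGB").resize(size).save(dst / path.name, "JPEG")
        except OSError:
            continue  # skip files that are not valid images
```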

Use `-h` to see the available options

### Converting to TFRecords

First download the pretrained Inception model for feature extraction, then use the `lab_batch` script to process all images from the resized folder:
```bash
$ python -O -m koalarization.dataset.lab_batch <args>
wget -O - 'http://download.tensorflow.org/models/inception_resnet_v2_2016_08_30.tar.gz' | tar -xzv -C 'data'
python -m koalarization.dataset.lab_batch -c 'data/inception_resnet_v2_2016_08_30.ckpt' 'data/resized' 'data/tfrecords'
```

Passing `-c path/to/inception_resnet_v2_2016_08_30.ckpt` is highly recommended
over passing a URL. To download the checkpoint separately:
If `-c` is omitted, the script will download the checkpoint by itself; however, downloading it separately is highly recommended.
Use `-h` to see the available options.

### Validation set
Some TFRecords are set aside to be used as a validation set. This is done simply by renaming them, for example:
```bash
$ wget http://download.tensorflow.org/models/inception_resnet_v2_2016_08_30.tar.gz
$ tar -xvf inception_resnet_v2_2016_08_30.tar.gz
mv 'data/tfrecords/lab_images_0.tfrecord' 'data/tfrecords/val_lab_images_0.tfrecord'
```
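To split off a whole fraction of the records at once, the renaming can be scripted. A hypothetical helper (the `val_` prefix matches the example above; the file-name pattern and everything else are assumptions):

```python
import random
from pathlib import Path


def make_validation_split(records_dir, fraction=0.1, seed=0):
    """Rename a random fraction of TFRecords with a 'val_' prefix, return new names."""
    records = sorted(Path(records_dir).glob("lab_images_*.tfrecord"))
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    chosen = rng.sample(records, max(1, int(len(records) * fraction)))
    new_names = []
    for rec in chosen:
        target = rec.with_name("val_" + rec.name)
        rec.rename(target)
        new_names.append(target.name)
    return new_names
```

As the note above says, do this once, before any training, so the train/validation split stays fixed across runs.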

Omitting the `-O` (optimize) flag will print each image name at the moment it is written to
a TFRecord. These prints will most likely appear all at once,
after TensorFlow has written the batch to disk and passed control back to Python.

Use `-h` to see the available options

## Space on disk notes

### The images

Out of the first 200 links, we get 131 valid images, which in their original
size take up a total of 17 MB, and 2.5 MB once resized to 299x299.

### The TFRecords

Originally, the images are stored with `jpeg` compression, which keeps their
size quite small. Once stored in a TFRecord, however, they are in raw byte
format and take up much more space.
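The difference is easy to estimate with back-of-the-envelope arithmetic (the exact per-example layout depends on what the record actually stores, so treat these numbers as an illustration only):

```python
# Rough size of one 299x299 RGB image in raw bytes vs. jpeg.
height = width = 299
channels = 3

raw_uint8 = height * width * channels  # 268,203 bytes (~262 KiB) as uint8
raw_float32 = 4 * raw_uint8            # ~1 MiB if the pixels are stored as float32

# The 131 originals above average roughly 130 KB each as jpeg (17 MB / 131),
# so raw storage costs several times the compressed size per image.
print(raw_uint8, raw_float32)
```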
@@ -99,6 +92,7 @@
Keep in mind that one example is made of:
To save space we can use one of TFRecord compression options, or compress the
files after creation with a command like:

```bash
RECORD='data/tfrecords/lab_images_0.tfrecord'
7z a -t7z -m0=lzma -mx=9 -mfb=64 -md=32m -ms=on "$RECORD.7z" "$RECORD"
```
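If `7z` is not available, the same post-hoc compression can be done from Python with the standard library's LZMA support (a sketch; the `.xz` container and compression ratio will differ somewhat from the 7z command above):

```python
import lzma
import shutil
from pathlib import Path


def compress_record(path):
    """Compress a file to '<name>.xz' with LZMA and return the new path."""
    src = Path(path)
    dst = src.with_name(src.name + ".xz")
    # preset=9 trades compression time for the smallest output, like 7z -mx=9
    with src.open("rb") as f_in, lzma.open(dst, "wb", preset=9) as f_out:
        shutil.copyfileobj(f_in, f_out)
    return dst
```

Remember that compressed records must be decompressed again before TensorFlow can read them, unless the TFRecords were written with a built-in compression option in the first place.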
20 changes: 11 additions & 9 deletions INSTRUCTIONS.md
@@ -5,16 +5,16 @@
The project is based on Python 3.6; to manage the dependencies contained in
[`requirements.txt`](requirements.txt), a virtual environment is recommended.

```bash
$ python3.6 -m venv venv
$ source venv/bin/activate
$ pip install -e .
python3.6 -m venv venv
source venv/bin/activate
pip install -e .
```

Even better, use a [Conda environment](https://docs.conda.io/):
```bash
$ conda create -y -n koalarization python=3.6
$ conda activate koalarization
$ pip install -e .
conda create -y -n koalarization python=3.6
conda activate koalarization
pip install -e .
```

For GPU support, run:
@@ -44,10 +44,12 @@
Before training, ensure that the folder `data` contains:
(just rename some training records as validation, but do it before any training!)

The training script will train on all the training images, and regularly
checkpoint the weights and save to disk some colored images from the validation set.

All training logs, metrics and checkpoints are saved in `runs/run_id`.

```bash
$ python -m koalarization.train \
python -m koalarization.train \
--run-id 'run1' \
--train-steps 100 \
--val-every 20 \
```

@@ -58,7 +60,7 @@
The evaluation script will load the latest checkpoint, colorize images from the
records and save them to disk. At the moment it is not possible to operate on regular image
files (e.g. `jpeg` or `png`) directly; the images must first be processed into TFRecords.
```bash
$ python -m koalarization.evaluate \
python -m koalarization.evaluate \
--run-id 'run1' \
'data/tfrecords' 'runs/'
```
38 changes: 20 additions & 18 deletions setup.py
```diff
@@ -4,38 +4,40 @@
 import glob
 
 this_directory = os.path.abspath(os.path.dirname(__file__))
-with open(os.path.join(this_directory, 'README.md')) as f:
+with open(os.path.join(this_directory, "README.md")) as f:
     long_description = f.read()
 
-with open(os.path.join(this_directory, 'requirements.txt')) as f:
+with open(os.path.join(this_directory, "requirements.txt")) as f:
     requirements = f.readlines()
 
 setup(
-    name='deep-koalarization',
+    name="deep-koalarization",
     version="0.2.0",
     description="Keras/Tensorflow implementation of our paper Grayscale Image Colorization using deep CNN and "
-                "Inception-ResNet-v2",
+    "Inception-ResNet-v2",
     long_description=long_description,
-    long_description_content_type='text/markdown',
-    url='http://github.com/baldassarreFe/deep-koalarization',
-    author='Federico Baldassare, Diego González Morín, Lucas Rodés-Guirao',
-    license='GPL-v3',
+    long_description_content_type="text/markdown",
+    url="http://github.com/baldassarreFe/deep-koalarization",
+    author="Federico Baldassare, Diego González Morín, Lucas Rodés-Guirao",
+    license="GPL-v3",
     install_requires=requirements,
-    extras_require={"gpu": ['tensorflow-gpu']},
-    packages=find_packages('src'),
-    package_dir={'': 'src'},
+    extras_require={"gpu": ["tensorflow-gpu==1.3.0"]},
+    packages=find_packages("src"),
+    package_dir={"": "src"},
     # namespace_packages=['koalarization'],
-    py_modules=[os.path.splitext(os.path.basename(path))[0] for path in glob.glob('src/*.py')],
+    py_modules=[
+        os.path.splitext(os.path.basename(path))[0] for path in glob.glob("src/*.py")
+    ],
     include_package_data=True,
     zip_safe=False,
     classifiers=[
         "Development Status :: 3 - Alpha",
-        "Programming Language :: Python :: 3.6"
+        "Programming Language :: Python :: 3.6",
     ],
-    keywords='Image colorization using Deep Learning CNNs',
+    keywords="Image colorization using Deep Learning CNNs",
     project_urls={
-        'Website': 'https://lcsrg.me/deep-koalarization',
-        'Github': 'http://github.com/baldassarreFe/deep-koalarization'
+        "Website": "https://lcsrg.me/deep-koalarization",
+        "Github": "http://github.com/baldassarreFe/deep-koalarization",
     },
-    python_requires='>=3.5'
-)
+    python_requires=">=3.5",
+)
```