Project work done for the course CS536: Pattern Recognition and Machine Learning at Rutgers University.
Generative Adversarial Networks (GANs) are deep-learning-based generative models. They perform implicit density estimation, fall under unsupervised learning, and use two neural networks; this explains the terms “generative” and “networks” in “generative adversarial networks”. GANs can be used for a wide variety of purposes such as style transfer, photo blending, and image-to-image translation. In this task, we train GANs on the real-pizza and synthetic-pizza datasets so that we can generate our own synthetic pizza images.
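The sketch below illustrates the adversarial interplay between the two networks with one PyTorch training step. The generator `G`, discriminator `D` (assumed to end in a sigmoid and output one score per image), the optimizers, and the latent dimension are all hypothetical placeholders, not the exact models used in this project.

```python
import torch
import torch.nn as nn

criterion = nn.BCELoss()

def train_step(G, D, real_images, opt_G, opt_D, latent_dim=100, device="cpu"):
    """One adversarial update: first the discriminator, then the generator."""
    batch_size = real_images.size(0)
    real_labels = torch.ones(batch_size, 1, device=device)
    fake_labels = torch.zeros(batch_size, 1, device=device)

    # Discriminator: distinguish real pizza images from generated ones.
    opt_D.zero_grad()
    noise = torch.randn(batch_size, latent_dim, device=device)
    fake_images = G(noise)
    loss_D = criterion(D(real_images), real_labels) + \
             criterion(D(fake_images.detach()), fake_labels)
    loss_D.backward()
    opt_D.step()

    # Generator: try to fool the discriminator into labelling fakes as real.
    opt_G.zero_grad()
    loss_G = criterion(D(fake_images), real_labels)
    loss_G.backward()
    opt_G.step()
    return loss_D.item(), loss_G.item()
```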
Conclusions from this step: In this project step we tackled the problem of generating fake images using Generative Adversarial Networks. We trained two GANs, one on the real pizza images and one on the synthetic pizza images. The images generated by the real-pizza GAN were visually appealing and looked very realistic. Performance can be improved further by using a smaller batch size and training for more epochs. We obtained good results after 85 epochs of training; training for 100–150 epochs and augmenting the data, for example by blurring, rotating the images, or adding jitter, should yield better results. Another avenue for future work is fine-tuning the architecture.
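A minimal sketch of the kind of augmentation pipeline suggested above (blurring, rotations, colour jitter), assuming torchvision is available; the specific parameter values are illustrative assumptions, not the settings used in this project.

```python
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomRotation(degrees=15),        # small random rotations
    transforms.ColorJitter(brightness=0.2,        # random brightness/contrast jitter
                           contrast=0.2),
    transforms.GaussianBlur(kernel_size=5,        # mild Gaussian blurring
                            sigma=(0.1, 1.0)),
    transforms.ToTensor(),
])
```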
Image-to-image translation has applications in areas such as converting black-and-white photos and videos to colour, style transfer, and season transfer. In paired image-to-image translation, which is a supervised approach, each image in the source domain is mapped to the desired image in the target domain, and the model is trained to learn this mapping. The architecture used for this technique is called Pix2Pix, a Conditional GAN architecture. In the traditional GAN and DCGAN, we cannot control the class of the image generated by the generator; Conditional GANs overcome this drawback by conditioning the generator and discriminator on specific class labels. The Pix2Pix architecture is an extension of Conditional GANs in which, instead of feeding a random noise vector as input to the generator, the image from the source domain is given as input. The output of the generator is the translated image, i.e. the desired image from the target domain. The discriminator, which is a conditional discriminator, is fed a pair of images as input: one image in this pair is the input image, and the other is either the real output image (the one from the dataset) or the fake output image (the one generated by the generator). The discriminator learns to classify whether the output image is real or fake. We performed paired image-to-image translation on the Dayton dataset, a dataset of street views and overhead views of roads in the US. We also performed a quantitative evaluation of this model using the Frechet Inception Distance (FID) and Inception Score (IS) evaluation metrics.
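The sketch below shows how the conditional discriminator described above sees its input: the source image is concatenated with either the real target image or the generator output along the channel dimension. `D` is a hypothetical discriminator that accepts this 6-channel input; this is an illustrative assumption rather than the exact model used here.

```python
import torch

def discriminator_pass(D, source, target_real, target_fake):
    # Real pair: source image + ground-truth target image.
    pred_real = D(torch.cat([source, target_real], dim=1))
    # Fake pair: source image + generator output.
    pred_fake = D(torch.cat([source, target_fake], dim=1))
    return pred_real, pred_fake
```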
Conclusions from this step: In this project step we tackled the paired image-to-image translation problem, which is challenging in nature because it often requires specialized models and loss functions for a given translation task or dataset. To solve this problem we explored the Pix2Pix GAN, which models the loss function as a combination of L1 distance and adversarial loss, with additional novelties in the design of the generator and discriminator that allow us to generate images that are both plausible in the content of the target domain and a plausible translation of the input image. We were able to reproduce the results of the Pix2Pix architecture. The results were consistent with our evaluation metrics, FID and IS: the images with a lower FID and a higher IS were also the ones that performed well in the qualitative evaluation. For both transformations, Street-to-Aerial and Aerial-to-Street, the FID decreased and the Inception Score increased as the number of training epochs increased. Future work would be to train both Pix2Pix networks (one for each translation direction) with a batch size of 1 and observe the results, and to explore learning-rate schedules to optimize the runtime.
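A minimal sketch of the Pix2Pix generator objective described above, combining the adversarial term with a weighted L1 term. The discriminator `D` is assumed to output raw logits (e.g. a PatchGAN map), and `lambda_l1 = 100` follows the original paper's default; the function and tensor names are illustrative placeholders.

```python
import torch
import torch.nn as nn

adv_loss = nn.BCEWithLogitsLoss()
l1_loss = nn.L1Loss()
lambda_l1 = 100.0

def generator_loss(D, source, fake_target, real_target):
    # Adversarial term: the discriminator should label the fake pair as real.
    pred_fake = D(torch.cat([source, fake_target], dim=1))
    loss_adv = adv_loss(pred_fake, torch.ones_like(pred_fake))
    # L1 term: keep the translation close to the ground-truth target image.
    loss_l1 = l1_loss(fake_target, real_target)
    return loss_adv + lambda_l1 * loss_l1
```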
In the unsupervised domain, the Cycle-Consistent Generative Adversarial Network (CycleGAN) is prominent and has achieved impressive results in many applications using the concept of cycle-consistency. Cycle-consistency means that if an image is translated from the source domain to the target domain, and the translated image is then translated back to the source domain, the original source image should be recovered. In this step we implement CycleGAN to translate images from the Live Pizza domain and the Synthetic (Pre-Recorded) domain to the Real Pizza domain and vice versa.
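The cycle-consistency idea can be expressed compactly as in the sketch below: translating source → target → source should reproduce the original image. `G_ST` and `G_TS` are hypothetical generators for the two directions, and L1 distance is assumed as the cycle criterion.

```python
import torch.nn as nn

cycle_criterion = nn.L1Loss()

def cycle_consistency_loss(G_ST, G_TS, real_source, real_target):
    # Forward cycle: source -> target -> source.
    recovered_source = G_TS(G_ST(real_source))
    # Backward cycle: target -> source -> target.
    recovered_target = G_ST(G_TS(real_target))
    return cycle_criterion(recovered_source, real_source) + \
           cycle_criterion(recovered_target, real_target)
```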
Although the CycleGAN framework works well, it is constrained to pixel-level and shape-level changes and hence cannot remove large objects or irrelevant texture, leading to unrealistic artifacts. To address this drawback, we propose a new loss function for CycleGAN, in which the cycle-consistency loss becomes a linear combination of the VGG perceptual (feature-level) loss and the pixel-level consistency loss. The performance of this new framework is evaluated qualitatively on the basis of the generated images, and quantitatively using the Frechet Inception Distance (FID) and Inception Score (IS).
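A hedged sketch of the modified cycle term described above: a weighted sum of a VGG feature-level (perceptual) loss and the usual pixel-level L1 loss. The VGG-16 layer cut-off and the weights `alpha` and `beta` are illustrative assumptions, not the values used in this project, and the inputs are assumed to be normalized appropriately for the pretrained VGG network.

```python
import torch
import torch.nn as nn
from torchvision import models

# Frozen VGG-16 feature extractor (up to an intermediate conv block).
vgg_features = models.vgg16(pretrained=True).features[:16].eval()
for p in vgg_features.parameters():
    p.requires_grad = False

pixel_criterion = nn.L1Loss()
perceptual_criterion = nn.L1Loss()

def combined_cycle_loss(reconstructed, original, alpha=1.0, beta=10.0):
    # Feature-level (perceptual) term on intermediate VGG activations.
    loss_perceptual = perceptual_criterion(vgg_features(reconstructed),
                                           vgg_features(original))
    # Pixel-level consistency term.
    loss_pixel = pixel_criterion(reconstructed, original)
    return alpha * loss_perceptual + beta * loss_pixel
```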