ImageCaptioning

The attention mechanism is the foundation of image captioning models.
These models use an encoder-decoder architecture.
The encoder generates image feature vectors from the given images using a convolutional neural network (e.g. VGG16, InceptionV3, ResNet50).
Recurrent neural networks (RNNs) are used as decoders (e.g. Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU)).

Model:-

Here we have used InceptionV3 as the encoder and a GRU as the decoder.

Features are extracted from the last convolutional layer of InceptionV3, giving a tensor of shape (8, 8, 2048) per image.
This tensor is flattened to a shape of (64, 2048), i.e. 64 image locations with 2048 features each.
It is then passed through the CNN encoder (which consists of a single fully connected layer).
At each step, the RNN (here a GRU) attends over the encoded image locations to predict the next word.
The model was trained on a subset of the COCO 2017 dataset for 100 epochs.
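
The shape flow described above can be sketched in a few lines of NumPy. This is only an illustration, not the repository's training code: the README does not say which attention variant is used, so additive (Bahdanau-style) attention is assumed here, and `embedding_dim`, `units`, the ReLU activation, and the random weights are all hypothetical placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed sizes (not stated in the README); chosen only for illustration.
embedding_dim, units = 256, 512

# Stand-in for the InceptionV3 feature map of one image: (8, 8, 2048).
feature_map = rng.standard_normal((8, 8, 2048))

# Flatten the 8x8 spatial grid into 64 locations: (64, 2048).
features = feature_map.reshape(-1, feature_map.shape[-1])

# Encoder: a single fully connected layer (the ReLU is an assumption).
W_enc = rng.standard_normal((2048, embedding_dim)) * 0.01
b_enc = np.zeros(embedding_dim)
encoded = np.maximum(features @ W_enc + b_enc, 0.0)  # (64, embedding_dim)


def additive_attention(encoded, hidden, W1, W2, v):
    """Return (context, weights) for one decoder step."""
    # Score every image location against the current decoder hidden state.
    scores = np.tanh(encoded @ W1 + hidden @ W2) @ v  # (64,)
    # Numerically stable softmax over the 64 locations.
    exp_scores = np.exp(scores - scores.max())
    weights = exp_scores / exp_scores.sum()
    # Context vector: attention-weighted sum of the encoded locations.
    context = weights @ encoded  # (embedding_dim,)
    return context, weights


# Randomly initialised attention parameters (placeholders for learned weights).
W1 = rng.standard_normal((embedding_dim, units)) * 0.01
W2 = rng.standard_normal((units, units)) * 0.01
v = rng.standard_normal(units) * 0.01
hidden = np.zeros(units)  # initial GRU hidden state

context, weights = additive_attention(encoded, hidden, W1, W2, v)
print(features.shape, encoded.shape, context.shape, weights.shape)
```

In a full decoder, `context` would typically be concatenated with the embedding of the previous word and fed to the GRU, whose new hidden state is then used for the next attention step.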

This is a sample result :-

For more results, refer to ImageCaptioning_results.pdf.
