This project generates novel, natural-language captions for input images using an encoder-decoder architecture: a CNN encodes each image into visual features, and an LSTM RNN decodes those features into a caption. The project is implemented in TensorFlow and is roughly based on the paper "Show, Attend and Tell: Neural Image Caption Generation with Visual Attention" by Kelvin Xu et al. All of the code is thoroughly commented for ease of understanding. The schematic below roughly shows the structure of the training graph ('Z' denotes the context vector):
- Uses a multi-threaded input pipeline built on TensorFlow Queues, producing a fast, steady stream of inputs from multiple TFRecord files (shards), with rigorous shuffling of the training data (a pipeline sketch follows this list).
- Uses the NASNet architecture from Google's AutoML project to extract visual features from images. NASNet currently has the highest recorded accuracy on the ImageNet LSVRC 2012 data set (a feature-extraction sketch also follows the list).
- Trained on the MS COCO 2017 training data set. The final pre-processed data consists of 587,605 image-caption pairs. The vocabulary consists of 10,204 words (words occurring >= 5 times in the MS COCO captions data).
- The LSTM RNN is combined with a soft attention mechanism, which computes attention weights, applies them to the image features, and produces a "context vector" that is fed into the LSTM as an additional input alongside the hidden state. This gives the LSTM more contextual information at every time-step, yielding higher-quality captions while decoding (see the attention sketch after this list).
- Adds L2 regularization to all fully-connected layers and applies dropout with a ratio of 0.5 to the LSTM and FC layers to prevent overfitting (sketched after the list).
- The image-caption pairs from the data set were packed into SequenceExample protocol buffers and written to files in the TFRecord file format, which enables faster, asynchronous reading of the data by the input pipeline (see the last sketch below).
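For reference, here is a minimal sketch of the kind of queue-based, multi-threaded shard-reading pipeline described above, assuming TensorFlow 1.x. The file pattern, batch size, and queue capacities are placeholders, not the repo's actual values:

```python
import tensorflow as tf  # TensorFlow 1.x queue-based API

# Hypothetical shard pattern; the actual values in the repo may differ.
filenames = tf.gfile.Glob("data/train-?????-of-00016.tfrecord")

# First level of shuffling: the queue of shard filenames is reshuffled
# every epoch.
filename_queue = tf.train.string_input_producer(filenames, shuffle=True)

reader = tf.TFRecordReader()
_, serialized = reader.read(filename_queue)

# Second level of shuffling: shuffle_batch keeps a buffer of examples
# and samples each batch from it, using multiple reader threads.
serialized_batch = tf.train.shuffle_batch(
    [serialized], batch_size=32,
    capacity=10000, min_after_dequeue=2000, num_threads=4)

# At run time, tf.train.start_queue_runners(sess) starts the threads
# that keep these queues filled.
```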
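The repo loads NASNet from the AutoML project's pre-trained checkpoints; as an illustrative alternative (not the repo's exact code), the same feature extraction can be sketched with Keras applications:

```python
import numpy as np
import tensorflow as tf

# NASNet-Large as a fixed feature extractor with ImageNet weights.
encoder = tf.keras.applications.NASNetLarge(
    include_top=False, weights='imagenet', pooling=None)

# A dummy 331x331 RGB image with values in [0, 255].
image = np.random.uniform(0, 255, size=(1, 331, 331, 3)).astype('float32')
image = tf.keras.applications.nasnet.preprocess_input(image)

# The output is a spatial feature map of shape (1, 11, 11, 4032); each of
# the 121 spatial positions becomes one annotation vector that the
# attention mechanism can weight.
features = encoder.predict(image)
features = features.reshape(1, -1, 4032)  # (1, 121, 4032)
```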
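The soft attention step can be sketched as additive (Bahdanau-style) scoring followed by a softmax-weighted sum, as in Xu et al. The layer names and the 512-unit attention size here are illustrative assumptions:

```python
import tensorflow as tf  # TensorFlow 1.x

def soft_attention(features, hidden, attn_units=512):
    """Soft attention sketch.

    features: (batch, num_regions, feat_dim) CNN annotation vectors a_i.
    hidden:   (batch, hidden_dim) previous LSTM hidden state h_{t-1}.
    Returns the context vector z_t of shape (batch, feat_dim) and the
    attention weights alpha of shape (batch, num_regions, 1).
    """
    # Project image features and hidden state into a shared space.
    feat_proj = tf.layers.dense(features, attn_units, name='attn_feat')
    hid_proj = tf.layers.dense(hidden, attn_units, name='attn_hid')
    # Broadcast the hidden projection over regions and score each region.
    scores = tf.layers.dense(
        tf.tanh(feat_proj + hid_proj[:, tf.newaxis, :]), 1, name='attn_score')
    alpha = tf.nn.softmax(scores, axis=1)  # attention weights, sum to 1
    # Context vector: attention-weighted sum of the annotation vectors.
    z = tf.reduce_sum(alpha * features, axis=1)
    return z, alpha
```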
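A minimal sketch of wiring up the L2 regularization and dropout in TensorFlow 1.x; the cell size, regularization weight, and layer names are placeholders:

```python
import tensorflow as tf  # TensorFlow 1.x

keep_prob = tf.placeholder_with_default(0.5, shape=[])  # 1.0 at inference

# Dropout on the LSTM inputs and outputs (the cell size is illustrative).
cell = tf.nn.rnn_cell.LSTMCell(512)
cell = tf.nn.rnn_cell.DropoutWrapper(
    cell, input_keep_prob=keep_prob, output_keep_prob=keep_prob)

# An L2-regularized fully-connected layer; the 1e-4 weight is a placeholder.
lstm_output = tf.placeholder(tf.float32, [None, 512])
logits = tf.layers.dense(
    lstm_output, 10204,  # vocabulary size from the data set
    kernel_regularizer=tf.contrib.layers.l2_regularizer(1e-4),
    name='fc_logits')
logits = tf.nn.dropout(logits, keep_prob=keep_prob)

# The total training loss adds the collected L2 penalty terms.
l2_loss = tf.losses.get_regularization_loss()
```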
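Packing one image-caption pair into a SequenceExample might look like the following sketch; the feature keys ('image/data', 'caption_ids') are assumptions, not necessarily the repo's:

```python
import tensorflow as tf  # TensorFlow 1.x

def make_sequence_example(image_bytes, caption_ids):
    """Pack one image-caption pair into a tf.train.SequenceExample."""
    # Fixed-length context: the encoded image.
    context = tf.train.Features(feature={
        'image/data': tf.train.Feature(
            bytes_list=tf.train.BytesList(value=[image_bytes])),
    })
    # Variable-length feature list: the caption word ids, one per step.
    caption = tf.train.FeatureList(feature=[
        tf.train.Feature(int64_list=tf.train.Int64List(value=[wid]))
        for wid in caption_ids])
    feature_lists = tf.train.FeatureLists(
        feature_list={'caption_ids': caption})
    return tf.train.SequenceExample(context=context,
                                    feature_lists=feature_lists)

with tf.python_io.TFRecordWriter('train-00000-of-00016.tfrecord') as writer:
    example = make_sequence_example(b'\x89PNG...', [1, 42, 7, 2])
    writer.write(example.SerializeToString())
```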
To Do: Use a beam search decoder with the LSTM instead of the greedy decoder for higher-quality captions.
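For illustration, a framework-agnostic beam search over a hypothetical `step_fn` (which returns next-word log-probabilities for a partial caption; the repo's LSTM step would also carry recurrent state) could look like this sketch:

```python
import numpy as np

def beam_search(step_fn, start_id, end_id, beam_size=3, max_len=20):
    """Minimal beam search decoder sketch.

    step_fn(prefix) -> log-probabilities over the vocabulary for the
    next word, given the partial caption `prefix` (a list of word ids).
    """
    beams = [([start_id], 0.0)]  # (prefix, cumulative log-probability)
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            if prefix[-1] == end_id:  # finished captions pass through
                candidates.append((prefix, score))
                continue
            log_probs = step_fn(prefix)
            # Expand each beam with its beam_size most likely next words.
            for wid in np.argsort(log_probs)[-beam_size:]:
                candidates.append(
                    (prefix + [int(wid)], score + log_probs[wid]))
        # Keep only the beam_size highest-scoring partial captions.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
    return beams[0][0]
```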
The model achieves the following scores on the COCO 2014 validation and test data:
- BLEU-1: 70.07
- BLEU-2: 52.88
- BLEU-3: 37.95
- BLEU-4: 26.68
- METEOR: 23.61
- ROUGE-L: 51.29
- CIDEr: 84.81
An Android application was also developed; it sends images to an online Flask server, which runs the frozen inference graph and returns the generated captions to the client.
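A minimal sketch of such a server, assuming TensorFlow 1.x; the graph file name and the tensor names ('image_feed:0', 'decoder/word_ids:0') are placeholders, not the repo's actual ones:

```python
import tensorflow as tf  # TensorFlow 1.x
from flask import Flask, request, jsonify

app = Flask(__name__)

# Load the frozen inference graph once at startup.
graph_def = tf.GraphDef()
with tf.gfile.GFile('frozen_inference_graph.pb', 'rb') as f:
    graph_def.ParseFromString(f.read())
graph = tf.Graph()
with graph.as_default():
    tf.import_graph_def(graph_def, name='')
sess = tf.Session(graph=graph)

@app.route('/caption', methods=['POST'])
def caption():
    # The client POSTs the raw encoded image bytes in the request body.
    image_bytes = request.get_data()
    word_ids = sess.run('decoder/word_ids:0',
                        feed_dict={'image_feed:0': image_bytes})
    return jsonify({'word_ids': word_ids.tolist()})

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=5000)
```

The captions generated for the COCO test and validation images are usually relevant to the images. However, I wanted to know how the model performs on random images that are NOT part of the MS COCO test and validation data sets. Below are examples of captions generated using the Android application on some such images: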
- Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Richard Zemel, Yoshua Bengio.
- Show, Attend and Tell slides
- Attention Mechanism Blog Post
- Interpreting, Training, and Distilling Seq2Seq Models, Alexander Rush (@harvardnlp)
- NASNet Architecture and pre-trained checkpoints
- The original paper's implementation in Theano
- Google's implementation of "Show and Tell: A Neural Image Caption Generator"