Image Caption Generation using CNN LSTM Encoder Decoder

Published:

Md Mesbahur Rahman

Description: Image caption generation is a widely used application of sequential generative model. In this project, I designed and trained a CNN-LSTM encoder-decoder architecture for generating caption from an input image. I did this project as part of the requirement of gaduating 'Computer Vision Nanodegree' from Udacity.

My contribution: Pre-processed the images in the MS COCO Dataset using PyTorch Transforms and converted the captions in the training set into sequence of integers using BOW vocabulary dictionary with a vocabulary threshold of 5. Defined and trained a CNN encoder and a LSTM Decoder on top of a time distributed embedding layer by using pretrained RESNET50 model as a feature extractor to encode an input image into a fixed embed sized vector and then used LSTM decoder to generate captions from the output embedding vector of the CNN encoder. Configurations of the data pre-processing and CNN encoder and LSTM decoder were inspired from this paper. Then inference was done on the 'test' portion of the MS COCO dataset.

Resources: [Code]