Abstract: Image captioning, has been one of the most intriguing topics in deep learning. It incorporates the knowledge of both image processing and natural language processing. Most of the current approaches integrate the concepts of neural network. Many pre-defined convolutional neural network (CNN) models are used for extracting features of an image and bi-directional or uni-directional recurrent neural network (RNN) for sentence creation as decoder. This paper discusses about the commonly used models that are used as image encoder, such as VGG16, VGG19, Inception-V3 and InceptionResNetV2 while using the uni-directional LSTMs for the sentence generation. The comparative analysis of the result has been obtained using the BLEU score on the Flickr8k dataset.
Keywords: Image Captioning, CNN, LSTM, BLEU
| DOI: 10.17148/IJARCCE.2022.11810