Abstract: An image description generator based on deep learning can describe an image's content in properly constructed, meaningful English sentences. Images are captured continuously in real time by the user's camera or mobile phone. Our model extracts features from an image using a convolutional neural network (CNN) and provides them to a recurrent neural network (RNN), specifically a long short-term memory (LSTM) network, to generate a reliable English description. CNNs are state-of-the-art methods for object recognition and detection and have been studied for a wide variety of image tasks. In more detail, for every input image we extract features from the fc7 layer of the VGG-16 network, which has been trained on ImageNet and is well suited to object detection. Because of the computational cost of the LSTM, we first obtain a 4096-dimensional image feature vector and then reduce it to a 512-dimensional vector using principal component analysis (PCA). These features are fed into the LSTM network to produce an accurate English description of the image, which can then be converted to audio using text-to-speech technology.
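The pipeline in the abstract (4096-dimensional fc7 features, PCA reduction to 512 dimensions, LSTM decoding) can be sketched as below. This is a minimal illustration on synthetic data, not the paper's implementation: the random `fc7` tensor stands in for VGG-16 fc7 activations, `torch.pca_lowrank` stands in for the PCA step, and the `CaptionDecoder` architecture (embedding size, greedy decoding, feature used as the initial hidden state) is assumed, since the abstract does not specify those details.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
fc7 = torch.randn(1000, 4096)               # stand-in fc7 features, one row per image

# Step 1: reduce each 4096-d feature to 512 dimensions (PCA step).
U, S, V = torch.pca_lowrank(fc7, q=512)     # V: (4096, 512) principal directions
reduced = fc7 @ V                           # (1000, 512) reduced features

# Step 2: LSTM decoder seeded with the reduced image feature
# (assumed design: feature as initial hidden state, greedy decoding).
class CaptionDecoder(nn.Module):
    def __init__(self, vocab_size=1000, embed_dim=256, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, image_feat, max_len=10, start_token=0):
        h = image_feat.unsqueeze(0)         # (1, batch, hidden) initial hidden state
        c = torch.zeros_like(h)
        token = torch.full((image_feat.size(0), 1), start_token, dtype=torch.long)
        out_ids = []
        for _ in range(max_len):
            y, (h, c) = self.lstm(self.embed(token), (h, c))
            token = self.out(y).argmax(-1)  # greedy choice of the next word id
            out_ids.append(token)
        return torch.cat(out_ids, dim=1)    # (batch, max_len) word ids

caption_ids = CaptionDecoder()(reduced[:2])  # decode captions for two images
print(reduced.shape, caption_ids.shape)
```

In a real system the word ids would be mapped back to vocabulary tokens and the resulting sentence passed to a text-to-speech engine, as the abstract describes.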

Keywords: Caption Generator, Feature Extraction, LSTM, Neural Network, Object Detection

DOI: 10.17148/IJARCCE.2023.124190
