Abstract: With the exponential growth in the creation and sharing of images across online platforms, there is a pressing need for systems that enable machines to understand and describe these images. While humans can easily comprehend visual content, automated image captioning systems are needed to produce meaningful descriptions for a wide range of applications. Image captioning aims to extract semantic information from an image and express it in natural language. This requires analyzing the image to identify salient objects, their attributes, and the relationships between them. Deep learning techniques, in particular convolutional neural networks (CNNs), are used to extract these visual features, and a transformer-based model then processes the features to generate coherent textual captions. This work examines approaches to image captioning, with a focus on the roles of CNNs and transformers in automating the generation of descriptive captions. The study aims to improve machines' ability to understand and describe visual content, advancing fields such as computer vision, artificial intelligence, and human-computer interaction.

Keywords: Image, Caption, Convolutional Neural Network (CNN), Recurrent Neural Network (RNN), Long Short-Term Memory (LSTM), Neural Networks.


DOI: 10.17148/IJARCCE.2025.14429
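
As a rough illustration of the pipeline described in the abstract, the sketch below pairs a CNN encoder with a transformer decoder to produce caption token logits. It is a minimal example in PyTorch and not the paper's implementation; the ResNet-18 backbone, vocabulary size, and model dimensions are illustrative assumptions (a recent torchvision release is assumed).

```python
# Minimal sketch: CNN encoder + transformer decoder for image captioning.
# All hyperparameters and the ResNet-18 backbone are illustrative assumptions.
import torch
import torch.nn as nn
from torchvision import models


class CaptionModel(nn.Module):
    def __init__(self, vocab_size=10000, d_model=512, nhead=8, num_layers=3):
        super().__init__()
        # CNN encoder: ResNet-18 with its pooling and classifier heads removed,
        # so it outputs a grid of visual features rather than class scores.
        backbone = models.resnet18(weights=None)
        self.cnn = nn.Sequential(*list(backbone.children())[:-2])
        self.proj = nn.Linear(512, d_model)  # map CNN channels to d_model

        # Transformer decoder that attends over the projected image features.
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, images, captions):
        # images: (B, 3, H, W); captions: (B, T) token ids
        feats = self.cnn(images)                  # (B, 512, h, w)
        feats = feats.flatten(2).transpose(1, 2)  # (B, h*w, 512)
        memory = self.proj(feats)                 # (B, h*w, d_model)

        tgt = self.embed(captions)                # (B, T, d_model)
        T = captions.size(1)
        # Causal mask so each position only attends to earlier tokens.
        mask = torch.triu(torch.full((T, T), float("-inf")), diagonal=1)
        hidden = self.decoder(tgt, memory, tgt_mask=mask)
        return self.out(hidden)                   # (B, T, vocab_size)


# Example: one forward pass on random data.
model = CaptionModel()
logits = model(torch.randn(2, 3, 224, 224), torch.randint(0, 10000, (2, 12)))
print(logits.shape)  # torch.Size([2, 12, 10000])
```

In this design the CNN supplies the visual features mentioned in the abstract, while the transformer decoder's cross-attention over those features plays the role of generating the textual caption one token at a time.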
