Abstract: Image captioning, the task of automatically describing an image in natural language, has attracted attention from researchers in both natural language processing (NLP) and computer vision. Recent work largely adopts an encoder-decoder framework, using convolutional neural networks (CNNs) to extract image features and a decoder to generate descriptions; integrating attention mechanisms into this framework has notably improved performance. Leveraging the Transformer model, known for its effectiveness and efficiency on NLP tasks due to its attention mechanisms, we propose a novel approach that combines CNNs and Transformers for image captioning. Our model uses a Transformer-Encoder to refine the extracted image feature representations, enabling the Transformer-Decoder to focus on pertinent image details when generating captions. Additionally, adaptive attention in the Transformer-Decoder determines how much image information to use at each step of caption generation. Trained extensively on the Flickr8K dataset, our model achieves 86.21% accuracy, demonstrating its efficacy and value for image captioning tasks.
Keywords: Image Captioning, CNN, Deep Learning, Transformer, Attention Mechanism, Flickr8k Dataset.
DOI: 10.17148/IJARCCE.2024.13469
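
For concreteness, the sketch below illustrates the kind of CNN + Transformer encoder-decoder pipeline the abstract describes, assuming PyTorch and torchvision. The ResNet-50 backbone, layer sizes, and hyperparameters are illustrative assumptions rather than the paper's configuration, and the adaptive-attention component is omitted for brevity.

```python
# A minimal sketch of a CNN + Transformer captioning model (not the paper's exact
# architecture): a CNN extracts spatial image features, a Transformer-Encoder refines
# them, and a Transformer-Decoder attends to them while generating caption tokens.
import torch
import torch.nn as nn
import torchvision.models as models


class CNNTransformerCaptioner(nn.Module):
    def __init__(self, vocab_size, d_model=512, nhead=8, num_layers=3, max_len=40):
        super().__init__()
        # CNN backbone (assumed ResNet-50): keep the final 7x7 feature grid.
        backbone = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
        self.cnn = nn.Sequential(*list(backbone.children())[:-2])   # (B, 2048, 7, 7)
        self.proj = nn.Linear(2048, d_model)                        # project to model width

        # Transformer encoder-decoder: encoder refines image features,
        # decoder cross-attends to them while predicting the next word.
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=nhead,
            num_encoder_layers=num_layers, num_decoder_layers=num_layers,
            batch_first=True,
        )
        self.embed = nn.Embedding(vocab_size, d_model)
        self.pos = nn.Parameter(torch.zeros(1, max_len, d_model))   # learned positions
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, images, captions):
        feats = self.cnn(images)                                    # (B, 2048, 7, 7)
        feats = feats.flatten(2).transpose(1, 2)                    # (B, 49, 2048)
        memory = self.proj(feats)                                   # (B, 49, d_model)

        tgt = self.embed(captions) + self.pos[:, :captions.size(1)]
        # Causal mask so each position only attends to earlier caption tokens.
        mask = nn.Transformer.generate_square_subsequent_mask(
            captions.size(1)).to(captions.device)
        hidden = self.transformer(memory, tgt, tgt_mask=mask)       # (B, T, d_model)
        return self.out(hidden)                                     # per-token vocabulary logits
```

In training, the logits would be compared against the ground-truth Flickr8K captions with a cross-entropy loss; at inference, captions are generated token by token (e.g. greedily or with beam search) by feeding the decoder its own previous outputs.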