Abstract: Visual Question Answering (VQA) is a complex multimodal task that requires joint understanding of visual content and natural language queries, yet traditional models often struggle to construct a complete semantic representation of both. Conventional VQA systems rely on deep visual feature extraction and a linguistic encoder for the question, but they commonly fail to capture global context, precise object interactions, and long-range dependencies. A major limitation of early VQA models is strong language bias, where the system predicts answers from frequently occurring question-answer patterns rather than from genuine visual grounding. To address these issues, recent research has introduced image captioning as a complementary semantic modality that enriches scene understanding. Captions describe object attributes, relationships, and contextual cues that may be missing or underrepresented in raw visual features. Integrating them through attention mechanisms such as Attention Aware modules or Question-Guided Parallel Attention allows models to filter out irrelevant tokens and retain meaningful semantics. The fused representation is a more robust and contextually aligned multimodal embedding that strengthens reasoning across diverse question types. Experimental results on benchmark datasets show that caption-enhanced approaches offer consistent improvements in accuracy and interpretability, although they remain dependent on caption quality and introduce additional computational complexity. Nonetheless, the integration of caption-generated semantics is a promising direction toward more context-aware and visually grounded VQA systems capable of more reliable, human-like reasoning.
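
As a rough illustration of the question-guided filtering described in the abstract, the sketch below weights caption tokens and visual regions by their similarity to the question and concatenates the results. It is a minimal sketch under stated assumptions, not the architecture of any reviewed model: generic scaled dot-product attention stands in for the Attention Aware and Question-Guided Parallel Attention modules, fusion by concatenation is assumed, and the names `attend` and `fuse` and the toy vectors are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, tokens):
    """Scaled dot-product attention of one query vector over token vectors.

    Returns the attention-weighted context vector and the weights.
    Generic stand-in only; not the modules named in the abstract.
    """
    scale = math.sqrt(len(query))
    scores = [sum(q * t for q, t in zip(query, tok)) / scale for tok in tokens]
    weights = softmax(scores)
    context = [
        sum(w * tok[i] for w, tok in zip(weights, tokens))
        for i in range(len(query))
    ]
    return context, weights

def fuse(question, caption_tokens, visual_regions):
    """Attend over both modalities with the same question, then concatenate.

    Concatenation is an assumed fusion choice made for illustration.
    """
    cap_ctx, cap_w = attend(question, caption_tokens)
    vis_ctx, vis_w = attend(question, visual_regions)
    return question + cap_ctx + vis_ctx, cap_w, vis_w

# Toy 4-dimensional embeddings (hypothetical values, not real features).
question = [1.0, 0.0, 0.0, 0.0]
caption_tokens = [[2.0, 0.0, 0.0, 0.0], [0.0, 2.0, 0.0, 0.0], [0.0, 0.0, 2.0, 0.0]]
visual_regions = [[0.0, 1.0, 0.0, 0.0], [3.0, 0.0, 0.0, 0.0]]

fused, cap_w, vis_w = fuse(question, caption_tokens, visual_regions)
print(len(fused), cap_w, vis_w)
```

In this toy input, the caption token and the visual region aligned with the question vector receive the largest weights, which is the filtering effect the abstract attributes to question-guided attention.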

Keywords: Visual Question Answering, Attention Aware, Question-Guided Parallel Attention, Image Captioning, Deep Learning, VQA v1, VQA v2.


DOI: 10.17148/IJARCCE.2025.141219

How to Cite:

[1] Sarah Jose, Goutham Krishna L U, "A Review on Visual Question Answering By Image Captioning," International Journal of Advanced Research in Computer and Communication Engineering (IJARCCE), DOI: 10.17148/IJARCCE.2025.141219
