International Journal of Advanced Research in Computer and Communication Engineering
A monthly peer-reviewed and refereed journal
ISSN Online: 2278-1021 | ISSN Print: 2319-5940 | Since 2012
Volume 13, Issue 7, July 2024

Visual Question and Answering (VQA): ViT/SwinT and BERT/RoBERTA

Adarsh Pujari, Digambar Dhanagar, Milan Srinivas, Aryaman Shukla, Rishi Singh

DOI: 10.17148/IJARCCE.2024.13708

Abstract: Visual Question Answering (VQA) has become an increasingly important topic in artificial intelligence because it sits at the critical intersection of Computer Vision (CV) and Natural Language Processing (NLP), and the cognitive capability it demands has made it a major research area in both fields. In image captioning and video summarization, the required semantic information is already present in still photographs or video dynamics; it only needs to be extracted and articulated in a way that makes sense to humans. VQA, by contrast, doubles the burden on artificial intelligence: semantic information extracted from the visual medium must be compared against the semantics implied by a question expressed in natural language. We apply a transformer model to the CV task and combine it with a transformer-based NLP model to construct a VQA system, trained on a large number of real-scene photographs [1-3] from the Kaggle platform. The experimental results validate the usefulness of the model: it provides accurate answers in simple, ordered scenes, while a clear discrepancy remains between the generated results and the ground-truth answers in chaotic scenes. We also survey the issues and challenges facing VQA research in the current scenario.
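The architecture described above pairs a vision transformer (ViT/SwinT) with a text transformer (BERT/RoBERTa) and fuses their outputs to classify over an answer vocabulary. The paper does not specify the fusion scheme; the following is a minimal sketch of one common choice (element-wise-product late fusion followed by a linear classifier), with random vectors standing in for the pretrained encoders' [CLS] embeddings and a hypothetical 10-answer vocabulary:

```python
import numpy as np

rng = np.random.default_rng(0)

D = 768           # shared feature dimension (typical of ViT-Base / BERT-Base)
NUM_ANSWERS = 10  # size of the toy answer vocabulary (assumption)

# Stand-ins for pretrained encoder outputs: in the real pipeline these would be
# the ViT/SwinT [CLS] embedding of the image and the BERT/RoBERTa [CLS]
# embedding of the question.
image_feat = rng.standard_normal(D)
question_feat = rng.standard_normal(D)

# Late fusion by element-wise product, a common lightweight fusion scheme.
fused = image_feat * question_feat

# Linear classifier over the answer vocabulary. Weights are random here; in
# practice they are learned on VQA question-answer pairs.
W = rng.standard_normal((NUM_ANSWERS, D)) / np.sqrt(D)
b = np.zeros(NUM_ANSWERS)
logits = W @ fused + b

# Softmax yields a distribution over candidate answers; the argmax is the
# predicted answer index.
probs = np.exp(logits - logits.max())
probs /= probs.sum()
answer_id = int(np.argmax(probs))
print(answer_id)
```

Treating VQA as classification over a fixed answer set (rather than free-form generation) is the standard formulation for datasets of real-scene photographs like the one used here.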

Keywords: Visual Question Answering (VQA), Computer Vision (CV), Natural Language Processing (NLP), Long Short-Term Memory (LSTM), MDETR, VQA issues and challenges, VQA in ontology.

How to Cite:

[1] Adarsh Pujari, Digambar Dhanagar, Milan Srinivas, Aryaman Shukla, Rishi Singh, "Visual Question and Answering (VQA): ViT/SwinT and BERT/RoBERTA," International Journal of Advanced Research in Computer and Communication Engineering (IJARCCE), vol. 13, no. 7, July 2024, DOI: 10.17148/IJARCCE.2024.13708.