Abstract: Visual Question Answering (VQA) has become an increasingly important subject in artificial intelligence because it sits at the critical nexus of Computer Vision (CV) and Natural Language Processing (NLP), where it has emerged as a major research area due to the cognitive capability it demands. In image captioning and video summarization, the required semantic information is already present in the still photos or video dynamics; it only needs to be extracted and articulated in a way that makes sense to humans. VQA, by contrast, doubles the effort demanded of an artificial intelligence system: semantic information extracted from the same medium must be compared with the semantics implied by a query expressed in natural language. In this work, a Transformer model is applied to the CV field and combined with a transformer-based NLP algorithm to construct a VQA system, trained on a large number of real-scene photographs [1-3] from the Kaggle platform. The experimental results validate the usefulness of the model by demonstrating that it provides accurate answers in simple, ordered settings, while a clear discrepancy remains between the generated results and the ground-truth answers in cluttered scenes. We also review the issues and challenges in VQA research under the current scenario.
Keywords: Visual Question Answering (VQA), Computer Vision (CV), Natural Language Processing (NLP), Long Short-Term Memory (LSTM), MDETR, VQA issues and challenges, VQA in ontology.
DOI: 10.17148/IJARCCE.2024.13708