Abstract: Visual Question Answering (VQA) is the task of answering a natural language question about a given image. The input is an image together with a question regarding that image; the system analyzes both and produces the answer. The task therefore combines Computer Vision (CV) and Natural Language Processing (NLP): computer vision is used to analyze the image, and NLP is required to analyze the question and generate the answer. In VQA, the answer is obtained from the interaction between the image and text feature vectors. Among the methods for combining them, those based on the outer product of the two vectors outperform the alternatives. However, computing the outer product explicitly is infeasible because of its high dimensionality, so Multimodal Compact Bilinear Pooling (MCB), a recent technique for VQA, is used to combine the two sets of features efficiently. The model presented here uses MCB twice: once to predict spatial attention over the image, and once more to combine the attended image features with the question features. On the Visual7W dataset and the VQA challenge, this model outperforms the baseline approaches.
Keywords: Natural Language Processing, Computer Vision, Deep Learning, VQA, Count-Sketch Projection
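
To make the pooling step mentioned in the abstract concrete, the following is a minimal NumPy sketch of compact bilinear pooling under the usual count-sketch construction: each feature vector is projected to d dimensions with a random hash and random signs, and the two sketches are combined by an element-wise product in the FFT domain, which corresponds to circular convolution. The function names, the toy dimensions, and the random seed are illustrative assumptions and are not taken from the model's actual implementation. The helper sketch_of_outer_product is included only to check, on small inputs, that the result equals the count sketch of the explicit outer product.

```python
import numpy as np


def make_sketch_params(n, d, rng):
    """Draw a random hash h: [n] -> [d] and random signs s: [n] -> {-1, +1}."""
    h = rng.integers(0, d, size=n)
    s = rng.choice(np.array([-1.0, 1.0]), size=n)
    return h, s


def count_sketch(v, h, s, d):
    """Project v to d dimensions: out[h[i]] += s[i] * v[i]."""
    out = np.zeros(d)
    np.add.at(out, h, s * v)  # unbuffered add, so hash collisions accumulate
    return out


def mcb_pool(x, q, params_x, params_q, d):
    """Compact bilinear pooling of an image vector x and a question vector q.

    The two count sketches are circularly convolved via the FFT.
    """
    hx, sx = params_x
    hq, sq = params_q
    fx = np.fft.rfft(count_sketch(x, hx, sx, d))
    fq = np.fft.rfft(count_sketch(q, hq, sq, d))
    return np.fft.irfft(fx * fq, n=d)


def sketch_of_outer_product(x, q, params_x, params_q, d):
    """Reference: count sketch of the explicit outer product (small inputs only).

    Uses the combined hash (hx[i] + hq[j]) mod d and the sign sx[i] * sq[j].
    """
    hx, sx = params_x
    hq, sq = params_q
    h = (hx[:, None] + hq[None, :]) % d
    s = sx[:, None] * sq[None, :]
    out = np.zeros(d)
    np.add.at(out, h.ravel(), (s * np.outer(x, q)).ravel())
    return out


if __name__ == "__main__":
    rng = np.random.default_rng(0)  # seed chosen arbitrarily for the example
    d = 16                          # toy output dimension
    x = rng.standard_normal(8)      # stand-in for an image feature vector
    q = rng.standard_normal(6)      # stand-in for a question feature vector
    px = make_sketch_params(x.size, d, rng)
    pq = make_sketch_params(q.size, d, rng)
    pooled = mcb_pool(x, q, px, pq, d)
    reference = sketch_of_outer_product(x, q, px, pq, d)
    print(np.allclose(pooled, reference))
```

The cost of mcb_pool grows with the sketch size d rather than with the product of the two input dimensions, which is why the explicit outer product never has to be formed. In an attention model of the kind described in the abstract, a pooling step of this form would be applied once per spatial location to score attention and once more to fuse the attended image vector with the question vector.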