Abstract: Machines are being used for a steadily growing range of tasks in society. Giving machines perception could let them perform a wide variety of tasks, including very complex ones such as elderly care. Machine perception requires that machines understand both their environment and their interlocutor's intention. Deep learning has the potential to improve human-machine interaction because its ability to learn features can help machines develop perception. With perception, machines could respond more smoothly, substantially improving the user experience.

Image captioning is the process of generating a textual description of an image. It is a fundamental task in Deep Learning with a wide range of uses, and a popular research area in Artificial Intelligence because it combines two major fields of Artificial Intelligence, Deep Learning and Natural Language Processing. This paper presents a model that combines Natural Language Processing modules (GloVe embeddings and an LSTM) with Deep Learning (feature extraction from images) to generate a sentence describing an image. The model is paired with a function that generates facts based on the primary feature in the image. Given a training image, the model is trained to maximize the likelihood of the target description sentence. The model has been deployed with Streamlit and is hosted on the web.
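
The training objective mentioned above can be written out as a short sketch. The notation is an assumption made here for illustration and is not taken from the paper: $I$ is a training image, $S = (S_1, \dots, S_T)$ is its target description of $T$ words, and $\theta$ denotes the model parameters. Under the usual formulation for LSTM caption decoders, the likelihood of the sentence factorizes word by word through the chain rule, and training sums the log-likelihood over all image-sentence pairs:

```latex
\log p(S \mid I; \theta) = \sum_{t=1}^{T} \log p\left(S_t \mid I, S_1, \dots, S_{t-1}; \theta\right),
\qquad
\theta^{*} = \arg\max_{\theta} \sum_{(I, S)} \log p(S \mid I; \theta)
```

In practice this is typically equivalent to minimizing the per-word cross-entropy between the predicted next-word distribution and the actual next word of the target sentence.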

Keywords: Deep Learning, Artificial Intelligence, Natural Language Processing, Image Captioning, Streamlit.


DOI: 10.17148/IJARCCE.2022.115208
