Abstract: This project develops a deep learning-based text classification system that predicts the domain of a given article using the 20 Newsgroups dataset, which consists of newsgroup articles categorized into 20 topics. The goal is to classify articles into broader domains such as 'Technology,' 'Sports,' 'Politics,' and 'Religion' based on their content. The model uses a long short-term memory (LSTM) network, a type of recurrent neural network (RNN), because it is well suited to sequential data such as text and can capture long-term dependencies in the content. The data is first preprocessed by tokenizing the text, padding sequences to a uniform length, and one-hot encoding the target labels. The LSTM network is then trained to recognize patterns and features in the text and to map each article to a predefined category. The model is evaluated in terms of accuracy, precision, recall, and F1-score, and the batch size and number of epochs are tuned to improve accuracy. Once trained, the model predicts the category of an unseen article, and the predicted category is mapped to its corresponding domain using a predefined dictionary. The system also saves the trained model, tokenizer, and label encoder so that they can be reloaded for later predictions. This text classification system can be applied in areas such as news aggregation, content categorization, and information retrieval, where articles must be sorted automatically into relevant domains. In addition, the project explores how LSTM networks can improve text classification in domains with large volumes of unstructured text, contributing to real-world applications of NLP and deep learning.
Keywords: Text Classification, Deep Learning, LSTM, 20 Newsgroups Dataset, Recurrent Neural Networks (RNN), Content Categorization, Tokenization, Sequence Padding.
DOI: 10.17148/IJARCCE.2024.131260
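
The preprocessing and domain-mapping steps summarized in the abstract can be illustrated with a minimal, dependency-free sketch. It is not the project's implementation: the helper names (`build_vocab`, `texts_to_padded`, `one_hot`, `predicted_domain`), the reserved ids, the choice of left-padding, the sequence length, and the entries in `DOMAIN_MAP` are all assumptions made for illustration, and a real pipeline would more likely rely on a deep learning library's tokenizer and padding utilities.

```python
from collections import Counter

PAD_ID = 0  # assumed reserved id for padding
OOV_ID = 1  # assumed reserved id for out-of-vocabulary words


def build_vocab(texts, max_words=10000):
    """Map the most frequent words to integer ids (ids 0 and 1 are reserved)."""
    counts = Counter(word for text in texts for word in text.lower().split())
    return {word: i + 2 for i, (word, _) in enumerate(counts.most_common(max_words))}


def texts_to_padded(texts, vocab, max_len):
    """Convert texts to id sequences: keep the first max_len tokens, left-pad the rest."""
    padded = []
    for text in texts:
        ids = [vocab.get(word, OOV_ID) for word in text.lower().split()][:max_len]
        padded.append([PAD_ID] * (max_len - len(ids)) + ids)
    return padded


def one_hot(labels, classes):
    """One-hot encode each label against a fixed, ordered class list."""
    index = {name: i for i, name in enumerate(classes)}
    return [[1 if index[label] == i else 0 for i in range(len(classes))]
            for label in labels]


# Hypothetical newsgroup-to-domain dictionary (illustrative entries only).
DOMAIN_MAP = {
    "comp.graphics": "Technology",
    "rec.sport.hockey": "Sports",
    "talk.politics.misc": "Politics",
    "soc.religion.christian": "Religion",
}


def predicted_domain(scores, classes):
    """Pick the highest-scoring class and map it to its broader domain."""
    best = max(range(len(scores)), key=scores.__getitem__)
    return DOMAIN_MAP.get(classes[best], "Unknown")


# Toy inputs standing in for dataset articles and their newsgroup labels.
texts = ["the team won the hockey game", "new graphics card released"]
labels = ["rec.sport.hockey", "comp.graphics"]
classes = sorted(DOMAIN_MAP)

vocab = build_vocab(texts)
X = texts_to_padded(texts, vocab, max_len=8)  # uniform-length integer sequences
y = one_hot(labels, classes)                  # one-hot target vectors
```

In this sketch, `X` and `y` stand in for the inputs and targets an LSTM classifier would be trained on, and `predicted_domain` would be applied to the model's output scores for an unseen article.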