Abstract: Automatic text classification is an important natural language processing (NLP) task, with applications in sentiment analysis, data organization, and spam filtering. Traditional methods based on bag-of-words (BOW) or TF-IDF representations often struggle to capture relationships between words. This limitation can lead to misclassification, especially for short or ambiguous texts. This work exploits the synergy of word embedding techniques with gradient boosting for text classification. Word2Vec converts each word into a numeric vector that captures semantic similarities and relationships. By feeding these vectors to XGBoost, the model can use this rich semantic information to predict categories. Because Word2Vec captures relationships between words, the model can understand context and distinguish between word senses: the vectors for “king” and “queen” are similar, while those for “king” and “bank” are far apart. This improves classification accuracy compared to traditional methods. The combination of Word2Vec and XGBoost also handles noisy or incomplete text better than traditional methods: Word2Vec’s dense representations reduce the impact of misspellings or inconsistent content, increasing robustness in real-world applications. Additionally, XGBoost’s ability to handle missing values and to focus on the most important features improves model interpretability. The framework can be extended to multiple classification tasks, making it adaptable to a wide range of text challenges. Finally, XGBoost’s scalability ensures that the method can be applied effectively to large datasets without sacrificing performance.
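As a minimal sketch of the pipeline the abstract describes (the tiny embedding table and all names below are illustrative assumptions, not taken from the paper): each document is represented by the average of its words' Word2Vec-style embeddings, and the resulting dense feature vectors would then be passed to a gradient-boosted classifier such as `xgboost.XGBClassifier`. In practice the embeddings would come from a trained gensim Word2Vec model (`model.wv[word]`); here a hand-made table stands in so the example is self-contained.

```python
# Hypothetical embedding table standing in for a trained Word2Vec model.
# Semantically close words ("king", "queen") get close vectors; an
# unrelated word ("bank") points in a different direction.
EMBEDDINGS = {
    "king":  [0.90, 0.80, 0.10],
    "queen": [0.85, 0.75, 0.15],
    "bank":  [0.10, 0.20, 0.90],
    "money": [0.15, 0.25, 0.85],
}

def doc_vector(text, dim=3):
    """Average the embeddings of known words; zeros if none are known."""
    vectors = [EMBEDDINGS[w] for w in text.lower().split() if w in EMBEDDINGS]
    if not vectors:
        return [0.0] * dim
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sum(x * x for x in a) ** 0.5
    norm_b = sum(x * x for x in b) ** 0.5
    return dot / (norm_a * norm_b)

# Dense document features like these are what would be fed to XGBoost,
# e.g. xgboost.XGBClassifier().fit(X, y) with X a matrix of doc vectors.
royal = doc_vector("the king and queen")
finance = doc_vector("bank money")
print(cosine(EMBEDDINGS["king"], EMBEDDINGS["queen"]))  # high similarity
print(cosine(EMBEDDINGS["king"], EMBEDDINGS["bank"]))   # low similarity
```

This mirrors the “king”/“queen” versus “king”/“bank” contrast in the abstract: averaging embeddings yields the dense, noise-tolerant document features that the classifier consumes.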
Index Terms: XGBoost, Word2Vec, text categorization, semantic relationships, NLP, gradient boosting, word embeddings, accuracy, machine learning, document classification.
DOI: 10.17148/IJARCCE.2025.14459