Hybrid Embedding Model for Document Classification

Ranjana S. Chakrasali; Chandana A. Athreyesa; K. Shridevi B. Adiga; Vathsala; Vishnupriya

doi:10.17148/IJARCCE.2026.155165

Volume 15, Issue 5, May 2026

Abstract: Managing large collections of digital documents has become increasingly difficult in academic and professional environments. Files such as research papers, reports, PDFs, and project documents are often stored without proper organization, which makes retrieval slow and inefficient. This work proposes a hybrid document classification framework that combines TF-IDF statistical features with contextual embeddings generated by BERT. The combined representation lets the model capture both salient keywords and the semantic meaning of a document. A lightweight classification layer assigns uploaded files to categories such as Business, Politics, Sports, Health, and Technology. In addition, a rule-based file extension classifier is integrated to improve efficiency for commonly identifiable file types. A Flask-based web interface lets users upload documents and automatically organizes them into category folders. Experimental evaluation on the BBC News dataset shows that the proposed hybrid model outperforms standalone TF-IDF and BERT models in classification accuracy and Macro F1-score.
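
The two ideas in the abstract, feature fusion and extension-based routing, can be illustrated with a minimal standard-library sketch. It is not the authors' implementation. It assumes that the fusion is a plain concatenation of L2-normalized vectors (the abstract does not state the fusion method), that the embedding is supplied by the caller (for example, a BERT sentence vector computed elsewhere), and that the extension rules and their labels are placeholders. The names `tfidf_matrix`, `fuse`, `route`, and `EXTENSION_RULES` are hypothetical.

```python
import math
from collections import Counter
from pathlib import Path

# Hypothetical extension rules; the paper's actual mapping is not given.
EXTENSION_RULES = {".csv": "Spreadsheets", ".py": "Code", ".jpg": "Images"}


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def tfidf_matrix(docs):
    """Return (vocabulary, rows) of L2-normalized TF-IDF weights."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)
    # Document frequency: number of documents containing each token.
    doc_freq = Counter(tok for toks in tokenized for tok in set(toks))
    # Smoothed inverse document frequency.
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocab}
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        rows.append(l2_normalize([counts[t] * idf[t] for t in vocab]))
    return vocab, rows


def fuse(tfidf_vec, embedding):
    """Assumed fusion: concatenate the normalized sparse and dense parts."""
    return l2_normalize(tfidf_vec) + l2_normalize(embedding)


def route(filename, text, classify_text):
    """Apply an extension rule if one matches, else classify the text."""
    label = EXTENSION_RULES.get(Path(filename).suffix.lower())
    return label if label is not None else classify_text(text)


if __name__ == "__main__":
    docs = ["stocks fall", "team wins match"]
    vocab, rows = tfidf_matrix(docs)
    toy_embedding = [3.0, 4.0]  # stand-in for a real BERT vector
    fused = fuse(rows[0], toy_embedding)
    print(len(vocab), len(fused))
    print(route("photo.JPG", "", lambda text: "Sports"))
```

Normalizing each part before concatenation keeps either representation from dominating the fused vector purely through scale. In a real pipeline the fused vector would be passed to the classification layer, and `classify_text` would wrap that model.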

Keywords: document classification, hybrid embedding, TF-IDF, BERT, natural language processing, feature fusion, Flask, text categorization

How to Cite:

[1] Ranjana S. Chakrasali, Chandana A. Athreyesa, K. Shridevi B. Adiga, Vathsala, Vishnupriya, “Hybrid Embedding Model for Document Classification,” International Journal of Advanced Research in Computer and Communication Engineering (IJARCCE), DOI: 10.17148/IJARCCE.2026.155165

This work is licensed under a Creative Commons Attribution 4.0 International License.