International Journal of Advanced Research in Computer and Communication Engineering (IJARCCE): a monthly peer-reviewed and refereed journal
ISSN Online 2278-1021 | ISSN Print 2319-5940 | Since 2012
Volume 15, Issue 5, May 2026

A Modular Data Deduplication Framework with Support for Multi-Format Document Analysis

Shreya Modi, Sreenivasa M, Shridevi, Shreya Y.P, Trisha Prameela Y

Abstract: Data deduplication is a critical technique used to reduce repeated storage of documents in digital systems. In many organizational and academic environments, identical files or portions of files are stored multiple times, leading to increased storage consumption and management complexity. This research presents a comprehensive document-based deduplication framework that provides native support for PDF, DOCX, and plain text file formats. The proposed system processes uploaded documents by extracting textual content and systematically dividing it into smaller, manageable chunks. A cryptographic hash-based comparison methodology is employed to determine whether content segments already exist within the storage repository. When duplicate content is identified, the system maintains reference pointers rather than storing redundant copies, thereby achieving significant storage optimization. Experimental evaluation demonstrates that the chunk-level approach successfully identifies partial duplicates that would be missed by traditional file-level comparison methods. The framework is designed for practical deployment in academic institutions and organizational document management systems where efficient handling of duplicate content is essential.

Keywords: Data deduplication, document analysis, hash-based comparison, chunk-level processing, storage optimization, multi-format support.
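
The pipeline outlined in the abstract (chunk the extracted text, hash each chunk, store a reference when the hash is already known) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the `ChunkStore` class, the fixed word-count chunking, the choice of SHA-256, and the in-memory dictionaries are all assumptions made for illustration, and the text is assumed to have already been extracted from the PDF, DOCX, or plain text source.

```python
import hashlib


class ChunkStore:
    """Hypothetical in-memory store: each unique chunk is kept once, keyed by its hash."""

    def __init__(self, chunk_words=50):
        self.chunk_words = chunk_words  # assumed chunk size, measured in words
        self.chunks = {}      # SHA-256 hex digest -> chunk text (stored once)
        self.documents = {}   # document name -> ordered list of chunk digests

    def _split(self, text):
        # Fixed-size chunking by word count (an assumption, not from the paper).
        words = text.split()
        for i in range(0, len(words), self.chunk_words):
            yield " ".join(words[i:i + self.chunk_words])

    def add_document(self, name, text):
        """Store a document and return (new_chunks, duplicate_chunks)."""
        new, dup = 0, 0
        refs = []
        for chunk in self._split(text):
            digest = hashlib.sha256(chunk.encode("utf-8")).hexdigest()
            if digest in self.chunks:
                dup += 1                      # already stored: keep only a reference
            else:
                self.chunks[digest] = chunk   # first occurrence: store the content
                new += 1
            refs.append(digest)
        self.documents[name] = refs
        return new, dup

    def get_document(self, name):
        # Rebuild the (whitespace-normalized) text from the stored references.
        return " ".join(self.chunks[d] for d in self.documents[name])


store = ChunkStore(chunk_words=4)
text_a = "one two three four five six seven eight"
text_b = "one two three four nine ten eleven twelve"
result_a = store.add_document("a.txt", text_a)
result_b = store.add_document("b.txt", text_b)
```

In this example the two documents differ as whole files, so a file-level hash would treat them as unrelated, yet their first four-word chunk is shared and is stored only once. Fixed-size chunking is the simplest choice here; it loses alignment when text is inserted near the start of a document, which is one reason a real system might prefer a different chunking strategy.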

How to Cite:

[1] Shreya Modi, Sreenivasa M, Shridevi, Shreya Y.P, Trisha Prameela Y, “A Modular Data Deduplication Framework with Support for Multi-Format Document Analysis,” International Journal of Advanced Research in Computer and Communication Engineering (IJARCCE), DOI: 10.17148/IJARCCE.2026.155154

This work is licensed under a Creative Commons Attribution 4.0 International License.