An Artificial Intelligence-Driven Framework for Text Similarity Measurement and Plagiarism Detection Using Hybrid Lexical and Semantic Analysis

DIDDE PRAVEEN KUMAR; A.N. RAMA MANI*

doi:10.17148/IJARCCE.2026.155306

Volume 15, Issue 5, May 2026

Abstract: The proliferation of digital text and the ease of electronic copying have made academic and professional plagiarism a pervasive concern, motivating the need for detection tools that go beyond superficial string comparison. Conventional plagiarism checkers rely heavily on exact or near-exact lexical matching and consequently fail to recognize paraphrased, restructured, or semantically equivalent content. This paper proposes an artificial-intelligence-driven framework that combines lexical and semantic analysis to measure textual similarity and detect plagiarism with improved accuracy. The system couples classical term-frequency and n-gram representations with contextual embeddings produced by transformer-based language models, and fuses the two signals into a single interpretable similarity score. Candidate sources are retrieved efficiently from a reference corpus using an approximate nearest-neighbour vector index, and matched passages are highlighted in a structured report. The backend is implemented in Python, exposing services through a lightweight web framework, while a Node.js client provides document submission and report visualization. Experimental evaluation on a curated dataset of original and manipulated documents shows that the proposed fusion approach attains a precision of 0.94, a recall of 0.92, and an F1-score of 0.93, outperforming string-matching, term-frequency, and embedding-only baselines, and achieving an area under the ROC curve of 0.96. The principal contributions are a hybrid similarity-scoring methodology, an efficient retrieval-and-reporting pipeline, and a comparative empirical analysis demonstrating that semantic augmentation substantially improves the detection of disguised plagiarism.
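
The fusion of lexical and semantic signals described in the abstract can be illustrated with a minimal sketch. The abstract does not state how the two signals are combined, so the weighted linear fusion below, the weight `alpha`, and all function names (`ngram_counts`, `lexical_similarity`, `dense_cosine`, `hybrid_score`) are assumptions made for illustration, not the authors' method. Raw n-gram counts stand in for the term-frequency representation, and the dense vectors are hand-written placeholders for what a transformer encoder would produce.

```python
import math
from collections import Counter


def ngram_counts(text, n):
    """Count word n-grams in lower-cased, whitespace-tokenised text."""
    tokens = text.lower().split()
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def sparse_cosine(a, b):
    """Cosine similarity between two sparse count vectors (Counter objects)."""
    dot = sum(count * b.get(gram, 0) for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def lexical_similarity(a, b):
    """Average of unigram and bigram cosine similarity (an assumed lexical signal)."""
    uni = sparse_cosine(ngram_counts(a, 1), ngram_counts(b, 1))
    bi = sparse_cosine(ngram_counts(a, 2), ngram_counts(b, 2))
    return 0.5 * (uni + bi)


def dense_cosine(u, v):
    """Cosine similarity between two dense vectors, e.g. sentence embeddings."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def hybrid_score(lexical, semantic, alpha=0.5):
    """Assumed fusion rule: alpha * lexical + (1 - alpha) * semantic."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * lexical + (1.0 - alpha) * semantic


original = "the model detects copied passages in student essays"
paraphrase = "copied passages in student essays are detected by the model"
lex = lexical_similarity(original, paraphrase)
# Placeholder vectors; real ones would come from a transformer encoder.
sem = dense_cosine([0.8, 0.1, 0.6], [0.7, 0.2, 0.6])
print(round(hybrid_score(lex, sem, alpha=0.4), 3))
```

With a linear rule of this kind, each component's contribution to the final score can be read off directly, which is one way a fused score might remain interpretable. A lower `alpha` shifts weight toward the semantic signal, which matters most for paraphrased text where n-gram overlap is low.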

Keywords: Plagiarism detection; Text similarity; Natural language processing; Transformer embeddings; Semantic analysis; TF-IDF; Information retrieval; Machine learning

How to Cite:

[1] DIDDE PRAVEEN KUMAR, A.N. RAMA MANI*, “An Artificial Intelligence-Driven Framework for Text Similarity Measurement and Plagiarism Detection Using Hybrid Lexical and Semantic Analysis,” International Journal of Advanced Research in Computer and Communication Engineering (IJARCCE), vol. 15, no. 5, May 2026. DOI: 10.17148/IJARCCE.2026.155306

This work is licensed under a Creative Commons Attribution 4.0 International License.