International Journal of Advanced Research in Computer and Communication Engineering
A monthly peer-reviewed and refereed journal
ISSN Online 2278-1021 | ISSN Print 2319-5940 | Since 2012
IJARCCE adheres to the suggestive parameters outlined by the University Grants Commission (UGC) for peer-reviewed journals, upholding high standards of research quality, ethical publishing, and academic excellence.
Volume 15, Issue 5, May 2026

An Artificial Intelligence-Driven Framework for Text Similarity Measurement and Plagiarism Detection Using Hybrid Lexical and Semantic Analysis

DIDDE PRAVEEN KUMAR, A.N. RAMA MANI*

Abstract: The proliferation of digital text and the ease of electronic copying have made academic and professional plagiarism a pervasive concern, motivating the need for detection tools that go beyond superficial string comparison. Conventional plagiarism checkers rely heavily on exact or near-exact lexical matching and consequently fail to recognize paraphrased, restructured, or semantically equivalent content. This paper proposes an artificial-intelligence-driven framework that combines lexical and semantic analysis to measure textual similarity and detect plagiarism with improved accuracy. The system couples classical term-frequency and n-gram representations with contextual embeddings produced by transformer-based language models, and fuses the two signals into a single interpretable similarity score. Candidate sources are retrieved efficiently from a reference corpus using an approximate nearest-neighbour vector index, and matched passages are highlighted in a structured report. The backend is implemented in Python, exposing services through a lightweight web framework, while a Node.js client provides document submission and report visualization. Experimental evaluation on a curated dataset of original and manipulated documents shows that the proposed fusion approach attains a precision of 0.94, a recall of 0.92, and an F1-score of 0.93, outperforming string-matching, term-frequency, and embedding-only baselines, and achieving an area under the ROC curve of 0.96. The principal contributions are a hybrid similarity-scoring methodology, an efficient retrieval-and-reporting pipeline, and a comparative empirical analysis demonstrating that semantic augmentation substantially improves the detection of disguised plagiarism.
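The abstract does not give the fusion formula, so the following is only a minimal sketch of how a lexical signal and a semantic signal might be combined into one score. It assumes a simple weighted average: the function names, the 0.4/0.6 weights, and the equal split between n-gram overlap and term-frequency cosine are all hypothetical and are not taken from the paper. Embeddings are passed in as plain lists of floats, assumed to come from some transformer-based encoder that is not shown here.

```python
import math
import re
from collections import Counter


def tokens(text):
    """Lowercase alphanumeric word tokens (a deliberately simple tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def ngram_jaccard(a, b, n=3):
    """Lexical signal 1: Jaccard overlap of word n-gram sets."""
    def grams(text):
        t = tokens(text)
        return {tuple(t[i:i + n]) for i in range(len(t) - n + 1)}

    ga, gb = grams(a), grams(b)
    if not ga or not gb:
        return 0.0
    return len(ga & gb) / len(ga | gb)


def tf_cosine(a, b):
    """Lexical signal 2: cosine similarity of raw term-frequency vectors."""
    ca, cb = Counter(tokens(a)), Counter(tokens(b))
    dot = sum(ca[w] * cb[w] for w in ca.keys() & cb.keys())
    norm_a = math.sqrt(sum(v * v for v in ca.values()))
    norm_b = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def cosine(u, v):
    """Cosine similarity of two dense vectors of equal length."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def hybrid_score(a, b, emb_a, emb_b, w_lex=0.4, w_sem=0.6):
    """Weighted fusion of lexical and semantic similarity.

    Assumption: the weights and the 50/50 lexical split are illustrative
    placeholders, not values reported in the paper.
    """
    lexical = 0.5 * ngram_jaccard(a, b) + 0.5 * tf_cosine(a, b)
    semantic = max(0.0, cosine(emb_a, emb_b))  # clamp negative cosines to 0
    return w_lex * lexical + w_sem * semantic
```

With weights that sum to 1, the score stays in the range 0 to 1, and a paraphrase with little word overlap can still score well when its embedding lies close to the source's. As a side check on the reported metrics, F1 = 2PR / (P + R) with P = 0.94 and R = 0.92 gives approximately 0.93, which agrees with the stated F1-score.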

Keywords: Plagiarism detection; Text similarity; Natural language processing; Transformer embeddings; Semantic analysis; TF-IDF; Information retrieval; Machine learning

How to Cite:

[1] DIDDE PRAVEEN KUMAR, A.N. RAMA MANI*, “An Artificial Intelligence-Driven Framework for Text Similarity Measurement and Plagiarism Detection Using Hybrid Lexical and Semantic Analysis,” International Journal of Advanced Research in Computer and Communication Engineering (IJARCCE), DOI: 10.17148/IJARCCE.2026.155306

This work is licensed under a Creative Commons Attribution 4.0 International License.