Abstract: Plagiarism poses a significant challenge in academic and professional settings, requiring robust and efficient methods for detection. This study presents an approach to plagiarism detection based on Machine Learning (ML) techniques. The proposed system uses a diverse dataset containing both original and plagiarized documents and applies feature extraction methods such as TF-IDF and word embeddings. The pre-processing phase cleans and standardizes the text data, while feature extraction transforms documents into numerical representations suitable for ML algorithms. Several ML models, including logistic regression and neural networks, are explored for their effectiveness in the binary classification task. The system is trained on labeled datasets to distinguish between original and plagiarized content. Extensive evaluations are conducted on the testing dataset, quantifying the model's accuracy, precision, recall, and F1-score. The study also investigates the impact of different feature extraction techniques on overall performance. The implementation incorporates real-world considerations, including the identification of different forms of plagiarism, such as copy-pasting and paraphrasing. The system's adaptability to diverse domains and sources is emphasized, and scalability concerns are addressed to ensure effective detection in varied contexts.
Keywords: Paraphrase recognition, passage-level plagiarism detection, support vector machine.
DOI: 10.17148/IJARCCE.2024.134107
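To make the feature extraction step described in the abstract concrete, the following is a minimal sketch of TF-IDF vectorization in plain Python. It is an illustration under stated assumptions, not the system evaluated in this study: the function names (`tokenize`, `tfidf_vectors`, `cosine`), the smoothed IDF formula, the example sentences, and the use of cosine similarity as a candidate input feature for a classifier are all hypothetical choices made for the example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Pre-processing: lowercase the text and keep alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(documents):
    """Return one sparse TF-IDF vector (dict: term -> weight) per document.

    Assumption: a smoothed IDF, log((1 + N) / (1 + df)) + 1, is used here;
    the paper does not state which TF-IDF variant it applies.
    """
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1  # guard against empty documents
        vectors.append({
            term: (count / total)
            * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical example documents (not taken from the study's dataset).
source = "Machine learning models learn patterns from labeled training data."
copied = "Machine learning models learn patterns from labeled training data."
paraphrased = "Models in machine learning pick up patterns from labeled examples."
unrelated = "The river flooded the valley after three days of heavy rain."

vecs = tfidf_vectors([source, copied, paraphrased, unrelated])
sim_copy = cosine(vecs[0], vecs[1])       # verbatim copy of the source
sim_para = cosine(vecs[0], vecs[2])       # paraphrase sharing some vocabulary
sim_unrelated = cosine(vecs[0], vecs[3])  # no shared vocabulary
print(sim_copy, sim_para, sim_unrelated)
```

In this sketch the verbatim copy scores highest, the paraphrase scores lower because it shares only part of the vocabulary, and the unrelated sentence scores zero. That gap is one reason purely lexical features such as TF-IDF may be paired with word embeddings when paraphrased plagiarism has to be detected; a similarity score like this could serve as one input feature to a classifier such as logistic regression.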