Event: "Plagiarism Detection: A focus on the Intrinsic Approach and the Evaluation on Arabic Language" By Imene BENSALEM

Information

"Plagiarism Detection: A focus on the Intrinsic Approach and the Evaluation on Arabic Language" By Imene BENSALEM

Date: 29 February 2020

Place: Thesis defense room, Faculty of NTIC

Key words

Natural language processing; Intrinsic plagiarism detection; Arabic plagiarism detection; Character n-grams; Stylistic analysis; Evaluation corpora

Description

Defended on 29/02/2020 before the following scientific committee members:

President: Prof. Ramdane Maamri, Constantine 2 University

Local Advisor: Prof. Salim Chikhi, Constantine 2 University

Invited External Advisor: Prof. Paolo Rosso, Universitat Politècnica de València (Spain)

Examiners: Prof. Alberto Barrón-Cedeño, Università di Bologna (Italy)
Prof. Yacine Lafifi, Guelma University
Dr. Sihem Mostefai, Constantine 2 University

ABSTRACT OF THE THESIS

This thesis addresses two major topics: plagiarism detection in Arabic documents, and intrinsic plagiarism detection, which identifies plagiarism from changes in writing style within the suspicious document. The intrinsic approach is an alternative to the text-matching approach, notably when the plagiarism source is unavailable. Our key contributions in these two areas are, first, the development of Arabic corpora that allow plagiarism detection software to be evaluated on this language and, second, the development of a language-independent intrinsic plagiarism detection method that exploits character n-grams in a machine learning approach while avoiding the curse of dimensionality. Representing texts with character n-grams is one of the most successful text modelling approaches in some stylistic analysis applications. However, studies on the best character n-grams in the context of intrinsic plagiarism detection are almost non-existent. Hence, our third key contribution is an attempt to narrow this gap by investigating which character n-grams, in terms of their frequency and length, are best suited to detecting plagiarism intrinsically. We carried out our experiments on standardised English corpora and on the Arabic corpora we developed, using both our method and one of the most prominent intrinsic plagiarism detection methods. The findings of our analysis can be exploited by future intrinsic plagiarism detection methods that use character n-grams.
In addition to the above-mentioned technical contributions, we provide the reader with comprehensive and critical surveys of the literature on Arabic plagiarism detection and on intrinsic plagiarism detection, which were lacking for both topics.
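
For readers unfamiliar with the idea, the general principle of intrinsic detection with character n-grams can be illustrated with a small sketch. This is not the method developed in the thesis (which relies on machine learning); it is a simple outlier heuristic, assumed here purely for illustration, that compares the character n-gram profile of each sliding window with the profile of the whole document. The function names, window size, step and threshold are all hypothetical choices.

```python
from collections import Counter
import math


def char_ngrams(text, n=3):
    """Count the overlapping character n-grams of a text."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine_distance(a, b):
    """Return 1 - cosine similarity between two n-gram count profiles."""
    dot = sum(a[g] * b[g] for g in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # an empty profile shares nothing with any other profile
    return 1.0 - dot / (norm_a * norm_b)


def style_outliers(text, n=3, window=200, step=100, z_threshold=1.5):
    """Return (start, end) character spans whose style deviates from the document.

    Illustrative heuristic only: a window is flagged when its distance to the
    whole-document profile lies more than z_threshold standard deviations
    above the mean distance of all windows.
    """
    doc_profile = char_ngrams(text, n)
    spans, distances = [], []
    for start in range(0, max(len(text) - window, 0) + 1, step):
        chunk = text[start:start + window]
        spans.append((start, start + len(chunk)))
        distances.append(cosine_distance(char_ngrams(chunk, n), doc_profile))
    mean = sum(distances) / len(distances)
    std = math.sqrt(sum((d - mean) ** 2 for d in distances) / len(distances))
    if std == 0:
        return []  # every window looks alike: nothing to flag
    return [s for s, d in zip(spans, distances) if (d - mean) / std > z_threshold]
```

Calling `style_outliers(document_text)` returns the character spans whose n-gram profile stands out from the rest of the document. Because the sketch works on raw characters, it makes no assumption about the language of the text, which is the property that makes character n-grams attractive for language-independent detection.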