Defended on 29/02/2020 before the following scientific committee members:
President Prof. Ramdane Maamri Constantine 2 University
Local Advisor Prof. Salim Chikhi Constantine 2 University
Invited: External Advisor Prof. Paolo Rosso Universitat Politècnica de València (Spain)
Examiners Prof. Alberto Barrón-Cedeño Università di Bologna (Italy)
Prof. Yacine Lafifi Guelma University
Dr. Sihem Mostefai Constantine 2 University
ABSTRACT OF THE THESIS
This thesis deals with two major topics: plagiarism detection in Arabic documents, and intrinsic plagiarism detection, that is, plagiarism detection based on changes of writing style within the suspicious document. The intrinsic approach is an alternative to the text-matching approach, notably when the plagiarism source is unavailable. Our key contributions in these two areas lie, first, in the development of Arabic corpora that allow plagiarism detection software to be evaluated on this language and, second, in the development of a language-independent intrinsic plagiarism detection method that exploits character n-grams in a machine learning approach while avoiding the curse of dimensionality. Representing texts with character n-grams is one of the most successful text modelling approaches in some stylistic analysis applications. However, studies on the best character n-grams in the context of intrinsic plagiarism detection are almost non-existent. Hence, our third key contribution is an attempt to narrow this gap by investigating which character n-grams, in terms of their frequency and length, are best suited to detecting plagiarism intrinsically. We carried out our experiments on standardised English corpora and on the developed Arabic corpora, using both the method we developed and one of the most prominent intrinsic plagiarism detection methods. The findings of our analysis can be exploited by future intrinsic plagiarism detection methods that use character n-grams.
In addition to the above-mentioned technical contributions, we provide the reader with comprehensive and critical surveys of the literature on Arabic plagiarism detection and on intrinsic plagiarism detection; such surveys were lacking for both topics.