"Plagiarism Detection: A focus on the Intrinsic Approach and the Evaluation on Arabic Language" By Imene BENSALEM

Date: 29 February 2020
Place: Thesis defense room, Faculty of NTIC
Organized by team: SCAL
Keywords
Natural language processing; Intrinsic plagiarism detection; Arabic plagiarism detection; Character n-grams; Stylistic analysis; Evaluation corpora

Defended on 29/02/2020 before the scientific committee members:

President                    Prof. Ramdane Maamri           Constantine 2 University
Local Advisor                Prof. Salim Chikhi             Constantine 2 University
Invited: External Advisor    Prof. Paolo Rosso              Universitat Politècnica de València (Spain)
Examiners                    Prof. Alberto Barrón-Cedeño    Università di Bologna (Italy)
                             Prof. Yacine Lafifi            Guelma University
                             Dr. Sihem Mostefai             Constantine 2 University


This thesis deals with two major topics: plagiarism detection in Arabic documents, and plagiarism detection based on writing style changes within the suspicious document, known as intrinsic plagiarism detection. The intrinsic approach is an alternative to the text-matching approach, notably when the source of the plagiarism is unavailable. Our key contributions in these two areas are, first, the development of Arabic corpora that allow plagiarism detection software to be evaluated on this language and, second, the development of a language-independent intrinsic plagiarism detection method that exploits character n-grams in a machine learning approach while avoiding the curse of dimensionality. Representing texts with character n-grams is one of the most successful text modelling approaches for some stylistic analysis applications. However, studies on the best character n-grams in the context of intrinsic plagiarism detection are almost non-existent. Hence, our third key contribution is an attempt to narrow this gap by investigating which character n-grams, in terms of their frequency and length, are best suited to detecting plagiarism intrinsically. We carried out our experiments on standardised English corpora as well as on the Arabic corpora we developed, using both our own method and one of the most prominent intrinsic plagiarism detection methods. The findings of our analysis can be exploited by future intrinsic plagiarism detection methods that use character n-grams.
In addition to the above-mentioned technical contributions, we provide the reader with comprehensive and critical surveys of the literature on Arabic plagiarism detection and on intrinsic plagiarism detection, which were lacking for both topics.
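
To make the idea of detecting style changes with character n-grams more concrete, the following is a minimal illustrative sketch in Python. It is not the method developed in the thesis, nor the prominent method it is compared against: it is a generic, assumed baseline that compares the character n-gram profile of each sliding window with the profile of the whole document and flags windows that deviate strongly. All function names, the cosine distance, and the parameters (`n`, `window`, `step`, `k`) are hypothetical choices made for illustration only.

```python
from collections import Counter
import math


def char_ngrams(text, n=3):
    """Return a Counter of the overlapping character n-grams in text."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine_distance(p, q):
    """Return 1 - cosine similarity between two n-gram count profiles."""
    dot = sum(p[g] * q[g] for g in p.keys() & q.keys())
    norm_p = math.sqrt(sum(v * v for v in p.values()))
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    if norm_p == 0 or norm_q == 0:
        return 1.0  # an empty profile is treated as maximally distant
    return 1.0 - dot / (norm_p * norm_q)


def window_distances(text, n=3, window=200, step=100):
    """Compare each sliding window's profile with the whole-document profile.

    Returns a list of (start_offset, distance) pairs.
    """
    doc_profile = char_ngrams(text, n)
    results = []
    for start in range(0, max(len(text) - window, 0) + 1, step):
        chunk = text[start:start + window]
        results.append((start, cosine_distance(char_ngrams(chunk, n), doc_profile)))
    return results


def flag_outliers(distances, k=1.0):
    """Return start offsets whose distance exceeds mean + k * std (assumed rule)."""
    values = [d for _, d in distances]
    if not values:
        return []
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [start for start, d in distances if d > mean + k * std]
```

A call such as `flag_outliers(window_distances(document_text))` would return the character offsets of windows whose style deviates most from the rest of the document. Because the sketch operates on raw characters, it makes no language-specific assumptions, which is the property that makes character n-grams attractive for both English and Arabic texts.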