Persian Plagdet 2016
A Text Alignment Corpus for Persian Plagiarism Detection
Fatemeh Mashhadirajab, Mehrnoush Shamsfard, Fatemeh Shafiee, Chakaveh Saedi, Razieh Adelkhah
NLP Research Lab, Faculty of Computer Science and Engineering, Shahid Beheshti University
2/29 Outline
- Introduction
- Text Alignment Corpus Construction
- Strategies for Plagiarism Types
- Dataset Statistics
- Conclusions
- References
3/29 Introduction
[Figure: A taxonomy of plagiarism [2]]
4/29 Text Alignment Corpus Construction
- Data Source Preparation
- Document Clustering
- Set of Suspicious and Source Documents
- Source and Suspicious Document Pairs Selection
- Source Documents Segmentation
- Segment Extraction
- Segment Obfuscation
- Obfuscated Segment Insertion
5/29 Text Alignment Corpus Construction: Data Source Preparation
Articles or theses in the fields of computer science and engineering and electrical engineering:
- 4,500 documents from Wikipedia articles
- 1,500 documents from articles and theses available from online stores
Our corpus contains 11,089 documents.
6/29 Text Alignment Corpus Construction: Document Clustering
7/29 Text Alignment Corpus Construction: Set of Suspicious and Source Documents
8/29 Text Alignment Corpus Construction: Set of Suspicious and Source Documents
Suspicious documents and source documents are randomly selected from each cluster.
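The random selection from each cluster could look roughly like the sketch below. It is only an illustration: the function name `split_clusters`, the `suspicious_fraction` parameter, and the fixed seed are assumptions, since the slides do not say how many documents per cluster go to each set.

```python
import random


def split_clusters(clusters, suspicious_fraction=0.5, seed=0):
    """Randomly split each cluster's documents into suspicious and source sets.

    `clusters` maps a cluster id to a list of document ids.
    `suspicious_fraction` is a hypothetical parameter, not a value from the slides.
    """
    rng = random.Random(seed)
    suspicious, source = {}, {}
    for cluster_id, doc_ids in clusters.items():
        shuffled = list(doc_ids)
        rng.shuffle(shuffled)
        cut = int(len(shuffled) * suspicious_fraction)
        suspicious[cluster_id] = shuffled[:cut]
        source[cluster_id] = shuffled[cut:]
    return suspicious, source
```

Keeping the split inside each cluster means every suspicious document has topically related source candidates in the same cluster.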
9/29 Text Alignment Corpus Construction: Source and Suspicious Document Pairs Selection
[Diagram: a suspicious document (susp) and a source document (src) are passed to the Similarity Detection system; the condition shown is "if the similarity < 50%"]
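A minimal sketch of threshold-based pair selection follows. It assumes a pair is kept when its similarity is below 50%, as the diagram condition suggests. The slides do not name the similarity measure, so word-level Jaccard similarity is used here purely as a stand-in, and `select_pairs` is a hypothetical helper.

```python
from itertools import product


def jaccard_similarity(text_a, text_b):
    """Word-set Jaccard similarity; a stand-in for the unspecified similarity system."""
    words_a, words_b = set(text_a.split()), set(text_b.split())
    if not words_a and not words_b:
        return 0.0
    return len(words_a & words_b) / len(words_a | words_b)


def select_pairs(suspicious_docs, source_docs, threshold=0.5,
                 similarity=jaccard_similarity):
    """Return (susp_id, src_id) pairs whose similarity is below `threshold`."""
    pairs = []
    for (susp_id, susp_text), (src_id, src_text) in product(
            suspicious_docs.items(), source_docs.items()):
        if similarity(susp_text, src_text) < threshold:
            pairs.append((susp_id, src_id))
    return pairs
```

Passing the similarity function as a parameter makes it easy to swap in whatever detection system is actually available.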
10/29 Text Alignment Corpus Construction: Source Documents Segmentation
[Diagram: Suspicious Documents, Source Documents, src, Similarity Detection system (if the similarity < 50%)]
11/29 Text Alignment Corpus Construction: Segment Extraction
[Diagram: Suspicious Documents, Source Documents, src, Similarity Detection system (if the similarity < 50%)]
12/29 Text Alignment Corpus Construction: Segment Obfuscation
[Diagram: Suspicious Documents, Source Documents, src, Similarity Detection system (if the similarity < 50%), Segment Obfuscation]
13/29 Text Alignment Corpus Construction: Obfuscated Segment Insertion
[Diagram: Suspicious Documents, Source Documents, src, Similarity Detection system (if the similarity < 50%), Segment Obfuscation, susp]
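The insertion step can be sketched as below. This is an assumption-laden illustration: the slides do not say where in the suspicious document a segment is placed, so the sketch picks a random line boundary. The annotation keys follow the offset and length attributes commonly seen in PAN text alignment annotations; treat the exact names as an assumption too.

```python
import random


def insert_segment(susp_text, obfuscated_segment, source_reference,
                   source_offset, source_length, rng=None):
    """Insert an obfuscated segment into a suspicious document.

    Returns the new text and an annotation recording where the case sits
    in both documents. Insertion at a line boundary is an assumed policy.
    """
    rng = rng or random.Random(0)
    # Candidate positions: start of text, after each newline, end of text.
    boundaries = [0] + [i + 1 for i, ch in enumerate(susp_text) if ch == "\n"]
    boundaries.append(len(susp_text))
    offset = rng.choice(boundaries)
    new_text = susp_text[:offset] + obfuscated_segment + susp_text[offset:]
    annotation = {
        "this_offset": offset,
        "this_length": len(obfuscated_segment),
        "source_reference": source_reference,
        "source_offset": source_offset,
        "source_length": source_length,
    }
    return new_text, annotation
```

Recording character offsets at insertion time gives the ground truth that a text alignment system is later evaluated against.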
14/29 Strategies for Plagiarism Types
- Exact Copy
- Near Copy
- Modified Copy
- Text Manipulation (Paraphrasing)
- Text Manipulation (Summarizing)
- Automatic Translation
- Manual Translation
- Cyclic Translation
- Idea Adoption (semantic-based meaning)
15/29 Strategies for Plagiarism Types: Exact Copy
[Diagram: src, Segment Obfuscation, susp]
16/29 Strategies for Plagiarism Types: Near Copy
- Insertion
- Deletion
- Substitution
- Sentence split or join
[Diagram: src, Segment Obfuscation, susp]
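The four near-copy operations can be illustrated with the sketch below. The helper names, the `change_rate` parameter, and the choice of ASCII punctuation are all assumptions made for the example; the slides only name the operations, not how often or where they are applied.

```python
import random


def edit_words(words, vocabulary, rng, change_rate=0.1):
    """Apply random word-level insertion, deletion, or substitution."""
    result = []
    for word in words:
        if rng.random() < change_rate:
            operation = rng.choice(("insert", "delete", "substitute"))
            if operation == "insert":
                result.extend([word, rng.choice(vocabulary)])
            elif operation == "substitute":
                result.append(rng.choice(vocabulary))
            # "delete": the word is simply dropped
        else:
            result.append(word)
    return result


def join_sentences(first, second):
    """Join two sentences by turning the first full stop into a comma."""
    return first.rstrip(". ") + ", " + second


def split_sentence(sentence, rng):
    """Split one sentence into two at a random word boundary."""
    words = sentence.split()
    if len(words) < 2:
        return [sentence]
    cut = rng.randint(1, len(words) - 1)
    return [" ".join(words[:cut]) + ".", " ".join(words[cut:])]
```

For Persian text the comma and tokenization rules would need adapting; the sketch only shows the shape of each operation.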
17/29 Strategies for Plagiarism Types: Modified Copy
The Persian sentence understanding and generation system introduced by Adelkhah et al. [7]:
- semantic representation (sentence understanding)
- sentence production based on the semantic representation (sentence generation)
[Diagram: src, Segment Obfuscation, susp]
18/29 Strategies for Plagiarism Types: Text Manipulation (Paraphrasing)
The Persian sentence understanding and generation system introduced by Adelkhah et al. [7].
Each word is replaced with a synonym retrieved from FarsNet or FavaNet.
[Diagram: src, Segment Obfuscation, susp]
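Synonym replacement can be sketched as a simple lookup. The code below does not call FarsNet or FavaNet; a plain dictionary with toy English entries stands in for the lexicon, and `replace_with_synonyms` is a hypothetical helper that keeps any word without an entry unchanged.

```python
def replace_with_synonyms(words, synonyms):
    """Replace each word with its first listed synonym, if one exists.

    `synonyms` maps a word to a list of candidate synonyms; it is a
    stand-in for a lexicon lookup, not the FarsNet or FavaNet API.
    """
    replaced = []
    for word in words:
        candidates = synonyms.get(word)
        replaced.append(candidates[0] if candidates else word)
    return replaced


# Toy lexicon for illustration only.
toy_lexicon = {"fast": ["quick", "rapid"], "big": ["large"]}
print(replace_with_synonyms(["a", "fast", "big", "car"], toy_lexicon))
```

A real lexicon returns several senses per word, so picking the first candidate is the simplest possible policy rather than a recommended one.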
19/29 Strategies for Plagiarism Types: Text Manipulation (Summarizing)
Persian summarizer introduced by Shafiee et al. [6].
[Diagram: src, Segment Obfuscation, susp]
20/29 Strategies for Plagiarism Types: Automatic Translation
[Diagram: src, Segment Obfuscation, susp; labels: Google Translate, spell checker (Hunspell), Persian to English, Persian, English]
21/29 Strategies for Plagiarism Types: Manual Translation
[Diagram: src, Segment Obfuscation, susp; labels: Translation, Persian, English]
22/29 Strategies for Plagiarism Types: Cyclic Translation
[Diagram: src, Segment Obfuscation, susp; labels: Google Translate, Hunspell, English, Persian, Google Translate, Negar spell checker]
23/29 Strategies for Plagiarism Types: Idea Adoption (semantic-based meaning)
[Diagram: src, Segment Obfuscation, susp]
24/29 Dataset Statistics
Documents: 11,089
Plagiarism cases: 11,603
Languages: fa
Document purpose
- source documents: 48%
- suspicious documents with plagiarism: 28%
- suspicious documents w/o plagiarism: 24%
Document length
- short (<10 pages): 64%
- medium (10-100 pages): 35%
- long (>100 pages): 1%
Plagiarism per document
- hardly (<20%): 25%
- medium (20%-50%): 20%
- much (50%-80%): 26%
- entirely (>80%): 29%
Case length
- short (<1k characters): 37%
- medium (1k-3k characters): 55%
- long (>3k characters): 8%
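The case-length buckets can be computed with a small helper like the one below. It is a sketch: the function names are hypothetical, "1k" is read as 1,000 characters, and placing exactly 1,000 and 3,000 characters in the medium bucket is an assumption, since the slide does not define the boundaries.

```python
from collections import Counter


def case_length_bucket(n_chars):
    """Map a plagiarism case length in characters to short, medium, or long."""
    if n_chars < 1000:
        return "short"
    if n_chars <= 3000:
        return "medium"
    return "long"


def bucket_shares(case_lengths):
    """Return the rounded percentage of cases in each length bucket."""
    buckets = ("short", "medium", "long")
    total = len(case_lengths)
    if total == 0:
        return {bucket: 0 for bucket in buckets}
    counts = Counter(case_length_bucket(n) for n in case_lengths)
    return {bucket: round(100 * counts[bucket] / total) for bucket in buckets}
```

The same pattern applies to the document-length and plagiarism-per-document rows, with the thresholds swapped.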
25/29 Conclusion
- This article describes a methodology for building a Persian corpus for evaluating plagiarism detection systems.
- The corpus is in PAN format.
- A variety of plagiarism types are created in large volume.
- To produce the corpus, the focus is on simulating different types of plagiarism.
- Different strategies are employed to create obfuscation in each plagiarism category.
26/29 References
1. Asghari, H., Mohtaj, S., Fatemi, O., Faili, H., Potthast, M. and Rosso, P. 2016. Overview of the PAN@FIRE2016 Shared Task on Persian Plagiarism Detection and Text Alignment Corpus Construction. Notebook Papers of FIRE 2016, FIRE-2016, CEUR-WS.org.
2. Alzahrani, M., Salim, N. and Abraham, A. 2012. Understanding plagiarism linguistic patterns, textual features, and detection methods. IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews, vol. 42, no. 2.
3. Potthast, M., Stein, B. et al. 2010. An Evaluation Framework for Plagiarism Detection. Proceedings of the 23rd International Conference on Computational Linguistics, COLING 2010, Beijing. ACL.
4. Shamsfard, M., Kiani, S. and Shahedi, Y. STeP-1: Standard Text Preparation for Persian Language. CAASL3, Third Workshop on Computational Approaches to Arabic Script-based Languages.
5. Makrehchi, M. and Kamel, M. 2004. A fuzzy set approach to extracting keywords from abstracts. North American Fuzzy Information Processing Society, NAFIPS 2003, Banff, Canada.
6. Shafiee, F. and Shamsfard, M. 2015. The automatic Persian summarizer. The 20th Computer Society of Iran computer conference.