Persian Plagdet 2016
A Text Alignment Corpus for Persian Plagiarism Detection
Fatemeh Mashhadirajab, Mehrnoush Shamsfard, Fatemeh Shafiee, Chakaveh Saedi, Razieh Adelkhah
NLP Research Lab, Faculty of Computer Science and Engineering, Shahid Beheshti University

2/29 Outline
• Introduction
• Text Alignment Corpus Construction
• Strategies for Plagiarism Types
• Dataset Statistics
• Conclusions
• References

3/29 Introduction
[Figure: a taxonomy of plagiarism [2]]

4/29 Text Alignment Corpus Construction
• Data Source Preparation
• Documents Clustering
• Set of Suspicious and Source Documents
• Source and Suspicious Document Pairs Selection
• Source Documents Segmentation
• Segment Extraction
• Segment Obfuscation
• Obfuscated Segment Insertion

5/29 Text Alignment Corpus Construction
• Data Source Preparation
  The source documents are articles or theses in the fields of computer science and engineering and electrical engineering:
  o 4,500 documents from Wikipedia articles
  o 1,500 documents from articles and theses available from online stores
  Our corpus contains 11,089 documents.

6/29 Text Alignment Corpus Construction
• Documents Clustering

7/29 Text Alignment Corpus Construction
• Set of Suspicious and Source Documents

8/29 Text Alignment Corpus Construction
• Set of Suspicious and Source Documents
  The suspicious documents and the source documents are randomly selected from each cluster.
  [Figure: two document sets, labeled Suspicious Documents and Source Documents]

9/29 Text Alignment Corpus Construction
• Source and Suspicious Document Pairs Selection
  [Figure: a suspicious document (susp) from the Suspicious Documents set and a source document (src) from the Source Documents set are passed to a similarity detection system; the diagram is annotated with the condition "if the similarity < 50%"]
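
The pair selection step can be sketched roughly as follows. This is a minimal illustration, not the authors' system: the slide does not say which similarity measure is used, so a word-set Jaccard similarity stands in for it, and reading the "similarity < 50%" condition as "keep pairs below the threshold" is an assumption. The names `jaccard_similarity` and `select_pairs` are hypothetical.

```python
def jaccard_similarity(text_a: str, text_b: str) -> float:
    """Word-set Jaccard similarity in [0, 1] (a stand-in measure, assumed)."""
    words_a, words_b = set(text_a.split()), set(text_b.split())
    if not words_a and not words_b:
        return 0.0
    return len(words_a & words_b) / len(words_a | words_b)


def select_pairs(suspicious: dict, sources: dict, threshold: float = 0.5) -> list:
    """Return (susp_id, src_id) pairs whose similarity is below the threshold.

    Assumption: a pair is kept when its similarity is < 50%, mirroring the
    condition shown on the slide.
    """
    pairs = []
    for susp_id, susp_text in suspicious.items():
        for src_id, src_text in sources.items():
            if jaccard_similarity(susp_text, src_text) < threshold:
                pairs.append((susp_id, src_id))
    return pairs


# Toy documents; in the corpus both would come from the same cluster.
suspicious_docs = {"susp-1": "a b c d"}
source_docs = {"src-1": "a b c d", "src-2": "a e f g"}
selected = select_pairs(suspicious_docs, source_docs)
```

Here `src-1` is identical to `susp-1` and is skipped, while `src-2` shares only one word with it and is kept.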

10/29 Text Alignment Corpus Construction
• Source Documents Segmentation
  [Figure: the pipeline diagram from slide 9, at the step where the source document (src) is segmented]

11/29 Text Alignment Corpus Construction
• Segment Extraction
  [Figure: the same pipeline diagram, at the segment extraction step]

12/29 Text Alignment Corpus Construction
• Segment Obfuscation
  [Figure: the same pipeline diagram, extended with a Segment Obfuscation stage]

13/29 Text Alignment Corpus Construction
• Obfuscated Segment Insertion
  [Figure: the same pipeline diagram, with the output of the Segment Obfuscation stage leading to the suspicious document (susp)]
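
The last three steps (extraction, obfuscation, insertion) can be sketched together. This is only an illustration under stated assumptions: the function `insert_segment` and the class `PlagiarismCase` are hypothetical, the obfuscation is passed in as an arbitrary callable, and the offset and length fields are modeled on the character-offset annotations commonly used in PAN-style text alignment corpora rather than taken from the slides.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class PlagiarismCase:
    """Character offsets of one case (field names are an assumption)."""
    source_offset: int
    source_length: int
    this_offset: int
    this_length: int


def insert_segment(
    susp_text: str,
    src_text: str,
    src_offset: int,
    src_length: int,
    obfuscate: Callable[[str], str],
    insert_at: int,
) -> Tuple[str, PlagiarismCase]:
    """Extract a source segment, obfuscate it, and insert it into susp_text."""
    segment = src_text[src_offset:src_offset + src_length]    # segment extraction
    obfuscated = obfuscate(segment)                           # segment obfuscation
    new_text = susp_text[:insert_at] + obfuscated + susp_text[insert_at:]
    case = PlagiarismCase(src_offset, src_length, insert_at, len(obfuscated))
    return new_text, case


# Toy example: upper-casing stands in for a real obfuscation strategy.
new_susp, case = insert_segment("AAAA BBBB", "hello world", 6, 5, str.upper, 5)
```

Recording the offsets at insertion time is what later lets a detector's output be compared against the ground truth.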

14/29 Strategies for Plagiarism Types
• Exact Copy
• Near Copy
• Modified Copy
• Text Manipulation (Paraphrasing)
• Text Manipulation (Summarizing)
• Automatic Translation
• Manual Translation
• Cyclic Translation
• Idea Adoption (semantic-based meaning)

15/29 Strategies for Plagiarism Types
• Exact Copy
  [Figure: src, Segment Obfuscation, susp]

16/29 Strategies for Plagiarism Types
• Near Copy
  Obfuscation operations:
  o insertion
  o deletion
  o substitution
  o sentence split or join
  [Figure: src, Segment Obfuscation, susp]
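
The near-copy operations can be illustrated with a small sketch. The slides name the operations but not how they are chosen or applied, so everything here is assumed for illustration: operations are picked at random at word level, sentences are plain token lists, and `near_copy`, `split_sentence`, and `join_sentences` are hypothetical names.

```python
import random
from typing import List, Tuple


def near_copy(words: List[str], rng: random.Random,
              filler_words: List[str], n_ops: int = 3) -> List[str]:
    """Apply n_ops random word-level edits: insert, delete, or substitute."""
    words = list(words)  # work on a copy
    for _ in range(n_ops):
        op = rng.choice(["insert", "delete", "substitute"])
        if op == "insert":
            words.insert(rng.randrange(len(words) + 1), rng.choice(filler_words))
        elif op == "delete" and len(words) > 1:
            del words[rng.randrange(len(words))]
        elif op == "substitute" and words:
            words[rng.randrange(len(words))] = rng.choice(filler_words)
    return words


def split_sentence(words: List[str], at: int) -> Tuple[List[str], List[str]]:
    """Split one tokenized sentence into two at position `at`."""
    return words[:at], words[at:]


def join_sentences(first: List[str], second: List[str]) -> List[str]:
    """Join two tokenized sentences into one."""
    return first + second


rng = random.Random(0)  # fixed seed so the run is repeatable
obfuscated = near_copy(["a", "b", "c", "d"], rng, ["x"], n_ops=3)
```

A real implementation would need language-aware choices (for example, which words may be inserted where); the sketch only shows the shape of the edits.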

17/29 Strategies for Plagiarism Types
• Modified Copy
  Uses the Persian sentence understanding and generation system introduced by Adelkhah et al. [7]:
  o semantic representation (sentence understanding)
  o sentence production based on the semantic representation (sentence generation)
  [Figure: src, Segment Obfuscation, susp]

18/29 Strategies for Plagiarism Types
• Text Manipulation (Paraphrasing)
  Uses the Persian sentence understanding and generation system introduced by Adelkhah et al. [7].
  Each word is replaced with a synonym retrieved from FarsNet or FavaNet.
  [Figure: src, Segment Obfuscation, susp]
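
The synonym replacement step might look like the sketch below. It does not call FarsNet or FavaNet: a plain dictionary stands in for the lexicon lookup, the toy entries are English placeholders, and always taking the first listed synonym is an assumption made to keep the example deterministic. The function name `paraphrase` is hypothetical.

```python
from typing import Dict, List


def paraphrase(words: List[str], synonyms: Dict[str, List[str]]) -> List[str]:
    """Replace each word with its first listed synonym, if it has one."""
    result = []
    for word in words:
        candidates = synonyms.get(word) or [word]  # keep the word if no synonym
        result.append(candidates[0])
    return result


# Toy lexicon standing in for a FarsNet/FavaNet lookup (assumed structure).
toy_lexicon = {"big": ["large"], "fast": ["quick", "rapid"], "empty": []}
rewritten = paraphrase(["a", "big", "fast", "car"], toy_lexicon)
```

Words missing from the lexicon, or listed with no synonyms, pass through unchanged.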

19/29 Strategies for Plagiarism Types
• Text Manipulation (Summarizing)
  Uses the Persian summarizer introduced by Shafiee et al. [6].
  [Figure: src, Segment Obfuscation, susp]

20/29 Strategies for Plagiarism Types
• Automatic Translation
  [Figure: src, Segment Obfuscation, susp; diagram labels: Google Translate, spell checker (Hunspell), Persian to English, Persian, English]

21/29 Strategies for Plagiarism Types
• Manual Translation
  [Figure: src, Segment Obfuscation (Translation), susp; diagram labels: Persian, English]

22/29 Strategies for Plagiarism Types
• Cyclic Translation
  [Figure: src, Segment Obfuscation, susp; diagram labels: Google Translate (twice), Negar spell checker, Hunspell, English, Persian]
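
Cyclic translation sends a segment through a pivot language and back again. The sketch below shows only that round-trip shape: the translators and spell checkers are passed in as plain callables (no real translation or spell-checking API is called), and where each spell checker sits in the chain is an assumption, since the flattened diagram does not preserve the order. The name `cyclic_translate` is hypothetical.

```python
from typing import Callable

TextFn = Callable[[str], str]


def identity(text: str) -> str:
    """Default no-op spell checker."""
    return text


def cyclic_translate(
    text: str,
    to_pivot: TextFn,
    from_pivot: TextFn,
    check_pivot: TextFn = identity,
    check_final: TextFn = identity,
) -> str:
    """Translate to the pivot language and back, spell-checking each result."""
    pivot_text = check_pivot(to_pivot(text))         # e.g. Persian -> English
    return check_final(from_pivot(pivot_text))       # e.g. English -> Persian


# Toy stand-ins: case changes play the role of the two translation steps.
round_trip = cyclic_translate("Hello World", str.upper, str.lower)
```

Because each translation step loses or changes some wording, the round-tripped text differs from the original while keeping its meaning, which is what makes it a useful obfuscation.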

23/29 Strategies for Plagiarism Types
• Idea Adoption (semantic-based meaning)
  [Figure: src, Segment Obfuscation, susp]

24/29 Dataset Statistics

Documents                               11,089
Plagiarism cases                        11,603
Language                                fa (Persian)

Document purpose
  source documents                      48%
  suspicious documents
    with plagiarism                     28%
    without plagiarism                  24%

Document length
  short (<10 pages)                     64%
  medium (10-100 pages)                 35%
  long (>100 pages)                     1%

Plagiarism per document
  hardly (<20%)                         25%
  medium (20%-50%)                      20%
  much (50%-80%)                        26%
  entirely (>80%)                       29%

Case length
  short (<1k characters)                37%
  medium (1k-3k characters)             55%
  long (>3k characters)                 8%
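
As a quick sanity check on these figures, the sketch below confirms that each percentage group sums to 100% and turns two of the groups into approximate absolute counts. The counts are rounded estimates derived from the percentages, not numbers reported on the slide, and the choice of base total for each group (all documents for document purpose, all cases for case length) is an assumption.

```python
TOTAL_DOCUMENTS = 11089
TOTAL_CASES = 11603

# Percentages as given in the dataset statistics.
distributions = {
    "document purpose": {"source": 48, "susp. with plagiarism": 28,
                         "susp. without plagiarism": 24},
    "document length": {"short": 64, "medium": 35, "long": 1},
    "plagiarism per document": {"hardly": 25, "medium": 20,
                                "much": 26, "entirely": 29},
    "case length": {"short": 37, "medium": 55, "long": 8},
}


def approx_counts(total: int, shares: dict) -> dict:
    """Convert percentage shares into rounded absolute counts (estimates)."""
    return {label: round(total * pct / 100) for label, pct in shares.items()}


group_sums = {name: sum(shares.values()) for name, shares in distributions.items()}
purpose_counts = approx_counts(TOTAL_DOCUMENTS, distributions["document purpose"])
case_counts = approx_counts(TOTAL_CASES, distributions["case length"])
```

Under these assumptions roughly 5,300 of the documents are source documents and a little over 4,000 cases are shorter than 1k characters.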

25/29 Conclusion
• This article describes a methodology for building a Persian corpus for evaluating plagiarism detection systems.
• The corpus is in PAN format.
• The corpus contains a wide variety of plagiarism types, created in large volume.
• In producing the corpus, the focus is on simulating different types of plagiarism.
• Different strategies are employed to create obfuscation in each plagiarism category.

26/29 References
1. Asghari, H., Mohtaj, S., Fatemi, O., Faili, H., Potthast, M. and Rosso, P. 2016. Overview of the PAN@FIRE2016 Shared Task on Persian Plagiarism Detection and Text Alignment Corpus Construction. Notebook Papers of FIRE 2016, CEUR-WS.org.
2. Alzahrani, M., Salim, N. and Abraham, A. 2012. Understanding Plagiarism Linguistic Patterns, Textual Features, and Detection Methods. IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews, vol. 42, no. 2.
3. Potthast, M., Stein, B., et al. 2010. An Evaluation Framework for Plagiarism Detection. Proceedings of the 23rd International Conference on Computational Linguistics (COLING 2010), Beijing. ACL.
4. Shamsfard, M., Kiani, S. and Shahedi, Y. STeP-1: Standard Text Preparation for Persian Language. CAASL3, Third Workshop on Computational Approaches to Arabic Script-based Languages.
5. Makrehchi, M. and Kamel, M. 2004. A Fuzzy Set Approach to Extracting Keywords from Abstracts. North American Fuzzy Information Processing Society (NAFIPS 2003), Banff, Canada.
6. Shafiee, F. and Shamsfard, M. 2015. The Automatic Persian Summarizer. The 20th Computer Society of Iran Computer Conference.