Developing a corpus of plagiarized short answers [Clough and Stevenson, 2011]

Björn Rudzewitz (brzdwtz@sfs.uni-tuebingen.de)
University of Tübingen

Hauptseminar Language Variation and Stylometrics, WS 15/16
December 16, 2015

◮ Introduction
◮ Plagiarism Typology
◮ Corpus Creation
◮ Data Analysis
◮ Individual Differences
◮ Data Observations
◮ Automatic Plagiarism Detection
◮ N-Gram Overlap
◮ LCS
◮ Baselines
◮ L1 vs L2
◮ Classification
◮ Conclusion
◮ Discussion
◮ References

To avoid the objection of plagiarism: ideas and examples in this presentation are taken from Clough and Stevenson [2011].

Motivation

◮ correlation between availability of electronic resources and plagiarism
◮ plagiarism detection as a field suffers from a lack of standardized evaluation resources
◮ previous corpus creation efforts were suboptimal:
  ◮ lack of data ("deception", how to find plagiarized text)
  ◮ lack of gold labels (authors deny judgments)
  ◮ lack of legal and ethical basis for data publication
  ◮ lack of transparency in data preparation
  (→ Leech's maxims for corpus creation)

Impact and application

Desired effects of the corpus:
◮ new resource for comparative evaluation and pedagogical methods
◮ enable new work on plagiarism detection and task strategies

Related work

◮ Microsoft Research Paraphrase Corpus [Dolan et al., 2004]
◮ Multiple-Translation Chinese Corpus [Pang et al., 2003]
◮ METER corpus [Gaizauskas et al., 2001]
◮ Corpus for plagiarism detection [Zu Eissen et al., 2007]
◮ PAN Plagiarism detection corpus [Eiselt and Rosso, 2009]

More related resources exist in Machine Translation evaluation and Short Answer Assessment.

High-level perspective on approaches

◮ extrinsic
  ◮ comparison of source and (potentially) plagiarized text
  ◮ authorship attribution approaches
◮ intrinsic
  ◮ comparison of text passages in one document with each other
  ◮ stylometric approaches

Problem: documents can plagiarize n ∈ ℕ₀ other documents in different ways
→ interaction between extrinsic and intrinsic analysis desirable

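To make the extrinsic idea of comparing a source with a potentially plagiarized text concrete, here is a minimal sketch of a word n-gram containment score. The function names, the whitespace tokenization, the choice of n = 3, and the example sentences are all assumptions made for illustration; they are not taken from the slides or from the evaluated systems.

```python
def word_ngrams(text, n):
    """Return the set of word n-grams of a lowercased, whitespace-tokenized text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def containment(answer, source, n=3):
    """Share of the answer's n-grams that also occur in the source (0.0 to 1.0)."""
    answer_grams = word_ngrams(answer, n)
    if not answer_grams:
        # Answers shorter than n words have no n-grams to compare.
        return 0.0
    return len(answer_grams & word_ngrams(source, n)) / len(answer_grams)


# Invented example texts (assumption, not corpus data).
source = "inheritance allows a class to reuse the behaviour of another class"
near_copy = "inheritance allows a class to reuse the behaviour of another class"
rewritten = "a class can take over behaviour that was defined in a different class"

print(containment(near_copy, source))
print(containment(rewritten, source))
```

A verbatim copy shares all of its trigrams with the source, while a text that shares no three-word sequence with it scores zero. An intrinsic approach would have no `source` argument at all and would instead compare passages of a single document with each other.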
Plagiarism Techniques: How to plagiarize

Goal: produce an answer of 200-300 words to a question

◮ Near copy
  ◮ copy-and-paste (parts of) Wikipedia article
◮ Light revision
  ◮ like near copy, but with the possibility to replace words with synonyms and to paraphrase (lexically/morphosyntactically)
  ◮ information structure preserved
◮ Heavy revision
  ◮ rephrasing/paraphrasing of Wikipedia article, n-to-m sentence alignment
◮ Non-plagiarism
  ◮ no access to Wikipedia
  ◮ participants read material, then answer the question with their (partly freshly) acquired knowledge

Corpus Creation

◮ 19 participants, CS students
◮ each participant wrote an answer for each task (non-plagiarism twice)
  → 95 answers + 5 articles = 100 documents (19,995 tokens)
◮ Graeco-Latin Square Design for systematic randomization and rotation of revision types per participant and question
◮ participant metadata: native language, familiarity with answer, perceived difficulty of task

Tokens per answer: μ = 208, σ = 64.91
Types per answer:  μ = 113, σ = 30.11

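The rotation described above can be sketched with a 5×5 Graeco-Latin square, that is, two orthogonal Latin squares laid on top of each other. The construction below, the task and condition labels, and the mapping of participants to rows are assumptions chosen for illustration; the slides do not state how the authors built their square.

```python
# Hypothetical labels: the slides only say there is one answer per task
# and that the non-plagiarism condition occurs twice per participant.
TASKS = ["task_a", "task_b", "task_c", "task_d", "task_e"]
CONDITIONS = ["near_copy", "light_revision", "heavy_revision",
              "non_plagiarism_1", "non_plagiarism_2"]


def graeco_latin_square(n):
    """Superimpose two orthogonal Latin squares of odd order n.

    Cell (row, col) holds the index pair ((row + col) % n, (2*row + col) % n).
    """
    if n % 2 == 0:
        raise ValueError("this simple construction needs an odd order")
    return [[((r + c) % n, (2 * r + c) % n) for c in range(n)] for r in range(n)]


square = graeco_latin_square(5)


def plan_for(participant):
    """Assumed mapping: a participant uses row (participant index mod 5)."""
    row = square[participant % 5]
    return [(TASKS[t], CONDITIONS[c]) for t, c in row]


plans = [plan_for(p) for p in range(19)]
n_answers = sum(len(plan) for plan in plans)

print(plans[0])
print(n_answers + 5)  # answers plus the 5 source articles
```

Every row pairs each task with a different condition, and no (task, condition) pair repeats anywhere in the square, which is the property that makes the rotation systematic. With 19 participants and 5 tasks this yields the 95 answers counted on the slide.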
Data Analysis: Individual Differences

◮ statistically significant difference (p < 0.01) between native and non-native speakers with respect to mean knowledge and perceived difficulty (two-sample t-test)
  → difference in population means of two independent samples
◮ positive Pearson's correlation of r = 0.344 between knowledge and perceived difficulty

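For readers who want to see the two statistics spelled out, the following sketch computes a two-sample t statistic and Pearson's r with the Python standard library only. The ratings are invented toy values, not corpus data, and the pooled-variance (Student) form of the t-test is an assumption, since the slide does not say which variant was used.

```python
from math import sqrt
from statistics import mean, variance


def pooled_t_statistic(sample_a, sample_b):
    """Student's two-sample t statistic with pooled variance, plus degrees of freedom."""
    n_a, n_b = len(sample_a), len(sample_b)
    df = n_a + n_b - 2
    pooled_var = ((n_a - 1) * variance(sample_a)
                  + (n_b - 1) * variance(sample_b)) / df
    t = (mean(sample_a) - mean(sample_b)) / sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return t, df


def pearson_r(xs, ys):
    """Pearson's product-moment correlation of two equally long sequences."""
    mx, my = mean(xs), mean(ys)
    covariation = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    spread = sqrt(sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys))
    return covariation / spread


# Invented toy ratings (assumption): self-reported knowledge on a 1-5 scale.
native = [4.0, 5.0, 3.0, 4.0, 5.0]
non_native = [2.0, 3.0, 2.0, 3.0, 1.0]

t, df = pooled_t_statistic(native, non_native)
print(t, df)
print(pearson_r(native, non_native))
```

The p-value is then read from a t distribution with `df` degrees of freedom; the standard library has no such distribution, so in practice one would likely use a statistics package such as SciPy (`scipy.stats.ttest_ind`, `scipy.stats.pearsonr`).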
Data Analysis: Observations
