Uncovering Plagiarism, Authorship, and Social Software Misuse PAN 2011 Results [pan.webis.de]
The PAN Competition
Plagiarism Detection

The web is rife with text reuse: boilerplate, translations, paraphrases, summaries, and plagiarism.

Tasks:
❑ External Detection. Given a suspicious document and a set of potential source documents, the task is to find all plagiarized passages in the suspicious document and their corresponding source passages in the source documents.
❑ Intrinsic Detection. Given a suspicious document, the task is to extract all plagiarized passages based on clues extracted from the document itself.

Corpus:
❑ PAN plagiarism corpus of 2010, 2011 [www.webis.de/research/corpora]
❑ 61 000 plagiarism cases hidden in about 27 000 documents
❑ 5 plagiarism-relevant parameters (length, language, task, obfuscation, fraction)

© www.webis.de
The PAN Competition

[Figure: bar charts of plagdet, precision, and recall (axes from 0 to 1) and granularity (axis from 1 to 2) per participant.]

External plagiarism detection (participants as listed in the chart): Grman, Grozea, Oberreuter, Cooke, Torrejón, Rao, Palkovskii, Nawab, Ghosh

Intrinsic plagiarism detection (participants as listed in the chart): Oberreuter, Stamatatos, Kestemont, Akiva, Gupta

❑ Plagdet combines the measures as F / log2(1 + granularity), where F is the harmonic mean of precision and recall.
❑ Granularity measures the average number of times a plagiarism case is detected.
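The plagdet bullet can be made concrete with a small sketch. It assumes the base-2 logarithm of 1 + granularity as the normalizer, so that a granularity of 1 leaves F unchanged, and it treats passages as half-open character intervals. The function names and the interval representation are illustrative assumptions, not the official PAN evaluation code.

```python
from math import log2


def f_measure(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def granularity(cases, detections):
    """Average number of detections overlapping each detected case.

    Both arguments are lists of half-open (start, end) character
    intervals; this representation is an assumption for illustration.
    Cases that no detection touches are ignored.
    """
    counts = []
    for c_start, c_end in cases:
        hits = sum(
            1
            for d_start, d_end in detections
            if d_start < c_end and c_start < d_end
        )
        if hits:
            counts.append(hits)
    # With nothing detected, fall back to the ideal value of 1.
    return sum(counts) / len(counts) if counts else 1.0


def plagdet(precision, recall, gran):
    """F divided by log2(1 + granularity)."""
    return f_measure(precision, recall) / log2(1 + gran)
```

A detector that reports each case exactly once has granularity 1 and keeps its F value; reporting a case in several fragments raises the granularity and lowers the combined score.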
The PAN Competition
Authorship Identification

Many texts on the web are of uncertain authorship.

Tasks:
❑ Authorship Attribution. Given a document of uncertain authorship and documents from a set of candidate authors, the task is to map the document onto its true author among the candidates.
❑ Authorship Verification. Given a document of uncertain authorship and a document from a specific author, the task is to determine whether the given text has been written by that author.

Corpus:
❑ Subset of the Enron Email Dataset [www.cs.cmu.edu/~enron]
❑ More than 12 000 documents written by 118 authors
❑ 3 relevant parameters (task, candidate set size, closed vs. open candidate set)
The PAN Competition

[Figure: bar charts of F, precision, and recall per participant, each on an axis from 0 to 1.]

Authorship attribution (participants as listed in the chart): Tanguy, Mikros, Escalante, Kourtis, Luyckx, Vilarino, Kern, Snyder, Ryan, Solorio, Eriksson, Noecker

Authorship verification (participants as listed in the chart): Escalante, Snider, Kern, Eriksson, Tanguy, Vilarino, Mikros
The PAN Competition
Wikipedia Vandalism Detection

Every edit on Wikipedia has to be double-checked for integrity.

Task:
❑ Given a set of edits on Wikipedia articles, separate the ill-intentioned edits from the well-intentioned edits.

Corpus:
❑ PAN Wikipedia vandalism corpus of 2010, 2011 [www.webis.de/research/corpora]
❑ About 2 800 vandalism cases among about 30 000 edits
❑ 3 languages with corpus annotations obtained from Mechanical Turk
The PAN Competition
Wikipedia Vandalism Detection

[Figure: precision-recall curves (precision over recall, both from 0 to 1) in three panels: English, German, Spanish. Curve labels: West and Lee (PR-AUC 0.82230, 0.70591, 0.48938); Drăgușanu et al. (PR-AUC 0.42464); Aksit (PR-AUC 0.22077, 0.18978).]

[Figure: precision-recall curves for English on PAN-WVC-10. Curve labels: West and Lee (PR-AUC 0.75385); Mola Velasco (PR-AUC 0.66522); Adler et al. (PR-AUC 0.49263).]
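The PR-AUC values quoted for each curve can be approximated from ranked classifier scores. The following is a minimal sketch, assuming one vandalism score per edit, labels of 1 for vandalism and 0 for regular edits, and trapezoidal integration. The helper names are hypothetical, ties between scores are not handled specially, and the official evaluation may interpolate the curve differently.

```python
def pr_curve(labels, scores):
    """Return (recall, precision) points, lowering the threshold one edit at a time.

    labels: 1 for vandalism, 0 for a regular edit (assumed encoding).
    scores: higher means "more likely vandalism".
    Assumes at least one edit is labeled as vandalism.
    """
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    total_positive = sum(labels)
    points = []
    true_positive = 0
    for rank, (_, label) in enumerate(ranked, start=1):
        true_positive += label
        points.append((true_positive / total_positive, true_positive / rank))
    return points


def pr_auc(points):
    """Trapezoidal area under the curve.

    The curve is extended to recall 0 using the precision of the
    first point, which is one of several possible conventions.
    """
    area = 0.0
    prev_recall, prev_precision = 0.0, points[0][1]
    for recall, precision in points:
        area += (recall - prev_recall) * (precision + prev_precision) / 2
        prev_recall, prev_precision = recall, precision
    return area
```

A classifier that ranks every vandalism edit above every regular edit reaches an area of 1 under this convention; the more regular edits are ranked above vandalism edits, the smaller the area becomes.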
Quo Vadis PAN?
Lessons Learned and Outlook

❑ Focus & Simplicity
  ➜ Focus on specific aspects of the tasks.
  ➜ Reduced number of task variants.
  ➜ Reduced number of parameters and limited ranges.
❑ Realism & Scale
  ➜ New corpora for plagiarism detection and authorship identification.
  ➜ Scale up where necessary, scale down otherwise.
❑ Contributions & Challenges
  ➜ Inclusion of real plagiarism and real cases of disputed authorship.
  ➜ Distinguishing text reuse and plagiarism.
  ➜ Considering human performance.
Thank you! Visit us at pan.webis.de. Mail us at pan@webis.de.