
Intrinsic Plagiarism Detection Using Character n-gram Profiles
Efstathios Stamatatos
University of the Aegean

Talk Layout
• Introduction
• The style change function
• Detecting plagiarism
• Evaluation
• Conclusions

Intrinsic Plagiarism Detection
• Ambitious and demanding task
• It can be used:
  – When no appropriate reference corpus is available
  – When the reference corpus is too large (web)
• Closely related to authorship verification
• Detection of irregularities of stylistic nature
  – However, not all stylistic irregularities are caused by plagiarism

Representing Writing Style
• Lexical features
• Character features
• Syntactic features
• Semantic features
• Application-specific features

Character n-grams
• Can be easily measured in any text
• Language-independent
• Domain-independent
• Require no text preprocessing
• Very effective in authorship attribution
• Robust to noise
  – Obfuscation in plagiarism can be considered as noise insertion

The Proposed Approach
• The variation of document style is represented by the style change function
  – Using a sliding window over the text length
• Writing style is represented by character n-gram profiles
  – The set of different character n-grams encountered in the text and their normalized frequencies
• A set of heuristic rules:
  – Decide whether or not the document is plagiarism-free
  – Detect the plagiarized section boundaries
  – Detect irrelevant stylistic inconsistencies
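A character n-gram profile can be sketched in a few lines of Python. This is a minimal illustration, not the author's implementation: the function name `ngram_profile` is hypothetical, and normalizing each count by the total number of n-grams in the text is an assumption (the slides only say "normalized frequencies").

```python
from collections import Counter


def ngram_profile(text, n=3):
    """Map every distinct character n-gram in `text` to its normalized frequency.

    Assumption: "normalized" means count divided by the total number of
    n-grams extracted from the text.
    """
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    if not grams:
        return {}  # text shorter than n characters
    counts = Counter(grams)
    total = len(grams)
    return {gram: count / total for gram, count in counts.items()}


# "abcabc" contains four trigrams: abc, bca, cab, abc
profile = ngram_profile("abcabc")
```

Because the profile is built from raw characters, no tokenizer or language-specific tool is needed, which is what makes the representation language- and domain-independent.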

Representing Stylistic Changes
[Diagram: a sliding window (length, step) moves over the document; the profile of the window text and the profile of the whole document are fed to the distance estimation]
• High value means stylistic anomaly
• Low value means stylistic consistency

Distance Estimation
• The sliding window text is shorter (or much shorter) than the whole document
• An accurate and robust function for imbalanced profiles is proposed by (Stamatatos, 2007):
  $d_1(A, B) = \sum_{g \in P(A)} \left( \frac{2\,(f_A(g) - f_B(g))}{f_A(g) + f_B(g)} \right)^2$
  where P(A) is the profile of text A, and f_A(g), f_B(g) are the frequencies of n-gram g in A and B
• This is not a symmetric function
  – dissimilarity rather than distance measure
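The asymmetry of d_1 is easy to see in code: the sum runs only over the n-grams of the first profile. The sketch below assumes profiles are plain dictionaries from n-gram to normalized frequency; the function name `d1` and the toy profiles are illustrative only.

```python
def d1(profile_a, profile_b):
    """Dissimilarity of profile A from profile B, summed over the n-grams of A only.

    Every frequency in profile_a is positive, so the denominator never vanishes.
    An n-gram of A that is absent from B contributes the maximum term, 4.
    """
    total = 0.0
    for gram, f_a in profile_a.items():
        f_b = profile_b.get(gram, 0.0)
        total += (2.0 * (f_a - f_b) / (f_a + f_b)) ** 2
    return total


# Toy profiles (hypothetical values) showing that d1(A, B) != d1(B, A)
short_text = {"ab": 1.0}
long_text = {"ab": 0.5, "bc": 0.5}
forward = d1(short_text, long_text)   # only "ab" is compared
backward = d1(long_text, short_text)  # "bc" is missing from the short text
```

Summing over the shorter text's profile means the window is never penalized for n-grams that appear only elsewhere in the long document, which is why the function suits imbalanced profile sizes.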

Style Change Function
• d_1 is normalized over the profile length:
  $nd_1(A, B) = \frac{\sum_{g \in P(A)} \left( \frac{2\,(f_A(g) - f_B(g))}{f_A(g) + f_B(g)} \right)^2}{4\,|P(A)|}$
• Then, the style change function sc of a document D is:
  $sc(i, D) = nd_1(w_i, D), \quad i = 1, \ldots, |w|$
• |w| depends on the text-length:
  $|w| = \left\lfloor 1 + \frac{x - l}{s} \right\rfloor$
  – x: text-length
  – l: sliding window length
  – s: sliding window step
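Putting the pieces together, the style change function could be computed roughly as follows. This is a simplified sketch under stated assumptions: windows are measured in all characters (not letters only), frequencies are normalized by the total n-gram count, a text shorter than one window yields no windows, and all function names are hypothetical.

```python
from collections import Counter


def ngram_profile(text, n=3):
    """Normalized character n-gram frequencies (assumed: count / total n-grams)."""
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    total = len(grams)
    return {g: c / total for g, c in Counter(grams).items()} if total else {}


def nd1(profile_a, profile_b):
    """d_1 divided by 4|P(A)|, so the result lies in [0, 1]."""
    if not profile_a:
        return 0.0
    total = 0.0
    for gram, f_a in profile_a.items():
        f_b = profile_b.get(gram, 0.0)
        total += (2.0 * (f_a - f_b) / (f_a + f_b)) ** 2
    return total / (4.0 * len(profile_a))


def window_count(x, l, s):
    """|w| = floor(1 + (x - l) / s); assumed to be 0 when the text is shorter than l."""
    return 1 + (x - l) // s if x >= l else 0


def style_change(text, n=3, l=1000, s=200):
    """sc(i, D) for every window i: nd_1 of the window profile against the document profile."""
    doc_profile = ngram_profile(text, n)
    return [
        nd1(ngram_profile(text[i * s:i * s + l], n), doc_profile)
        for i in range(window_count(len(text), l, s))
    ]


# A perfectly repetitive text is stylistically uniform, so sc stays near zero
sc_values = style_change("abc" * 1000)
```

Dividing by 4|P(A)| works because each term of the sum is at most 4, so the normalized value is comparable across windows whose profiles differ in size.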

An Example
[Figure: style change function (y-axis, 0.00 to 0.50) vs. sliding window position (x-axis, 0 to 800), IPAT-DC document #5]

A Plagiarism-free Example
[Figure: style change function (y-axis, 0.00 to 0.50) vs. sliding window position (x-axis, 0 to 600), IPAT-DC document #17]

Detecting Plagiarism on the Document Level
• This is crucial to keep precision high
• Two options:
  – Pre-processing
  – Post-processing
• Plagiarism-free criterion: S < t_1
  where
  – S: the standard deviation of the style change function
  – t_1: a predefined threshold (0.02)
• Deficiencies:
  – Very short documents tend to have low sc values
  – Very long documents may contain stylistically inconsistent sections (high variance of sc)
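The document-level criterion reduces to one comparison. In this sketch the population standard deviation is used, which is an assumption (the slides do not say whether the population or sample form is meant); the function name and the sample sc values are hypothetical.

```python
from statistics import pstdev


def is_plagiarism_free(sc_values, t1=0.02):
    """Return True when the standard deviation S of the sc values is below t1.

    Assumption: S is the population standard deviation.
    """
    if not sc_values:
        raise ValueError("at least one sliding window is required")
    return pstdev(sc_values) < t1


# Hypothetical sc curves: a flat one, and one with a single pronounced peak
flat_curve = [0.10, 0.11, 0.10, 0.12]
peaked_curve = [0.10, 0.10, 0.30, 0.10]
```

A flat curve is accepted as plagiarism-free, while a single strong peak pushes S above the threshold and sends the document on to passage detection.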

A False Negative Example
[Figure: style change function (y-axis, 0.00 to 0.50) vs. sliding window position (x-axis, 0 to 150), IPAT-DC document #34]

Identifying Plagiarized Passages
• It is assumed that at least half of the text is not plagiarized
  – The average sc value would correspond to the style of the alleged author
• In general, the amount of plagiarized text is not known
  – All sc values greater than M + S are removed (M: mean of sc, S: standard deviation of sc)
  – M′ and S′ are then calculated from the remaining values
• Plagiarized passage criterion: sc(i′, D) > M′ + a · S′
  – a determines the sensitivity of the method (set to 2.0)
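The two-pass thresholding can be sketched as below. Two points are assumptions rather than statements from the slides: the population standard deviation is used, and the final criterion is applied to every window of the original curve. The function name and the sample values are hypothetical.

```python
from statistics import mean, pstdev


def flag_plagiarized_windows(sc_values, a=2.0):
    """Return the indices of windows whose sc value exceeds M' + a * S'.

    Pass 1 drops values above M + S, so that M' and S' describe the style
    of the presumed main author rather than the outliers.
    """
    m, s = mean(sc_values), pstdev(sc_values)
    kept = [v for v in sc_values if v <= m + s]  # never empty: min(sc) <= M
    m_prime, s_prime = mean(kept), pstdev(kept)
    threshold = m_prime + a * s_prime
    return [i for i, v in enumerate(sc_values) if v > threshold]


# Hypothetical curve with two consecutive anomalous windows
curve = [0.10, 0.11, 0.10, 0.11, 0.10, 0.30, 0.32, 0.10, 0.11]
flagged = flag_plagiarized_windows(curve)
```

Recomputing the mean and deviation after removing the high values keeps a large plagiarized section from inflating its own detection threshold.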

An Example
[Figure: style change function (y-axis, 0.00 to 0.50) vs. sliding window position (x-axis, 0 to 800), IPAT-DC document #5]

Another Example
[Figure: style change function (y-axis, 0.00 to 0.50) vs. sliding window position (x-axis, 0 to 400), IPAT-DC document #22]

Detecting Irrelevant Style Changes
• Not all stylistic changes are caused by plagiarism
  – Text formatting affects style
  – Genre affects style
  – …
• To reduce the formatting factor:
  – All text is transformed to lowercase
  – Every character n-gram that contains no letter characters (a-z) is removed from the profile
  – The sliding window parameters operate on letter characters
    • each window has the same number of letter characters (window length l) but a different number of total characters (real window length l′)
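The formatting-related steps might look like this in Python. It is a sketch only: the helper names are hypothetical, normalizing after the letter-free n-grams are dropped is an assumption, and the returned offsets refer to the lowercased text.

```python
import re
from collections import Counter

_HAS_LETTER = re.compile(r"[a-z]")


def letter_profile(text, n=3):
    """Lowercase the text and keep only n-grams containing at least one letter a-z.

    Assumption: frequencies are normalized after the letter-free n-grams are dropped.
    """
    text = text.lower()
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    grams = [g for g in grams if _HAS_LETTER.search(g)]
    total = len(grams)
    return {g: c / total for g, c in Counter(grams).items()} if total else {}


def letter_windows(text, l=1000, s=200):
    """Return (start, end) offsets of windows that each hold exactly l letters.

    The step s is also counted in letters; end - start is the real window
    length l' in total characters.
    """
    lowered = text.lower()
    letter_positions = [i for i, ch in enumerate(lowered) if "a" <= ch <= "z"]
    return [
        (letter_positions[first], letter_positions[first + l - 1] + 1)
        for first in range(0, len(letter_positions) - l + 1, s)
    ]


profile = letter_profile("AB 12 cd", n=2)
windows = letter_windows("a1b2c3d4", l=2, s=1)
```

In the toy call every window holds two letters but spans three characters, which is the gap between l and l′ that the special section criterion later relies on.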

Detecting Irrelevant Style Changes
• To reduce the multiple genre factor:
  – Special Section Criterion: l′ < t_2
    where
    – l′: the real window length
    – t_2: a predefined threshold (1,500)
  – It combines with the plagiarized passage criterion
• Weaknesses
  – One can insert multiple non-letter characters to obfuscate a plagiarized section
  – All special sections (table-of-contents, index) are considered plagiarism-free
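One way to read "combines with the plagiarized passage criterion" is as a logical AND: a window is reported only if it is stylistically anomalous and is not a special section. That reading is an assumption, as are the function name and the sample values in this sketch.

```python
def is_plagiarized_window(sc_value, passage_threshold, real_length, t2=1500):
    """Flag a window only if both criteria hold.

    passage_threshold stands for M' + a * S'; real_length is l', the total
    number of characters spanned by the window. Windows with l' >= t2 are
    treated as special sections (many non-letter characters) and skipped.
    """
    return sc_value > passage_threshold and real_length < t2


# Hypothetical windows: ordinary prose vs. a table-of-contents-like section
ordinary = is_plagiarized_window(0.30, 0.12, real_length=1200)
special = is_plagiarized_window(0.30, 0.12, real_length=1900)
```

The second call shows the weakness noted on the slide: padding a passage with non-letter characters raises l′ above t_2 and hides the passage from detection.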

An Example
[Figure: IPAT-DC document #46]

Summary of Parameter Settings

Description                               Symbol   Value
Character n-gram length                   n        3
Sliding window length                     l        1,000
Sliding window step                       s        200
Threshold of plagiarism-free criterion    t_1      0.02
Real window length threshold              t_2      1,500
Sensitivity of plagiarism detection       a        2

• Empirically derived, not optimized

Evaluation on the Document Level

                          Actual: plagiarism-free   Actual: plagiarized
Guess: plagiarism-free    1102                      545 (22%)
Guess: plagiarized        443                       1001 (78%)

• Percentages: share of plagiarized passages; 78% is the upper bound for recall
• Results on IPAT-DC
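Document-level scores can be derived from the four counts above. The figures computed here are not reported on the slide; they follow from the counts only if rows are read as the guess and columns as the actual class, which is an assumption about the flattened table. The function name is hypothetical.

```python
# Counts from the IPAT-DC slide, under the assumed row/column reading
true_negatives = 1102   # guessed plagiarism-free, actually plagiarism-free
false_negatives = 545   # guessed plagiarism-free, actually plagiarized
false_positives = 443   # guessed plagiarized, actually plagiarism-free
true_positives = 1001   # guessed plagiarized, actually plagiarized


def document_level_scores(tp, fp, fn, tn):
    """Return (precision, recall, accuracy) for the 'plagiarized' class."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return precision, recall, accuracy


precision, recall, accuracy = document_level_scores(
    true_positives, false_positives, false_negatives, true_negatives
)
```

Under this reading roughly two thirds of the documents are classified correctly, and the missed plagiarized documents cap the recall that passage-level detection can reach.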

False Negatives
• The majority of false negatives are relatively short documents (<30K chars)
• The shorter a document, the more likely it is to be a false negative
[Figure: bar chart of the number of documents (false negatives vs. all documents, y-axis 0 to 1600) by text length in chars: <10K, 10K-30K, 30K-100K, >100K]

Evaluation on the Passage Level

Corpus           IPAT-DC   IPAT-CC
Recall           0.4552    0.4607
Precision        0.2183    0.2321
F-score          0.2876    0.3086
Granularity      1.22      1.25
Overall score    0.2358    0.2462

• Performance remains stable for both corpora

Recall and Precision vs. Text-length
• Recall is affected by decreasing text-length
  – A result of false negative distribution
[Figure: recall and precision (y-axis, 0 to 60) by text length in chars: <10K, 10K-30K, 30K-100K, >100K]

Conclusions
• A fully-automated approach
  – Easy to follow (no text preprocessing)
  – Able to detect plagiarism-free documents
  – Able to detect plagiarized passage boundaries
• Nearly half of plagiarized passages are detected while precision remains low
  – An increased a value can improve precision (and harm recall)
• Window length determines the shortest plagiarized passage that can be detected
