Intrinsic Plagiarism Detection Using Character n-gram Profiles
Efstathios Stamatatos
University of the Aegean
Talk Layout
• Introduction
• The style change function
• Detecting plagiarism
• Evaluation
• Conclusions
Intrinsic Plagiarism Detection
• Ambitious and demanding task
• It can be used:
– When no appropriate reference corpus is available
– When the reference corpus is too large (web)
• Closely related to authorship verification
• Detection of irregularities of stylistic nature
– However, not all stylistic irregularities are caused by plagiarism
Representing Writing Style
• Lexical features
• Character features
• Syntactic features
• Semantic features
• Application-specific features
Character n-grams
• Can be easily measured in any text
• Language-independent
• Domain-independent
• Require no text preprocessing
• Very effective in authorship attribution
• Robust to noise
– Obfuscation in plagiarism can be considered as noise insertion
The Proposed Approach
• The variation of document style is represented by the style change function
– Using a sliding window over the text-length
• Writing style is represented by character n-gram profiles
– The set of different character n-grams encountered in the text and their normalized frequencies
• A set of heuristic rules:
– Decide whether or not the document is plagiarism-free
– Detect the plagiarized section boundaries
– Detect irrelevant stylistic inconsistencies
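A character n-gram profile as described above can be sketched in a few lines. This is a minimal illustration, not the author's implementation: the function name `ngram_profile` is hypothetical, and normalizing each count by the total number of n-grams in the text is an assumption about what "normalized frequencies" means here.

```python
from collections import Counter


def ngram_profile(text, n=3):
    """Map each distinct character n-gram in `text` to its relative frequency.

    Assumption: frequencies are normalized by the total number of n-grams.
    """
    counts = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    total = sum(counts.values())
    # An empty profile is returned for texts shorter than n characters.
    return {g: c / total for g, c in counts.items()} if total else {}


# "abcabc" contains the 3-grams abc, bca, cab, abc.
profile = ngram_profile("abcabc", n=3)
```

Here `profile` holds three distinct 3-grams, with "abc" accounting for half of all occurrences.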
Representing Stylistic Changes
[Diagram: a sliding window (length, step) moves over the document; the profile of the window text and the profile of the whole document are compared by a distance estimation]
• High value means stylistic anomaly
• Low value means stylistic consistency
Distance Estimation
• The sliding window text is shorter (or much shorter) than the whole document
• An accurate and robust function for imbalanced profiles is proposed by (Stamatatos, 2007):
$$d_1(A, B) = \sum_{g \in P(A)} \left( \frac{2\,\bigl(f_A(g) - f_B(g)\bigr)}{f_A(g) + f_B(g)} \right)^2$$
– f_A(g), f_B(g): frequencies of n-gram g in texts A and B; P(A): the profile of A
• This is not a symmetric function
– dissimilarity rather than distance measure
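The dissimilarity function can be sketched as follows, assuming profiles are stored as dictionaries from n-gram to normalized frequency (a representation chosen for illustration, not taken from the slides). The sum runs only over the n-grams of the first profile, which is what makes the function asymmetric.

```python
def d1(profile_a, profile_b):
    """Dissimilarity of profile A from profile B, summed over n-grams of A only.

    Assumption: an n-gram missing from B has frequency 0 in B.
    """
    total = 0.0
    for gram, f_a in profile_a.items():
        f_b = profile_b.get(gram, 0.0)
        # f_a > 0 for every n-gram in A's profile, so the denominator is nonzero.
        total += (2.0 * (f_a - f_b) / (f_a + f_b)) ** 2
    return total


# Toy profiles: B lacks the n-gram "bc" that A contains.
a = {"ab": 0.5, "bc": 0.5}
b = {"ab": 0.5}
forward = d1(a, b)   # "bc" contributes (2 * 0.5 / 0.5) ** 2 = 4
backward = d1(b, a)  # only "ab" is compared, and the frequencies match
```

Swapping the arguments changes the result, which illustrates why this is a dissimilarity rather than a distance measure.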
Style Change Function
• d1 is normalized over the profile length:
$$nd_1(A, B) = \frac{\displaystyle\sum_{g \in P(A)} \left( \frac{2\,\bigl(f_A(g) - f_B(g)\bigr)}{f_A(g) + f_B(g)} \right)^2}{4\,|P(A)|}$$
• Then, the style change function sc of a document D is:
$$sc(i, D) = nd_1(w_i, D), \quad i = 1, \ldots, |w|$$
• |w| depends on the text-length:
$$|w| = \left\lfloor 1 + \frac{x - l}{s} \right\rfloor$$
– x: text-length
– l: sliding window length
– s: sliding window step
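Putting the pieces together, the style change function might be computed as in the sketch below. All function names are hypothetical, and this simplified version measures window length and step in raw characters rather than letter characters, so it illustrates the formulas on this slide only.

```python
from collections import Counter


def ngram_profile(text, n=3):
    """Relative frequency of each character n-gram (assumed normalization)."""
    counts = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    total = sum(counts.values())
    return {g: c / total for g, c in counts.items()} if total else {}


def nd1(profile_a, profile_b):
    """d1 divided by 4 * |P(A)|, which keeps the value in [0, 1]."""
    if not profile_a:
        return 0.0
    d = sum((2.0 * (fa - profile_b.get(g, 0.0)) / (fa + profile_b.get(g, 0.0))) ** 2
            for g, fa in profile_a.items())
    return d / (4 * len(profile_a))


def window_count(x, l, s):
    """|w| = floor(1 + (x - l) / s), or 0 when the text is shorter than one window."""
    return max(0, 1 + (x - l) // s)


def style_change(text, n=3, l=1000, s=200):
    """sc(i, D) for every window position i, compared against the whole document."""
    doc_profile = ngram_profile(text, n)
    return [nd1(ngram_profile(text[k * s:k * s + l], n), doc_profile)
            for k in range(window_count(len(text), l, s))]


text = "the quick brown fox " * 100  # 2,000 characters of uniform style
sc = style_change(text)
```

With the parameter values from the talk (l = 1,000, s = 200), a 2,000-character text yields six window positions.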
An Example
[Figure: style change function (y-axis, 0.00 to 0.50) vs. sliding window position (x-axis, 0 to 800) for IPAT-DC document #5]
A Plagiarism-free Example
[Figure: style change function (y-axis, 0.00 to 0.50) vs. sliding window position (x-axis, 0 to 600) for IPAT-DC document #17]
Detecting Plagiarism on the Document Level
• This is crucial to keep precision high
• Two options:
– Pre-processing
– Post-processing
• Plagiarism-free criterion: S < t1, where
– S: the standard deviation of the style change function
– t1: a predefined threshold (0.02)
• Deficiencies:
– Very short documents tend to have low sc values
– Very long documents may contain stylistically inconsistent sections (high variance of sc)
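The document-level check reduces to one comparison. The sketch below assumes the population standard deviation (the slides do not say whether the population or sample form is used) and treats an empty style change function as plagiarism-free; both choices and the function name are illustrative.

```python
from statistics import pstdev


def is_plagiarism_free(sc, t1=0.02):
    """Apply the criterion S < t1 to a list of style change values.

    Assumptions: S is the population standard deviation, and a document
    too short to produce any window is treated as plagiarism-free.
    """
    if not sc:
        return True
    return pstdev(sc) < t1


flat = is_plagiarism_free([0.20, 0.21, 0.20, 0.19])    # nearly constant style
peaked = is_plagiarism_free([0.20, 0.20, 0.20, 0.45])  # one window stands out
```

The first call reports a plagiarism-free document, while the single outlying window in the second pushes S above the threshold.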
A False Negative Example
[Figure: style change function (y-axis, 0.00 to 0.50) vs. sliding window position (x-axis, 0 to 150) for IPAT-DC document #34]
Identifying Plagiarized Passages
• It is assumed that at least half of the text is not plagiarized
– The average sc value would correspond to the style of the alleged author
• In general, the amount of plagiarized text is not known
– All sc values greater than M + S are removed (M: mean, S: standard deviation of sc)
– M′ and S′ are then calculated from the remaining values
• Plagiarized passage criterion: sc(i′, D) > M′ + a · S′
– a determines the sensitivity of the method (set to 2.0)
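The two-pass thresholding can be sketched as follows. The function name is hypothetical, the population standard deviation is an assumption, and applying the final criterion to every window (including those removed in the first pass) is one reading of the slide.

```python
from statistics import mean, pstdev


def plagiarized_windows(sc, a=2.0):
    """Return indices of windows whose sc value exceeds M' + a * S'.

    Pass 1: drop values above M + S so likely plagiarized windows do not
    inflate the estimate of the alleged author's style.
    Pass 2: recompute mean and deviation (M', S') on the remaining values.
    """
    if not sc:
        return []
    m, s = mean(sc), pstdev(sc)
    kept = [v for v in sc if v <= m + s]  # never empty: the minimum is <= M
    m2, s2 = mean(kept), pstdev(kept)
    return [i for i, v in enumerate(sc) if v > m2 + a * s2]


sc = [0.20, 0.21, 0.19, 0.20, 0.22, 0.20, 0.45, 0.47, 0.21, 0.20]
flagged = plagiarized_windows(sc)
```

In this toy series only the two windows with values 0.45 and 0.47 are flagged. Raising `a` makes the criterion stricter.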
An Example
[Figure: style change function (y-axis, 0.00 to 0.50) vs. sliding window position (x-axis, 0 to 800) for IPAT-DC document #5]
Another Example
[Figure: style change function (y-axis, 0.00 to 0.50) vs. sliding window position (x-axis, 0 to 400) for IPAT-DC document #22]
Detecting Irrelevant Style Changes
• Not all stylistic changes are caused by plagiarism
– Text formatting affects style
– Genre affects style
– …
• To reduce the formatting factor:
– All text is transformed to lowercase
– Every character n-gram that contains no letter characters (a-z) is removed from the profile
– The sliding window parameters operate on letter characters
  • Each window has the same number of letter characters (window length l) but a different number of total characters (real window length l′)
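A possible reading of these formatting rules is sketched below. Function names are hypothetical; normalizing frequencies after the letter-free n-grams are removed, and stepping the window by s letter characters, are assumptions.

```python
import re
from collections import Counter

HAS_LETTER = re.compile(r"[a-z]")


def letter_profile(text, n=3):
    """Lowercase the text and keep only n-grams containing at least one a-z letter."""
    text = text.lower()
    grams = (text[i:i + n] for i in range(len(text) - n + 1))
    counts = Counter(g for g in grams if HAS_LETTER.search(g))
    total = sum(counts.values())
    return {g: c / total for g, c in counts.items()} if total else {}


def letter_windows(text, l=1000, s=200):
    """Return (start, end) character offsets of windows holding exactly l letters.

    Consecutive windows start s letters apart. The real window length l'
    of each window is end - start.
    """
    positions = [i for i, ch in enumerate(text.lower()) if "a" <= ch <= "z"]
    return [(positions[k], positions[k + l - 1] + 1)
            for k in range(0, len(positions) - l + 1, s)]


windows = letter_windows("ab1234cd ef", l=3, s=1)
real_lengths = [end - start for start, end in windows]
```

In the toy string every window holds three letters, but windows spanning the digit run have a larger real length than those that do not.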
Detecting Irrelevant Style Changes
• To reduce the multiple genre factor:
– Special Section Criterion: l′ < t2, where
  • l′: the real window length
  • t2: a predefined threshold (1,500)
– It combines with the plagiarized passage criterion
• Weaknesses
– One can insert multiple non-letter characters to obfuscate a plagiarized section
– All special sections (table-of-contents, index) are considered plagiarism-free
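One way to combine the two criteria is sketched below, under the interpretation that a window is reported only when its sc value exceeds the passage threshold M′ + a · S′ and its real length l′ stays below t2. The function name and argument layout are hypothetical.

```python
def flag_windows(sc, real_lengths, threshold, t2=1500):
    """Return indices of windows that satisfy both criteria.

    `threshold` stands for M' + a * S' from the plagiarized passage criterion.
    Windows with real length >= t2 are treated as special sections (assumed
    reading of the slide) and are never flagged.
    """
    return [i for i, (value, real_len) in enumerate(zip(sc, real_lengths))
            if value > threshold and real_len < t2]


# The third window has a high sc value but is dense in non-letter characters.
flagged = flag_windows([0.2, 0.5, 0.6], [1100, 1200, 2400], threshold=0.3)
```

The third window is skipped despite its high sc value, which also shows the first weakness listed above: padding a passage with non-letter characters pushes l′ past t2.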
An Example
[Figure: IPAT-DC document #46]
Summary of Parameter Settings
| Description | Symbol | Value |
|---|---|---|
| Character n-gram length | n | 3 |
| Sliding window length | l | 1,000 |
| Sliding window step | s | 200 |
| Threshold of plagiarism-free criterion | t1 | 0.02 |
| Real window length threshold | t2 | 1,500 |
| Sensitivity of plagiarism detection | a | 2 |
• Empirically derived, not optimized
Evaluation on the Document Level
| Guess \ Actual | Plagiarism-free | Plagiarized |
|---|---|---|
| Plagiarism-free | 1102 | 545 (22%) |
| Plagiarized | 443 | 1001 (78%) |
• Percentages refer to plagiarized passages; 78% is the upper bound for recall
• Results on IPAT-DC
False Negatives
• The majority of false negatives are relatively short documents (<30K chars)
• The shorter a document, the more likely it is to be a false negative
[Figure: number of documents (y-axis, 0 to 1600), false negatives vs. all documents, by text length in chars (x-axis: <10K, 10K-30K, 30K-100K, >100K)]
Evaluation on the Passage Level
| Corpus | IPAT-DC | IPAT-CC |
|---|---|---|
| Recall | 0.4552 | 0.4607 |
| Precision | 0.2183 | 0.2321 |
| F-score | 0.2876 | 0.3086 |
| Granularity | 1.22 | 1.25 |
| Overall score | 0.2358 | 0.2462 |
• Performance remains stable for both corpora
Recall and Precision vs. Text-length
• Recall is affected by decreasing text-length
– A result of the false negative distribution
[Figure: recall and precision (y-axis, 0 to 60) by text length in chars (x-axis: <10K, 10K-30K, 30K-100K, >100K)]
Conclusions
• A fully-automated approach
– Easy to follow (no text preprocessing)
– Able to detect plagiarism-free documents
– Able to detect plagiarized passage boundaries
• Nearly half of plagiarized passages are detected while precision remains low
– An increased a value can improve precision (and harm recall)
• Window length determines the shortest plagiarized passage that can be detected