TRACER TUTORIAL: TEXT REUSE DETECTION RECENT WORK
Marco Büchler, Emily Franzini and Greta Franzini (PowerPoint presentation)


  1. TRACER TUTORIAL: TEXT REUSE DETECTION RECENT WORK. Marco Büchler, Emily Franzini and Greta Franzini

  2. METHODOLOGY Basic idea: Embed historical text reuse in Shannon's Noisy Channel theorem.

  3. MICROVIEW II Source: Stefan Jänicke, eTRACES project, University of Leipzig.

  4. NOISY CHANNEL MINING I
  • Hyphen: birth-day vs. birthday; back-bone vs. backbone; zareth-shahar vs. zarethshahar
  • Composition: sea-beast vs. sea-monster (synonym); sea-gull vs. sea-mew vs. sea-hawk (cohyponym); apple-tree vs. citron-tree (cohyponym)
  • Prefix: bearing vs. childbearing
  • Suffix: ambush vs. ambushment; shimite vs. shimites
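
The hyphenation variants above can be detected with a trivial normalisation: two spellings count as hyphen variants if they become identical once hyphens are stripped. A minimal Python sketch (not TRACER's own code; function name and word list are illustrative):

```python
def is_hyphen_variant(a: str, b: str) -> bool:
    """True if the two spellings differ only in hyphenation."""
    return a.replace("-", "") == b.replace("-", "")

# word pairs from the slide; only the first three are hyphen variants
pairs = [("birth-day", "birthday"), ("back-bone", "backbone"),
         ("zareth-shahar", "zarethshahar"), ("sea-beast", "sea-monster")]
variants = [p for p in pairs if is_hyphen_variant(*p)]
```

Composition pairs such as sea-beast vs. sea-monster are not catchable this way; they need lexical resources (synonym/cohyponym lists) rather than string normalisation.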

  5. NOISY CHANNEL MINING II
  • Orthographically similar words: anathothite vs. anethothite vs. anetothite vs. annethothite vs. antothite
  • Some 4,000 word pairs containing noise are extracted but not classified. But also: punishment vs. torment
  • Any kind of negation (e.g. book Genesis, chapter 34, verse 19): not defer (ASV, KJV, Webster), without loss of time (Basic), not delay (Darby, YLT), and not wait (WEB)
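
Orthographic variants like the anathothite spellings are typically grouped by a small edit-distance threshold. A minimal sketch, assuming plain Levenshtein distance (TRACER itself may use a different similarity measure):

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert/delete/substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

spellings = ["anathothite", "anethothite", "anetothite",
             "annethothite", "antothite"]
# group spellings within edit distance 2 of the reference form
close = [w for w in spellings if levenshtein("anathothite", w) <= 2]
```

All five spellings fall within distance 2 of the reference form, which is why a purely orthographic criterion suffices here, whereas pairs like punishment vs. torment need semantic knowledge.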

  6. METHODOLOGY Basic idea: Embed historical text reuse in Shannon's Noisy Channel theorem.

  7. METHODOLOGY: NOISY CHANNEL EVALUATION I Hint: The results are ALWAYS compared between the natural texts and the randomised texts as a whole.

  8. METHODOLOGY: NOISY CHANNEL EVALUATION II
  Signal-to-noise ratio, adapted from signal and satellite engineering: $SNR = \frac{P_{signal}}{P_{noise}}$
  Signal-to-noise ratio scaled to decibels: $SNR_{dB} = 10 \cdot \log_{10}\left(\frac{P_{signal}}{P_{noise}}\right)$
  Mining ability (in dB): the mining ability describes the power of a method to distinguish natural-language structures/patterns from random noise, given a model with the same parameters: $L_{Quant}(\Theta) = 10 \cdot \log_{10}\left(\frac{|E_{D_s,\varphi_\Theta}|}{\max(1, |E_{D_m,\varphi_\Theta}|)}\right)\,\mathrm{dB}$
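
A minimal sketch of the two dB quantities on this slide, assuming the edge counts come from running the same model over the natural corpus ($D_s$) and the randomised corpus ($D_m$); the function names are illustrative:

```python
import math

def snr_db(p_signal: float, p_noise: float) -> float:
    """Signal-to-noise ratio scaled to decibels."""
    return 10 * math.log10(p_signal / p_noise)

def mining_ability(edges_natural: int, edges_random: int) -> float:
    """Mining ability in dB: reuse-graph edge counts from the natural
    vs. the randomised corpus, with max(1, .) guarding the denominator
    so an empty randomised result does not divide by zero."""
    return 10 * math.log10(edges_natural / max(1, edges_random))
```

The max(1, .) guard mirrors the formula above: a method that finds 1,000 edges in natural text but only 10 in shuffled text scores 20 dB of mining ability.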

  9. METHODOLOGY: NOISY CHANNEL EVALUATION III
  Motivation for randomisation by word shuffling:
  1. Syntax and distributional semantics are randomised and "destroyed".
  2. The distributions of words and sentence lengths remain unchanged; changes therefore depend ONLY on the destruction of 1) and are not induced by changes in the distributions.
  3. The "randomness" of the randomising method is easily measured with the entropy test: $\Delta H_n = H_{max} - H_n$. Choosing $n \in [180, 183]$ guarantees an accuracy of $\Delta H_n \leq 10^{-3}$ bit for the entropy test.
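
Point 2 can be demonstrated directly: shuffling the words of a text destroys word order but leaves the unigram distribution, and hence its Shannon entropy, untouched. A minimal sketch (toy sentence; not TRACER code):

```python
import math
import random
from collections import Counter

def unigram_entropy(words):
    """Shannon entropy (in bits) of the word distribution."""
    counts = Counter(words)
    n = len(words)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

words = "in the beginning god created the heaven and the earth".split()
shuffled = words[:]
random.shuffle(shuffled)

# word order is destroyed, but the distribution is provably identical
assert Counter(words) == Counter(shuffled)
```

Because the distribution is identical, any drop in detected reuse on the shuffled corpus must come from the destroyed syntax and semantics, not from a changed word distribution.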

  10. METHODOLOGY: TEXT RE-USE COMPRESSION $C_\Theta = \frac{\sum_{i=1}^{n} \sum_{j=1}^{m} \theta_\Theta(S_i, S_j)}{n \cdot m}$
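
The compression formula is an average of the pairwise reuse score $\theta_\Theta$ over all $n \cdot m$ segment pairs. A minimal sketch, using a hypothetical exact-match scoring function in place of a real TRACER scorer:

```python
def reuse_compression(segs_a, segs_b, theta):
    """Average pairwise reuse score theta over all n*m segment pairs,
    i.e. the C_Theta of the slide."""
    n, m = len(segs_a), len(segs_b)
    return sum(theta(si, sj) for si in segs_a for sj in segs_b) / (n * m)

# toy scoring function: 1.0 for an exact segment match, else 0.0
identical = lambda a, b: 1.0 if a == b else 0.0
score = reuse_compression(["a", "b"], ["a", "c"], identical)
```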

  11. RANDOMNESS & STRUCTURE Question: Why is the result of a randomised Digital Library typically not empty?

  12. RANDOMNESS & STRUCTURE: IMPACTS Corpus size in sentences (average sentence length is ca. 18 words). LGL is the threshold for the Log-Likelihood-Ratio.
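
The log-likelihood ratio thresholded here is commonly computed as Dunning's G² statistic over a 2×2 co-occurrence contingency table. A minimal sketch under that assumption (the function name and counts are illustrative, not TRACER's implementation):

```python
import math

def g2(k11, k12, k21, k22):
    """Dunning-style log-likelihood ratio (G^2) for a 2x2 contingency
    table, e.g. co-occurrence of two words vs. chance expectation."""
    total = k11 + k12 + k21 + k22
    rows = (k11 + k12, k21 + k22)
    cols = (k11 + k21, k12 + k22)
    obs = ((k11, k12), (k21, k22))
    g = 0.0
    for i in range(2):
        for j in range(2):
            o = obs[i][j]
            if o > 0:
                e = rows[i] * cols[j] / total  # expected count
                g += o * math.log(o / e)
    return 2 * g
```

Pairs whose G² falls below the LGL threshold are treated as chance co-occurrences, which is one reason a randomised library still yields some (spurious) structure.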

  13. TEXT REUSE IN ENGLISH BIBLE VERSIONS Why does the use of the Bible make sense? • The Bible is easy to evaluate. • There are different editions written for different purposes.

  14. TEXT REUSE IN ENGLISH BIBLE VERSIONS
  1. American Standard Version (ASV): 20th century, focus is the USA;
  2. Bible in Basic English (BBE): verses are written in a simplified language;
  3. Darby Version (DBY): created in the 19th century from Hebrew and Greek texts; completed by multiple authors after Darby's death;
  4. King James Version (KJV): one of the oldest English Bible versions (17th cent.);
  5. Webster's Revision (WBS): revision of the KJV in the 19th century;
  6. World English Bible (WEB): 21st century, global focus;
  7. Young's Literal Translation (YLT): verses follow Hebrew syntax.

  15. TEXT REUSE IN ENGLISH BIBLE VERSIONS: EVALUATION Example: book Genesis, chapter 1, verse 1. Reduced Bibles: all seven reduced Bible versions contain "only" the 28,632 verses present in all seven editions.
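
Building the reduced Bibles is a set intersection over verse identifiers: keep only verses present in every edition. A minimal sketch with hypothetical toy data (the real editions share 28,632 verses):

```python
# verse_ids: edition name -> set of verse identifiers it contains
# (toy data for illustration only)
verse_ids = {
    "ASV": {"Gen.1.1", "Gen.1.2", "Gen.34.19"},
    "BBE": {"Gen.1.1", "Gen.34.19"},
    "KJV": {"Gen.1.1", "Gen.1.2", "Gen.34.19"},
}

# verses present in every edition
common = set.intersection(*verse_ids.values())

# reduced editions: restrict each edition to the common verse set
reduced = {ed: ids & common for ed, ids in verse_ids.items()}
```

Restricting all editions to the same verse set makes recall directly comparable across the seven versions.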

  16. TEXT REUSE IN ENGLISH BIBLE VERSIONS: SETUP
  Segmentation: disjoint, verse-wise segmentation;
  Selection: max pruning with a feature density of 0.8;
  Linking: inter-Digital-Library linking (different Bible editions);
  Scoring: Broder's resemblance with a threshold of 0.6;
  Post-processing: not used.
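
Broder's resemblance is the Jaccard overlap of the two segments' shingle sets; two verses are linked when it reaches the threshold. A minimal sketch using word bigram shingles (the shingle size and example verses are illustrative, not the tutorial's exact configuration):

```python
def shingles(text: str, k: int = 2) -> set:
    """Word k-shingles of a verse."""
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def resemblance(a: str, b: str, k: int = 2) -> float:
    """Broder's resemblance: Jaccard overlap of the shingle sets."""
    sa, sb = shingles(a, k), shingles(b, k)
    return len(sa & sb) / len(sa | sb)

v1 = "in the beginning god created the heaven and the earth"
v2 = "in the beginning god made the heaven and the earth"
# link the verse pair if the resemblance reaches the 0.6 threshold
link = resemblance(v1, v2) >= 0.6
```

Here the two Genesis 1:1 variants share 7 of 11 distinct bigrams (resemblance ≈ 0.64), so the pair is linked at the 0.6 threshold.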

  17. TEXT REUSE IN ENGLISH BIBLE VERSIONS: RESULTS RECALL

  18. TEXT REUSE IN ENGLISH BIBLE VERSIONS: RECALL VS. TEXT REUSE COMPRESSION [Figure legend: With / Without]

  19. TEXT REUSE IN ENGLISH BIBLE VERSIONS: DEPENDENCY OF RECALL & TR COMPRESSION I

  20. TEXT REUSE IN ENGLISH BIBLE VERSIONS: DEPENDENCY OF RECALL & TR COMPRESSION II

  21. TEXT REUSE IN ENGLISH BIBLE VERSIONS: F-MEASURE VS. NOISY CHANNEL EVAL. I F-measure ranking: WBS, ASV, DBY, WEB, YLT, BBE. NCE ranking: WBS, ASV, DBY, WEB, BBE, YLT.

  22. MICROVIEW I Source: Stefan Jänicke, eTRACES project, University of Leipzig.

  23. DEPENDENCY OF RECALL AND TR COMPRESSION

  24. FINITO!

  25. CONTACT Team: Marco Büchler, Greta Franzini and Emily Franzini. Visit us at http://www.etrap.eu or write to contact@etrap.eu

  26. LICENCE The theme this presentation is based on is licensed under a Creative Commons Attribution-ShareAlike 4.0 International Licence. Changes to the theme are the work of eTRAP.
