TRACER TUTORIAL: TEXT REUSE DETECTION RECENT WORK
Marco Büchler, Emily Franzini and Greta Franzini (PowerPoint presentation)


  1. TRACER TUTORIAL: TEXT REUSE DETECTION RECENT WORK. Marco Büchler, Emily Franzini and Greta Franzini

  2. METHODOLOGY Basic idea: Embed historical text reuse in Shannon's Noisy Channel theorem.

  3. MICROVIEW II Source: Stefan Jänicke, eTRACES project, University of Leipzig.

  4. NOISY CHANNEL MINING I
  • Hyphen: birth-day vs. birthday; back-bone vs. backbone; zareth-shahar vs. zarethshahar
  • Composition: sea-beast vs. sea-monster (synonym); sea-gull vs. sea-mew vs. sea-hawk (cohyponym); apple-tree vs. citron-tree (cohyponym)
  • Prefix: bearing vs. childbearing
  • Suffix: ambush vs. ambushment; shimite vs. shimites
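
The hyphenation variants above can be detected with a trivial normalisation: two spellings count as hyphen variants if they become identical once hyphens are stripped. A minimal Python sketch (not TRACER's own code; function name and word list are illustrative):

```python
def is_hyphen_variant(a: str, b: str) -> bool:
    """True if the two spellings differ only in hyphenation."""
    return a.replace("-", "") == b.replace("-", "")

# word pairs from the slide; only the first three are hyphen variants
pairs = [("birth-day", "birthday"), ("back-bone", "backbone"),
         ("zareth-shahar", "zarethshahar"), ("sea-beast", "sea-monster")]
variants = [p for p in pairs if is_hyphen_variant(*p)]
```

Composition pairs such as sea-beast vs. sea-monster are not catchable this way; they need lexical resources (synonym/cohyponym lists) rather than string normalisation.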

  5. NOISY CHANNEL MINING II
  • Orthographically similar words: anathothite vs. anethothite vs. anetothite vs. annethothite vs. antothite
  • Some 4,000 word pairs containing noise are extracted but not classified. But also: punishment vs. torment
  • Any kind of negation (e.g. book Genesis, chapter 34, verse 19): not defer (ASV, KJV, Webster), without loss of time (Basic), not delay (Darby, YLT), and not wait (WEB)
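
Orthographic variants like the anathothite spellings are typically grouped by a small edit-distance threshold. A minimal sketch, assuming plain Levenshtein distance (TRACER itself may use a different similarity measure):

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert/delete/substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

spellings = ["anathothite", "anethothite", "anetothite",
             "annethothite", "antothite"]
# group spellings within edit distance 2 of the reference form
close = [w for w in spellings if levenshtein("anathothite", w) <= 2]
```

All five spellings fall within distance 2 of the reference form, which is why a purely orthographic criterion suffices here, whereas pairs like punishment vs. torment need semantic knowledge.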

  6. METHODOLOGY Basic idea: Embed historical text reuse in Shannon's Noisy Channel theorem.

  7. METHODOLOGY: NOISY CHANNEL EVALUATION I Hint: The results are ALWAYS compared between the natural texts and the randomised texts as a whole.

  8. METHODOLOGY: NOISY CHANNEL EVALUATION II
  Signal-to-noise ratio, adapted from signal and satellite engineering: $SNR = \frac{P_{signal}}{P_{noise}}$
  Signal-to-noise ratio scaled to decibels: $SNR_{dB} = 10 \cdot \log_{10}\left(\frac{P_{signal}}{P_{noise}}\right)$
  Mining ability (in dB): the mining ability describes the power of a method to distinguish natural-language structures/patterns from random noise, given a model with the same parameters: $L_{Quant}(\Theta) = 10 \cdot \log_{10}\left(\frac{|E_{D_s,\varphi_\Theta}|}{\max(1, |E_{D_m,\varphi_\Theta}|)}\right)\,\mathrm{dB}$
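
A minimal sketch of the two dB quantities on this slide, assuming the edge counts come from running the same model over the natural corpus ($D_s$) and the randomised corpus ($D_m$); the function names are illustrative:

```python
import math

def snr_db(p_signal: float, p_noise: float) -> float:
    """Signal-to-noise ratio scaled to decibels."""
    return 10 * math.log10(p_signal / p_noise)

def mining_ability(edges_natural: int, edges_random: int) -> float:
    """Mining ability in dB: reuse-graph edge counts from the natural
    vs. the randomised corpus, with max(1, .) guarding the denominator
    so an empty randomised result does not divide by zero."""
    return 10 * math.log10(edges_natural / max(1, edges_random))
```

The max(1, .) guard mirrors the formula above: a method that finds 1,000 edges in natural text but only 10 in shuffled text scores 20 dB of mining ability.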

  9. METHODOLOGY: NOISY CHANNEL EVALUATION III
  Motivation for randomisation by word shuffling:
  1. Syntax and distributional semantics are randomised and "destroyed".
  2. The distributions of words and sentence lengths remain unchanged; changes therefore depend ONLY on the destruction of 1) and are not induced by changes in the distributions.
  3. The "randomness" of the randomising method is easily measured with the entropy test: $\Delta H_n = H_{max} - H_n$. Choosing $n \in [180, 183]$ guarantees an accuracy of $\Delta H_n \leq 10^{-3}$ bit for the entropy test.
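
Point 2 can be demonstrated directly: shuffling the words of a text destroys word order but leaves the unigram distribution, and hence its Shannon entropy, untouched. A minimal sketch (toy sentence; not TRACER code):

```python
import math
import random
from collections import Counter

def unigram_entropy(words):
    """Shannon entropy (in bits) of the word distribution."""
    counts = Counter(words)
    n = len(words)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

words = "in the beginning god created the heaven and the earth".split()
shuffled = words[:]
random.shuffle(shuffled)

# word order is destroyed, but the distribution is provably identical
assert Counter(words) == Counter(shuffled)
```

Because the distribution is identical, any drop in detected reuse on the shuffled corpus must come from the destroyed syntax and semantics, not from a changed word distribution.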

  10. METHODOLOGY: TEXT RE-USE COMPRESSION $C_\Theta = \frac{\sum_{i=1}^{n} \sum_{j=1}^{m} \theta_\Theta(S_i, S_j)}{n \cdot m}$
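
The compression formula is an average of the pairwise reuse score $\theta_\Theta$ over all $n \cdot m$ segment pairs. A minimal sketch, using a hypothetical exact-match scoring function in place of a real TRACER scorer:

```python
def reuse_compression(segs_a, segs_b, theta):
    """Average pairwise reuse score theta over all n*m segment pairs,
    i.e. the C_Theta of the slide."""
    n, m = len(segs_a), len(segs_b)
    return sum(theta(si, sj) for si in segs_a for sj in segs_b) / (n * m)

# toy scoring function: 1.0 for an exact segment match, else 0.0
identical = lambda a, b: 1.0 if a == b else 0.0
score = reuse_compression(["a", "b"], ["a", "c"], identical)
```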

  11. RANDOMNESS & STRUCTURE Question: Why is the result of a randomised Digital Library typically not empty?

  12. RANDOMNESS & STRUCTURE: IMPACTS Corpus size in sentences (average sentence length is ca. 18 words). LGL is the threshold for the Log-Likelihood-Ratio.
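
The log-likelihood ratio thresholded here is commonly computed as Dunning's G² statistic over a 2×2 co-occurrence contingency table. A minimal sketch under that assumption (the function name and counts are illustrative, not TRACER's implementation):

```python
import math

def g2(k11, k12, k21, k22):
    """Dunning-style log-likelihood ratio (G^2) for a 2x2 contingency
    table, e.g. co-occurrence of two words vs. chance expectation."""
    total = k11 + k12 + k21 + k22
    rows = (k11 + k12, k21 + k22)
    cols = (k11 + k21, k12 + k22)
    obs = ((k11, k12), (k21, k22))
    g = 0.0
    for i in range(2):
        for j in range(2):
            o = obs[i][j]
            if o > 0:
                e = rows[i] * cols[j] / total  # expected count
                g += o * math.log(o / e)
    return 2 * g
```

Pairs whose G² falls below the LGL threshold are treated as chance co-occurrences, which is one reason a randomised library still yields some (spurious) structure.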

  13. TEXT REUSE IN ENGLISH BIBLE VERSIONS Why does the use of the Bible make sense? • The Bible is easy to evaluate. • There are different editions written for different purposes.

  14. TEXT REUSE IN ENGLISH BIBLE VERSIONS
  1. American Standard Version (ASV): 20th century, focus is the USA;
  2. Bible in Basic English (BBE): verses are written in a simplified language;
  3. Darby Version (DBY): created in the 19th century from Hebrew and Greek texts; completed by multiple authors after Darby's death;
  4. King James Version (KJV): one of the oldest English Bible versions (17th cent.);
  5. Webster's Revision (WBS): revision of the KJV in the 19th century;
  6. World English Bible (WEB): 21st century, global focus;
  7. Young's Literal Translation (YLT): verses follow Hebrew syntax.

  15. TEXT REUSE IN ENGLISH BIBLE VERSIONS: EVALUATION Example: book Genesis, chapter 1, verse 1. Reduced Bibles: all seven reduced Bible versions contain "only" the 28,632 verses present in all seven editions.
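
Building the reduced Bibles is a set intersection over verse identifiers: keep only verses present in every edition. A minimal sketch with hypothetical toy data (the real editions share 28,632 verses):

```python
# verse_ids: edition name -> set of verse identifiers it contains
# (toy data for illustration only)
verse_ids = {
    "ASV": {"Gen.1.1", "Gen.1.2", "Gen.34.19"},
    "BBE": {"Gen.1.1", "Gen.34.19"},
    "KJV": {"Gen.1.1", "Gen.1.2", "Gen.34.19"},
}

# verses present in every edition
common = set.intersection(*verse_ids.values())

# reduced editions: restrict each edition to the common verse set
reduced = {ed: ids & common for ed, ids in verse_ids.items()}
```

Restricting all editions to the same verse set makes recall directly comparable across the seven versions.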

  16. TEXT REUSE IN ENGLISH BIBLE VERSIONS: SETUP
  Segmentation: disjoint, verse-wise segmentation;
  Selection: max pruning with a feature density of 0.8;
  Linking: inter-Digital-Library linking (different Bible editions);
  Scoring: Broder's resemblance with a threshold of 0.6;
  Post-processing: not used.
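
Broder's resemblance is the Jaccard overlap of the two segments' shingle sets; two verses are linked when it reaches the threshold. A minimal sketch using word bigram shingles (the shingle size and example verses are illustrative, not the tutorial's exact configuration):

```python
def shingles(text: str, k: int = 2) -> set:
    """Word k-shingles of a verse."""
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def resemblance(a: str, b: str, k: int = 2) -> float:
    """Broder's resemblance: Jaccard overlap of the shingle sets."""
    sa, sb = shingles(a, k), shingles(b, k)
    return len(sa & sb) / len(sa | sb)

v1 = "in the beginning god created the heaven and the earth"
v2 = "in the beginning god made the heaven and the earth"
# link the verse pair if the resemblance reaches the 0.6 threshold
link = resemblance(v1, v2) >= 0.6
```

Here the two Genesis 1:1 variants share 7 of 11 distinct bigrams (resemblance ≈ 0.64), so the pair is linked at the 0.6 threshold.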

  17. TEXT REUSE IN ENGLISH BIBLE VERSIONS: RESULTS RECALL

  18. TEXT REUSE IN ENGLISH BIBLE VERSIONS: RECALL VS. TEXT REUSE COMPRESSION [Figure legend: With / Without]

  19. TEXT REUSE IN ENGLISH BIBLE VERSIONS: DEPENDENCY OF RECALL & TR COMPRESSION I

  20. TEXT REUSE IN ENGLISH BIBLE VERSIONS: DEPENDENCY OF RECALL & TR COMPRESSION II

  21. TEXT REUSE IN ENGLISH BIBLE VERSIONS: F-MEASURE VS. NOISY CHANNEL EVAL. I F-measure ranking: WBS, ASV, DBY, WEB, YLT, BBE. NCE ranking: WBS, ASV, DBY, WEB, BBE, YLT.

  22. MICROVIEW I Source: Stefan Jänicke, eTRACES project, University of Leipzig.

  23. DEPENDENCY OF RECALL AND TR COMPRESSION

  24. FINITO!

  25. CONTACT Team: Marco Büchler, Greta Franzini and Emily Franzini. Visit us at http://www.etrap.eu or write to contact@etrap.eu

  26. LICENCE The theme this presentation is based on is licensed under a Creative Commons Attribution-ShareAlike 4.0 International Licence. Changes to the theme are the work of eTRAP.
