TRACER TUTORIAL: TEXT REUSE DETECTION PREPROCESSING
Marco Büchler, Emily Franzini and Greta Franzini
TABLE OF CONTENTS
1. What is preprocessing?
2. Preprocessing techniques
3. Hacking
4. Conclusion and revision
HACKING, INSTALLATION & CONFIGURATION GUIDE FOR TRACER
1. Download TRACER from http://etrap.eu/tracer/ to your storage folder, e.g. /roedel/mbuechler
2. Using the command line, navigate to your storage folder with the cd command
3. Unzip the archive: gunzip tracer.tar.gz
4. Untar the archive: tar -xvf tracer.tar
5. Change to the TRACER folder: cd TRACER
6. Open the configuration file with vim conf/tracer_config.xml
7. Configure your input file
HACKING: STARTING TRACER
Start the tool with the command:
java -Xmx600m -Deu.etrap.medusa.config.ClassConfig=conf/tracer_config.xml -jar tracer.jar
Explanation:
• -Xmx600m allows the JVM to use up to 600 MB of memory;
• -Dfile.encoding sets the encoding of your input file (optional);
• -Deu.etrap.medusa.config.ClassConfig points to the configuration file.
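The flags above can be assembled into a small launcher script. This is a minimal sketch: the memory size, config path and encoding are example values from the slides, and the command is printed rather than executed so you can check it first.

```python
# Sketch: assemble the TRACER launch command from the flags explained above.
# Memory size, encoding and config path are example values; adjust to your setup.
memory = "600m"                      # -Xmx: upper bound on JVM heap
encoding = "UTF-8"                   # -Dfile.encoding (optional)
config = "conf/tracer_config.xml"    # -Deu.etrap.medusa.config.ClassConfig

cmd = [
    "java",
    f"-Xmx{memory}",
    f"-Dfile.encoding={encoding}",
    f"-Deu.etrap.medusa.config.ClassConfig={config}",
    "-jar", "tracer.jar",
]
print(" ".join(cmd))
# To actually launch TRACER, run this from the TRACER folder, e.g.:
#   subprocess.run(cmd, check=True)
```

Building the argument list explicitly (rather than one shell string) avoids quoting problems when paths contain spaces.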
WHAT IS PREPROCESSING?
REMINDER: CURRENT APPROACH
PRE-STEP: SEGMENTATION - AN EXAMPLE
PRE-STEP: SEGMENTATION
QUESTION
What do you associate with preprocessing?
FOUNDATIONS FOR PREPROCESSING: ZIPFIAN LAW
IMPLICATIONS OF THE ZIPFIAN LAW
• Approx. 50% of all words occur only once
• Approx. 16% of all words occur only twice
• Approx. 8% of all words occur three times
• ...
• Approx. 90% of all words in a corpus occur 10 times or less

The frequency spectrum behind these figures:
s(f) = 1 / (f · (f + 1)),    s_n = Σ_{f=1}^{n} 1 / (f · (f + 1))

• The top 300-700 most frequent words already cover about 50% of all tokens (depending on the language)
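The percentages on this slide can be checked directly against the formula: s(f) is the share of word types occurring exactly f times, and the cumulative sum telescopes to n/(n+1).

```python
# Sketch: the frequency spectrum implied by the Zipfian law above.
# s(f) = share of word types that occur exactly f times.
def s(f: int) -> float:
    return 1.0 / (f * (f + 1))

print(f"once:   {s(1):.1%}")   # ~50%
print(f"twice:  {s(2):.1%}")   # ~16.7%
print(f"thrice: {s(3):.1%}")   # ~8.3%

# The cumulative share of types occurring at most n times telescopes
# to n / (n + 1), so for n = 10 it is 10/11, i.e. about 90.9%.
cumulative_10 = sum(s(f) for f in range(1, 11))
print(f"10 times or less: {cumulative_10:.1%}")
```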
QUESTION
What does lemmatisation mean for this plot?
PREPROCESSING TECHNIQUES
PREPROCESSING
PREPROCESSING: DIRECTED GRAPH NORMALISATION
E.g. lemmatisation
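Directed normalisation maps many word forms onto one base form, i.e. the graph edges all point from variant to lemma. A minimal sketch with a toy mapping (not TRACER's data):

```python
# Sketch: lemmatisation as a directed many-to-one mapping from word form
# to base form. The mapping below is a toy example, not TRACER's data.
lemma_of = {
    "ran": "run", "runs": "run", "running": "run",
    "better": "good",
}

def lemmatise(tokens):
    # Unknown forms fall back to themselves (a simple, common policy).
    return [lemma_of.get(t, t) for t in tokens]

print(lemmatise(["she", "runs", "better"]))  # ['she', 'run', 'good']
```

The direction matters: "runs" normalises to "run", but nothing maps "run" back to "runs".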
PREPROCESSING: UNDIRECTED GRAPH NORMALISATION
E.g. synonyms, string similarity
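In the undirected case no word is inherently the base form: words are linked because they resemble each other, and one member of each group is picked as the representative. A sketch using Python's `difflib` similarity ratio; the 0.8 threshold and the word list are illustrative assumptions, not TRACER's algorithm.

```python
# Sketch: undirected normalisation by string similarity. Words whose
# similarity exceeds a threshold are grouped, and every member of a
# group is rewritten to one canonical representative.
from difflib import SequenceMatcher

def similar(a: str, b: str, threshold: float = 0.8) -> bool:
    return SequenceMatcher(None, a, b).ratio() >= threshold

words = ["colour", "color", "colors", "table"]
canonical = {}
for w in words:
    # Attach w to the first already-chosen representative it resembles
    # (an undirected link); otherwise w starts its own group.
    match = next((c for c in canonical.values() if similar(w, c)), w)
    canonical[w] = match

print(canonical)
# {'colour': 'colour', 'color': 'colour', 'colors': 'colour', 'table': 'table'}
```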
HACKING
HACKING
Tasks:
• Run on your texts ...
1. ... without preprocessing
2. ... 1) + lemmatisation
3. ... 2) + synonym replacement
HACKING
Questions:
• Compare the input file with the *.prep file for all preprocessing techniques. Which methods seem to work best for you? Which make no sense for the dataset?
• Compare all *.meta files, which contain some statistics. How many words were changed, and by which method?
• (optional and advanced) What is the number of word types for each preprocessing technique? (This can be derived from the first column of *.prep.inv.)
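For the optional question, counting word types amounts to counting distinct values in the first column. A hedged sketch: the assumed file layout (one entry per line, whitespace-separated, word type first) is an illustration, not verified against TRACER's *.prep.inv documentation.

```python
# Sketch: counting word types from the first column of a *.prep.inv file.
# Assumption: one entry per line, whitespace-separated, type in column 1.
def count_types(lines):
    return len({line.split()[0] for line in lines if line.strip()})

sample = [
    "lord\t12\t...",
    "god\t9\t...",
    "lord\t3\t...",   # a repeated type counts only once
]
print(count_types(sample))  # 2

# On a real file:
# with open("Bible.prep.inv", encoding="utf-8") as fh:
#     print(count_types(fh))
```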
PREPROCESSING 1) WITHOUT PREPROCESSING
Hint:
• The configuration file can be found in: $TRACER_HOME/conf/tracer_config.xml
• All preprocessing properties are set to false.
PREPROCESSING 2) REMOVING DIACRITICS
Hint:
• boolRemoveDiachritics is switched on by setting its value to true.
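The usual technique behind diacritic removal is Unicode decomposition: split each character into base letter plus combining marks, then drop the marks. TRACER's boolRemoveDiachritics may differ in detail; this sketch only shows the general idea.

```python
# Sketch of diacritic removal: decompose to NFD, drop combining marks.
import unicodedata

def remove_diacritics(text: str) -> str:
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(remove_diacritics("Büchler naïve café"))  # Buchler naive cafe
```

Note that this keeps the base letters, so "ü" becomes "u" rather than disappearing.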
PREPROCESSING 4) LEMMATISING TEXT
Hint:
• boolLemmatisation is switched on by setting its value to true.
• Lemmatisation can be configured by:
<property name="BASEFORM_FILE_NAME" value="data/corpora/Bible/Bible.lemma"/>
PREPROCESSING 5) SYNONYM HANDLING
Hint:
• boolReplaceSynonyms is switched on by setting its value to true.
• Synonyms can be configured by:
<property name="SYNONYMS_FILE_NAME" value="data/corpora/Bible/Bible.syns"/>
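Conceptually, synonym handling collapses each synonym set onto one representative so that reuse is still detected when an author swapped a word for a synonym. A sketch: the comma-separated, first-word-is-representative file format below is an assumption for illustration, not the documented Bible.syns format.

```python
# Sketch: synonym replacement as normalisation. Each synonym set is
# collapsed onto one representative word.
def load_synonyms(lines):
    # Assumed format: one synonym set per line, comma-separated,
    # with the first word acting as the representative.
    mapping = {}
    for line in lines:
        words = [w.strip() for w in line.split(",") if w.strip()]
        for w in words[1:]:
            mapping[w] = words[0]
    return mapping

syns = load_synonyms(["big, large, huge", "begin, start, commence"])
tokens = ["huge", "waves", "start", "here"]
print([syns.get(t, t) for t in tokens])  # ['big', 'waves', 'begin', 'here']
```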
PREPROCESSING 6) STRING SIMILARITY FOR NORMALISING VARIANTS
Hint:
• boolReplaceStringSimilarWords is switched on by setting its value to true.
• Thresholds:
<property name="SYNONYMS_FILE_NAME" value="data/corpora/Bible/Bible.syns"/>
OPEN ISSUE: FRAGMENTARY WORDS
OPEN ISSUE: FRAGMENTARY WORDS - DEALING WITH GAPS AND LEIDEN CONVENTIONS
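One practical way to approach this open issue is to strip the editorial markup before matching, keeping restored letters but marking unrestorable gaps. The bracket conventions handled below are a simplified assumption; real Leiden-convention texts use a much richer inventory (sublinear dots, half-brackets, etc.).

```python
# Sketch: normalising Leiden-style editorial markup in fragmentary text.
# [ab] = restored letters (kept), [...] or [---] = lost text (gap marker),
# ( ) = resolved abbreviations, < > = editorial additions.
import re

def strip_leiden(text: str) -> str:
    # Replace unrestorable gaps with a placeholder token first, ...
    text = re.sub(r"\[(?:\.\.\.|-+)\]", " #gap# ", text)
    # ... then drop the remaining bracket characters but keep their content.
    text = re.sub(r"[\[\]<>()]", "", text)
    return " ".join(text.split())

print(strip_leiden("in the begi[nn]ing [...] created"))
# in the beginning #gap# created
```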
GAP BETWEEN KNOWLEDGE AND EXPERIENCE
CONCLUSION AND REVISION
CHECK
Statement:
• "My lemmatisation tool <XYZ> is able to compute the base forms of 80% of all tokens in a corpus."
Good or bad?
CHECK
Fact file:
• Language variants
• Different writing styles
• (Some) dialects
• Diacritics
• OCR errors
Question: What's the difference for you?
CHECK
Fact file:
• Language variants
• Different writing styles
• (Some) dialects
• Diacritics
• OCR errors
Question: What do you think is the difference for the computer?
IMPORTANCE OF PREPROCESSING
• Preprocessing cleans and harmonises the data.
• When working with a new corpus (not only a new language, but also the same language in a different epoch or geographical region), cleaning/harmonising the data can take up to 70% of the overall time.
Preprocessing mantra: Garbage in, garbage out.
FINITO!
CONTACT
Team: Marco Büchler, Greta Franzini and Emily Franzini.
Visit us: http://www.etrap.eu
contact@etrap.eu
LICENCE
The theme this presentation is based on is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. Changes to the theme are the work of eTRAP.