  1. TRACER TUTORIAL: TEXT REUSE DETECTION PREPROCESSING
     Marco Büchler, Emily Franzini and Greta Franzini

  2. TABLE OF CONTENTS
     1. What is preprocessing?
     2. Preprocessing techniques
     3. Hacking
     4. Conclusion and revision

  3. HACKING, INSTALLATION & CONFIGURATION GUIDE FOR TRACER
     1. Download TRACER from http://etrap.eu/tracer/ to your storage folder, e.g. /roedel/mbuechler
     2. Using the command line, navigate to your storage folder with the cd command
     3. Unzip the archive: gunzip tracer.tar.gz
     4. Untar the archive: tar -xvf tracer.tar
     5. Change to the TRACER folder: cd TRACER
     6. Open the configuration file with vim conf/tracer_config.xml
     7. Configure your input file (steps 2-6 are collected in the shell transcript below)
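
     A minimal sketch of steps 2-6 as one shell session, assuming the example storage folder /roedel/mbuechler from step 1:

        cd /roedel/mbuechler        # navigate to the storage folder
        gunzip tracer.tar.gz        # decompress the downloaded archive
        tar -xvf tracer.tar         # unpack it
        cd TRACER                   # change to the TRACER folder
        vim conf/tracer_config.xml  # open the configuration file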

  4. HACKING: STARTING TRACER
     Start the tool with the command:
        java -Xmx600m -Deu.etrap.medusa.config.ClassConfig=conf/tracer_config.xml -jar tracer.jar
     Explanation:
     • -Xmx600m grants the JVM up to 600 MB of memory;
     • -Dfile.encoding sets the encoding of your input file (optional);
     • -Deu.etrap.medusa.config.ClassConfig points to the configuration file.
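
     For example, with the optional encoding flag included (UTF-8 is an assumption here; substitute the encoding of your own input file):

        java -Xmx600m -Dfile.encoding=UTF-8 \
             -Deu.etrap.medusa.config.ClassConfig=conf/tracer_config.xml \
             -jar tracer.jar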

  5. WHAT IS PREPROCESSING?

  6. REMINDER: CURRENT APPROACH

  7. PRE-STEP: SEGMENTATION - AN EXAMPLE

  8. PRE-STEP: SEGMENTATION

  9. QUESTION: What do you associate with preprocessing?

  10. FOUNDATIONS FOR PREPROCESSING: ZIPFIAN LAW

  11. IMPLICATIONS OF THE ZIPFIAN LAW
      • Approx. 50% of all words occur only once
      • Approx. 16% of all words occur only twice
      • Approx. 8% of all words occur three times
      • ...
      • Approx. 90% of all words in a corpus occur 10 times or less
      s(f) = 1 / (f · (f + 1))          s_n = Σ_{f=1}^{n} 1 / (f · (f + 1))
      (s(f) is the share of word types occurring exactly f times; s_n is the cumulative share up to frequency n.)
      • The top 300-700 most frequent words already cover about 50% of all tokens (depending on the language)
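
      As a quick check, the bullet points follow directly from the formula (a worked example, not on the slide):
         s(1) = 1/(1·2) = 1/2  ≈ 50%      (words occurring once)
         s(2) = 1/(2·3) = 1/6  ≈ 16%      (words occurring twice)
         s(3) = 1/(3·4) = 1/12 ≈ 8%       (words occurring three times)
      The sum telescopes, s_n = 1 - 1/(n+1) = n/(n+1), so s_10 = 10/11 ≈ 90.9%: about 90% of all words occur 10 times or less.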

  12. QUESTION: What does lemmatisation mean for this plot?

  13. PREPROCESSING TECHNIQUES

  14. PREPROCESSING

  15. PREPROCESSING: DIRECTED GRAPH NORMALISATION (e.g. lemmatisation)

  16. PREPROCESSING: UNDIRECTED GRAPH NORMALISATION (e.g. synonyms, string similarity)

  17. HACKING

  18. HACKING
      Tasks:
      • Run TRACER on your texts ...
        1. ... without preprocessing
        2. ... 1) + lemmatisation
        3. ... 2) + synonym replacement
      (A sketch of the three runs follows below.)
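
      One possible way to organise the three runs, assuming one copy of the configuration file per setting (the conf/run*.xml file names are illustrative, not part of TRACER):

         # 1) no preprocessing: all bool* switches set to false
         java -Xmx600m -Deu.etrap.medusa.config.ClassConfig=conf/run1_none.xml -jar tracer.jar
         # 2) as 1), plus boolLemmatisation set to true
         java -Xmx600m -Deu.etrap.medusa.config.ClassConfig=conf/run2_lemma.xml -jar tracer.jar
         # 3) as 2), plus boolReplaceSynonyms set to true
         java -Xmx600m -Deu.etrap.medusa.config.ClassConfig=conf/run3_syns.xml -jar tracer.jar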

  19. HACKING
      Questions:
      • Compare the input file with the *.prep file for all preprocessing techniques. Which methods seem to work best for you? Which make no sense for your dataset?
      • Compare all *.meta files, which contain some numbers. How many words have changed, and through which method?
      • (optional and advanced) What is the number of word types for each preprocessing technique? This can be derived from the first column of *.prep.inv (see the one-liner below).
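
      One possible way to count the word types, assuming each line of *.prep.inv starts with one word type in its first whitespace-separated column (an assumption about the file layout; the file name is illustrative):

         awk '{print $1}' output.prep.inv | sort -u | wc -l   # count distinct first-column entries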

  20. PREPROCESSING: 1) WITHOUT PREPROCESSING
      Hint:
      • The configuration file can be found in: $TRACER_HOME/conf/tracer_config.xml
      • All values show false (see the sketch below).
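
      A minimal sketch of this baseline, switching off the four properties named on the following slides (the surrounding file content is omitted):

         <property name="boolRemoveDiachritics" value="false"/>
         <property name="boolLemmatisation" value="false"/>
         <property name="boolReplaceSynonyms" value="false"/>
         <property name="boolReplaceStringSimilarWords" value="false"/>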

  21. PREPROCESSING: 2) REMOVING DIACRITICS
      Hint:
      • boolRemoveDiachritics is switched on with the value true.
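
      In the configuration file this corresponds to (following the property syntax shown on the next slides):

         <property name="boolRemoveDiachritics" value="true"/>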

  22. PREPROCESSING: 4) LEMMATISING TEXT
      Hint:
      • boolLemmatisation is switched on with the value true.
      • Lemmatisation can be configured with:
        <property name="BASEFORM_FILE_NAME" value="data/corpora/Bible/Bible.lemma"/>

  23. PREPROCESSING: 5) SYNONYM HANDLING
      Hint:
      • boolReplaceSynonyms is switched on with the value true.
      • Synonyms can be configured with:
        <property name="SYNONYMS_FILE_NAME" value="data/corpora/Bible/Bible.syns"/>

  24. PREPROCESSING: 6) STRING SIMILARITY FOR NORMALISING VARIANTS
      Hint:
      • boolReplaceStringSimilarWords is switched on with the value true.
      • Thresholds:
        <property name="SYNONYMS_FILE_NAME" value="data/corpora/Bible/Bible.syns"/>

  25. OPEN ISSUE: FRAGMENTARY WORDS

  26. OPEN ISSUE: FRAGMENTARY WORDS - DEALING WITH GAPS AND LEIDEN CONVENTIONS

  27. GAP BETWEEN KNOWLEDGE AND EXPERIENCE

  28. CONCLUSION AND REVISION

  29. CHECK
      Statement:
      • "My lemmatisation tool <XYZ> is able to compute the base forms of 80% of all tokens in a corpus."
      Good or bad?

  30. CHECK
      Fact file:
      • Language variants
      • Different writing styles
      • (Some) dialects
      • Diacritics
      • OCR errors
      Question: What's the difference for you?

  31. CHECK
      Fact file:
      • Language variants
      • Different writing styles
      • (Some) dialects
      • Diacritics
      • OCR errors
      Question: What do you think is the difference for the computer?

  32. IMPORTANCE OF PREPROCESSING
      • Cleaning and harmonising the data.
      • When working with a new corpus (not only a new language, but also the same language in a different epoch or geographical region), cleaning and harmonising the data can take up to 70% of the overall time.
      Preprocessing mantra: garbage in, garbage out.

  33. FINITO!

  34. CONTACT
      Team: Marco Büchler, Greta Franzini and Emily Franzini.
      Visit us at http://www.etrap.eu
      contact@etrap.eu

  35. LICENCE
      The theme this presentation is based on is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. Changes to the theme are the work of eTRAP.
