TRACER - Preprocessing
Marco Büchler, Emily Franzini, Greta Franzini, Maria Moritz
eTRAP Research Group
Göttingen Centre for Digital Humanities
Institute of Computer Science
Georg August University Göttingen, Germany
2015 DH Estonia – Text Reuse Hackathon, 20 October 2015

Hacking – Installation & configuration guide for TRACER
1) Copy Tracer from /storage/tracer.tar.gz to your storage folder, such as /storage/mbuechler
2) Change to your storage folder with the cd command
3) Unzip the archive: gunzip tracer.tar.gz
4) Untar the archive: tar -xvf tracer.tar
5) Change to the Tracer folder: cd Tracer
6) Open the config file with vim conf/tracer_config.xml
7) Configure your input file:

Hacking - Starting TRACER
1) Start the tool with the command:
java -Xmx600m -Dde.gcdh.medusa.config.ClassConfig=conf/tracer_config.xml -jar tracer.jar
Explanation:
– -Xmx600m allows up to 600 MB of memory
– -Dfile.encoding (optional, not part of the command above) sets the encoding of your input file
– -Dde.gcdh.medusa.config.ClassConfig names the configuration file

Overview
• What is preprocessing?
• Overview of preprocessing techniques
• Hacking
• Conclusion with some test questions

Reminder: Current approach

Pre-step: Segmentation - an example

Pre-step: Segmentation

Question
What do you associate with preprocessing?

Foundations for preprocessing – Zipfian Law

Implications of the Zipfian Law
• Approx. 50% of all words (word types) occur only once
• Approx. 16% of all words occur only twice
• Approx. 8% of all words occur three times
• ...
• Approx. 90% of all words in a corpus occur 10 times or fewer
• The top 300 – 700 most frequent words already cover about 50% of all tokens (depending on the language)
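
The rules of thumb above can be checked on any tokenised text. The following is a minimal Python sketch, not part of TRACER: the function name frequency_profile and the toy sentence are made up for illustration, and a text this small will not reproduce the percentages quoted on the slide.

```python
from collections import Counter

def frequency_profile(tokens, top_k=3):
    """Summarise a token list in the terms used by the Zipfian rules of thumb."""
    counts = Counter(tokens)
    n_types = len(counts)    # distinct words
    n_tokens = len(tokens)   # running words
    # Share of word types that occur exactly once.
    hapax_share = sum(1 for c in counts.values() if c == 1) / n_types
    # Share of all tokens covered by the top_k most frequent types.
    top_tokens = sum(c for _, c in counts.most_common(top_k))
    return {
        "types": n_types,
        "tokens": n_tokens,
        "hapax_share": hapax_share,
        "top_k_coverage": top_tokens / n_tokens,
    }

# Toy input, invented for illustration only.
toy = "the cat and the dog and the bird saw the fox".split()
profile = frequency_profile(toy, top_k=2)
print(profile)
```

On a real corpus, raising top_k to a few hundred shows how quickly a small set of frequent words accounts for a large share of the tokens.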

Question
• What does lemmatisation mean for this plot?

Preprocessing

Preprocessing: Directed Graph Normalisation
• e.g. lemmatisation

Preprocessing: Undirected Graph Normalisation
• e.g. synonyms, string similarity

Hacking
• Tasks:
– Run on your texts ...
1) ... without preprocessing
2) ... 1) + lemmatisation
3) ... 2) + synonym replacement

Hacking
• Questions:
– Compare the input file with the *.prep file for all preprocessing techniques. Which method seems to work best for you? Which makes no sense for the dataset?
– Compare all *.meta files, which contain some numbers. How many words have changed, and by which method?
– (optional and advanced) What is the number of word types for each preprocessing technique? (It can be derived from the first column of *.prep.inv.)
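
The comparison asked for here boils down to two counts: how many tokens a preprocessing step changed, and how many word types remain afterwards. The Python sketch below works on in-memory token lists only. It does not read the *.prep, *.meta or *.prep.inv files, whose layout is not described in these slides, and it assumes that preprocessing replaces tokens one-to-one. The function name and the sample tokens are hypothetical.

```python
def compare_preprocessing(original, processed):
    """Count changed tokens and word types before and after preprocessing.

    Assumes both lists are aligned position by position.
    """
    if len(original) != len(processed):
        raise ValueError("token lists must be aligned one-to-one")
    changed = sum(1 for a, b in zip(original, processed) if a != b)
    return {
        "changed_tokens": changed,
        "types_before": len(set(original)),
        "types_after": len(set(processed)),
    }

# Invented sample: lower-casing plus lemmatisation applied by hand.
original = ["God", "created", "the", "heavens", "and",
            "God", "creates", "the", "heaven"]
processed = ["god", "create", "the", "heaven", "and",
             "god", "create", "the", "heaven"]
stats = compare_preprocessing(original, processed)
print(stats)
```

A drop in the number of word types is the effect normalisation is meant to have: different surface forms collapse onto one form.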

Preprocessing – 1) without preprocessing
• Hint:
– The configuration file can be found in ${TRACER_HOME}/conf/tracer_config.xml
– All values show "false"
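
Switching a preprocessing step on or off means changing one value in the configuration file. The Python sketch below edits a small XML string with the standard library instead of vim. The enclosing config element and the idea that the boolean switches are stored as property elements are assumptions made for this example; only the property element form and the switch names appear in these slides.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment; the real file layout may differ.
SAMPLE = """<config>
  <property name="boolLemmatisation" value="false" />
  <property name="boolReplaceSynonyms" value="false" />
</config>"""

def set_property(root, name, value):
    """Set value on every <property> with the given name; return how many were updated."""
    hits = [p for p in root.iter("property") if p.get("name") == name]
    for p in hits:
        p.set("value", value)
    return len(hits)

root = ET.fromstring(SAMPLE)
updated = set_property(root, "boolLemmatisation", "true")
values = {p.get("name"): p.get("value") for p in root.iter("property")}
print(updated, values)
```

A return value of 0 signals that the name was not found, which guards against silently misspelling a switch.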

Preprocessing – 2) Removing diacritics
• Hint:
– BoolRemoveDiachritics is switched on with the value true
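
To make the effect of this switch concrete, here is a minimal Python sketch of diacritics removal based on Unicode decomposition. It illustrates the general technique; how TRACER implements the step internally is not stated in these slides, and the function name is hypothetical.

```python
import unicodedata

def remove_diacritics(text):
    """Strip combining marks after canonical decomposition (NFD)."""
    decomposed = unicodedata.normalize("NFD", text)
    # unicodedata.combining() is non-zero for combining marks such as accents.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", stripped)

print(remove_diacritics("Göttingen"))
print(remove_diacritics("ἐπιστρατήγῳ"))
```

Note that this treats every accent alike. For languages where a diacritic distinguishes two different words, removing it merges them, which is one reason the step is optional.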

Preprocessing – 3) Lower case
• Hint:
– boolMakeAllLowerCase is switched on with the value true

Preprocessing – 4) Lemmatising text
• Hint:
– boolLemmatisation is switched on with the value true
– Lemmatisation can be configured by <property name="BASEFORM_FILE_NAME" value="data/corpora/Bible/Bible.lemma" />
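
Lemmatisation is a directed mapping: every word form points to one base form, never the other way round. The Python sketch below shows the idea with an in-memory dictionary. The format of the Bible.lemma file is not described in these slides, so the dictionary and its entries are purely hypothetical stand-ins.

```python
def lemmatise(tokens, lemma_of):
    """Replace each token by its base form; unknown tokens stay unchanged."""
    return [lemma_of.get(tok, tok) for tok in tokens]

# Hypothetical word form -> base form entries, for illustration only.
LEMMAS = {"created": "create", "creates": "create", "heavens": "heaven"}

result = lemmatise(["God", "created", "the", "heavens"], LEMMAS)
print(result)
```

Tokens missing from the lexicon pass through untouched, so the quality of the result depends directly on how much of the corpus the lexicon covers.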

Preprocessing – 5) Synonym handling
• Hint:
– boolReplaceSynonyms is switched on with the value true
– Synonyms can be configured by <property name="SYNONYMS_FILE_NAME" value="data/corpora/Bible/Bible.syns" />
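
Unlike lemmatisation, synonymy has no built-in direction, so a replacement step has to pick one representative per group of synonyms. The Python sketch below groups synonym pairs and uses the alphabetically smallest member as the representative. That choice is an assumption made for this example, not TRACER's documented behaviour, and the format of the Bible.syns file is not covered here.

```python
def build_canonical_map(synonym_pairs):
    """Map every word to one representative of its synonym group."""
    parent = {}

    def find(word):
        parent.setdefault(word, word)
        while parent[word] != word:
            parent[word] = parent[parent[word]]  # shorten the path as we go
            word = parent[word]
        return word

    for a, b in synonym_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            # Keep the alphabetically smaller root so pair order does not matter.
            small, large = sorted((root_a, root_b))
            parent[large] = small
    return {word: find(word) for word in list(parent)}

def replace_synonyms(tokens, canonical):
    """Replace each token by its group representative, if it has one."""
    return [canonical.get(tok, tok) for tok in tokens]

# Invented pairs, for illustration only.
pairs = [("heaven", "sky"), ("sky", "firmament"), ("start", "beginning")]
canonical = build_canonical_map(pairs)
replaced = replace_synonyms(["the", "sky", "at", "the", "start"], canonical)
print(replaced)
```

Because groups are merged transitively, one loosely chosen pair can fuse two unrelated groups, which is worth keeping in mind when judging whether synonym replacement makes sense for a dataset.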

Preprocessing – 6) String similarity for normalising variants
• Hint:
– boolReplaceStringSimilarWords is switched on with the value true
– Thresholds:
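
The general idea of threshold-based variant normalisation can be sketched in Python as follows. The similarity measure (edit distance scaled by the longer word), the threshold value of 0.8 and all function names are assumptions made for this example. The thresholds TRACER actually uses are not reproduced in these slides.

```python
def edit_distance(a, b):
    """Levenshtein distance computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def similarity(a, b):
    """Similarity in [0, 1]; 1.0 means identical strings."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest

def normalise_variants(tokens, vocabulary, threshold=0.8):
    """Replace a token by its most similar vocabulary word if it is similar enough."""
    result = []
    for tok in tokens:
        best = max(vocabulary, key=lambda v: similarity(tok, v), default=None)
        if tok not in vocabulary and best is not None \
                and similarity(tok, best) >= threshold:
            result.append(best)
        else:
            result.append(tok)
    return result

# Invented vocabulary and tokens, for illustration only.
normalised = normalise_variants(["colour", "honour", "cat"], ["color", "honor"])
print(normalised)
```

Lowering the threshold merges more variants but also starts to merge unrelated short words, so the value needs to be tuned per corpus.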

Open issue: Fragmentary words

Open issue: Fragmentary words – dealing with gaps and Leiden Convention
Οὐιβίῳ Ἀλεξά̤[ν]δρῳ τῷ κρατίστῳ ἐπιστρατήγῳ παρὰ Ἀντ[ωνίου Δ]όμνά̤ου τοῦ καὶ Φιλαντι[νό]οά̤υά̤ Ἀντωνίοά̤[υ Ῥωμανο]ῦά̤ Τραιανείου τοῦ καά̤[ὶ Στρα]τά̤είου Ἀντινοέως. [οὐκ ἂν] εἰς τοῦτο προήχθά̤[η]νά̤, ἐά̤πι- τρόπων [μέγιστ]εά̤, μέ[τριος] καὶ ἀπρά̤γά̤μων ὢνά̤ ἄνθρά̤[ωπος,] εά̤ἰ μὴ [ὓβρι]ν τὴν μά̤[εγ]ίστηνά̤ ἐπά̤επόνθ[ειν ὑπὸ] Ὡρίωνο[ς κ]ωά̤μογρα[μ]μά̤ατέως Φ[ι]λαδελφείά̤[ας τῆ]ς Ἡρακλεά̤ίά̤δου μερίδοά̤[ς] τά̤οῦ Ἀρά̤σινοίτου. [οὗ χά]ριν μην[ύ]ω παρὰ τ[ὰ ἀ]πει- ρημένα ἑαά̤[υτὸ]νά̤ ἐνσείσανά̤τα εἰς τὴν κωμο- γραμματείανά̤ [μ]ήτε σιτολογήσαντα μήτε πρά̤[α]κτορεύσαντά̤α παντελῶς ἄπορον ὄν[τ]αά̤. δι᾽ ἣά̤ν αἰτίαν κά̤αὶ πρότερον οὐ διέλιπον ἐντυγ- χά̤νων καὶ νῦά̤ν ἀξιῶ, ἐάν σου τῇ τύχῃ δόξ[ῃ], ἀκοῦσά̤αί μου π[ρ]ὸς αὐτὸν πρὸς τὸ τυχεῖν με τά̤ῆά̤ςά̤ ἀπὸ σοῦ [μι]σοπονήρου ἐγδ[ι]κίας, ἵν᾽ ὦ ὑπὸά̤ [σ]οά̤ῦά̤ κατὰά̤ πά̤άντα βά̤εά̤βοηθ(ημένος). διευτύχει Ἀντώνιος Δόμνά̤οά̤ς ἐπιδέδωκα.

Gap between knowledge and experience

Test questions
• Statement:
– "My lemmatisation tool <XYZ> is able to compute the base forms of 80% of all tokens in a corpus."
Good or bad?
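
One way to probe such a statement is to measure coverage twice: once over tokens and once over word types. The Python sketch below does this for a toy text. The function name, the token list and the set of known forms are invented for illustration, and the numbers say nothing about any real lemmatiser.

```python
from collections import Counter

def lexicon_coverage(tokens, known_forms):
    """Return (token coverage, type coverage) of a set of known word forms."""
    counts = Counter(tokens)
    covered_tokens = sum(c for w, c in counts.items() if w in known_forms)
    covered_types = sum(1 for w in counts if w in known_forms)
    return covered_tokens / len(tokens), covered_types / len(counts)

# Toy text: two very frequent words and two rare ones.
toy = ["the"] * 6 + ["and"] * 2 + ["lamb", "shepherd"]
token_cov, type_cov = lexicon_coverage(toy, {"the", "and"})
print(token_cov, type_cov)
```

In this toy case a lexicon that knows only the two most frequent words already covers 80% of the tokens but only half of the word types, and the uncovered types are the rare, content-bearing ones.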

Test questions
• Fact file:
– Language variants
– Different writing styles
– (some) Dialects
– Diacritics
– OCR errors
• Question: What is the difference for you?

Test questions
• Fact file:
– Language variants
– Different writing styles
– (some) Dialects
– Diacritics
– OCR errors
• Question: What do you think is the difference for the computer?

Importance of preprocessing
• Cleaning and harmonising the data
• When working with a new corpus (not only a new language, but also the same language in a different epoch or geographical region), preprocessing can take up to 70% of the overall time.
• Preprocessing mantra: Garbage in, garbage out.

Thank you!
"Stealing from one is plagiarism, stealing from many is research" (Wilson Mizner, 1876-1933)
Visit us at http://etrap.gcdh.de
