TRACER TUTORIAL: TEXT REUSE DETECTION
FEATURING
Marco Büchler, Emily Franzini and Greta Franzini
TABLE OF CONTENTS
1. What is featuring?
2. Featuring techniques
3. Hacking
4. Conclusion and revision
REMINDER: CURRENT APPROACH
WHAT IS FEATURING?
QUESTION
What do you associate with featuring?
A VISUALISATION OF FEATURING
From biometry: [figure]
SOME VOCABULARY
FEATURING
FEATURING TECHNIQUES
FEATURING: AN EXAMPLE
V1 = s1, s2, s3, s4, s5
Features V2 = A, B, ..., J, K
s1: A B C D E
s2: A C E F G
s3: G F A C D
s4: C F A G E
s5: D H I J K
FEATURING: MATRIX STYLE
s1: A B C D E
s2: A C E F G
s3: G F A C D
s4: C F A G E
s5: D H I J K
M =
     A  C  D  F  G  E  B  H  I  J  K
s1   1  1  1  0  0  1  1  0  0  0  0
s2   1  1  0  1  1  1  0  0  0  0  0
s3   1  1  1  1  1  0  0  0  0  0  0
s4   1  1  0  1  1  1  0  0  0  0  0
s5   0  0  1  0  0  0  0  1  1  1  1
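The step from the feature lists to the matrix M can be sketched in a few lines of Python. This is only an illustration of the idea, not TRACER's own code; the names `units`, `columns` and `build_matrix` are made up for this sketch, and the column order is copied from the slide.

```python
# Reuse units and their features, copied from the example.
units = {
    "s1": ["A", "B", "C", "D", "E"],
    "s2": ["A", "C", "E", "F", "G"],
    "s3": ["G", "F", "A", "C", "D"],
    "s4": ["C", "F", "A", "G", "E"],
    "s5": ["D", "H", "I", "J", "K"],
}

# Column order as used in the matrix M on the slide.
columns = ["A", "C", "D", "F", "G", "E", "B", "H", "I", "J", "K"]


def build_matrix(units, columns):
    """Return one 0/1 row per reuse unit: 1 if the unit contains the feature."""
    return {
        name: [1 if feature in features else 0 for feature in columns]
        for name, features in units.items()
    }


M = build_matrix(units, columns)
for name, row in M.items():
    print(name, *row)
```

Each row only records whether a feature is present, so the order of features inside a reuse unit (compare s2 and s4) does not matter.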
HACKING: CONFIGURATION
HACKING
HACKING
Tasks:
• Run on your own texts ...
  1. ... n-gram shingling with n=2, 3
  2. ... words as features
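To get a feeling for what the two task variants produce, here is a minimal Python sketch of word-level shingling. It is an assumption that shingles are built over whitespace-separated, lower-cased words; TRACER's own training classes may tokenise and normalise differently. The function names are invented for this sketch.

```python
def word_features(text):
    """Words as features: lower-cased whitespace tokens."""
    return text.lower().split()


def shingles(text, n):
    """Word n-gram shingling: every run of n consecutive words is one feature."""
    words = word_features(text)
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


sentence = "In the beginning God created the heaven and the earth"

print(word_features(sentence))  # words as features
print(shingles(sentence, 2))    # bigram shingling
print(shingles(sentence, 3))    # trigram shingling
```

A reuse unit with k words yields k - n + 1 shingles, so units shorter than n produce no features at all.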
HACKING
Questions:
• Run the aforementioned tasks. Compare the resulting "tail distributions" (in the featuring folder you'll find all this information in, e.g., KJV.meta).
• Compare the .train files of n-gram shingling with those of hash-breaking, and also with those of words as features (use Excel or OpenOffice to open the .train file; sort by columns B and C).
CONFIGURING THE TRAINING IMPL PARAMETER
Hint:
• The configuration file can be found in: $TRACER_HOME/conf/tracer_conf.xml
• eu.etrap.tracer.featuring.syntactical.shingle.TriGramShinglingTrainingImpl
• eu.etrap.tracer.featuring.syntactical.shingle.BiGramShinglingTrainingImpl
• eu.etrap.tracer.featuring.semantic.WordBasedTrainingImpl
GAP BETWEEN KNOWLEDGE AND EXPERIENCE
CONCLUSION AND REVISION
CHECK
Question: How does the number of features change with the changing feature size (e.g. bigrams, trigrams)?
CHECK
M =
     A  C  D  F  G  E  B  H  I  J  K
s1   1  1  1  0  0  1  1  0  0  0  0
s2   1  1  0  1  1  1  0  0  0  0  0
s3   1  1  1  1  1  0  0  0  0  0  0
s4   1  1  0  1  1  1  0  0  0  0  0
s5   0  0  1  0  0  0  0  1  1  1  1
Question: What is the Digital Fingerprint of a reuse unit?
Question: How does preprocessing influence F?
Question: How can you compute the feature frequency?
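One possible way to explore the first and last question is to read M row-wise and column-wise. The sketch below assumes that the fingerprint of a reuse unit is the set of features marked in its row, and that the feature frequency is the number of reuse units containing a feature (a column sum); both readings are assumptions of this sketch, and the function names are invented.

```python
# Matrix M from the slide: one row per reuse unit, one column per feature.
columns = ["A", "C", "D", "F", "G", "E", "B", "H", "I", "J", "K"]
M = {
    "s1": [1, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0],
    "s2": [1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0],
    "s3": [1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0],
    "s4": [1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0],
    "s5": [0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1],
}


def fingerprint(unit):
    """Features present in a reuse unit, read off its row of M."""
    return [feature for feature, bit in zip(columns, M[unit]) if bit]


def feature_frequency():
    """Number of reuse units containing each feature (column sums of M)."""
    return {
        feature: sum(row[j] for row in M.values())
        for j, feature in enumerate(columns)
    }


print(fingerprint("s1"))
print(feature_frequency())
```

Under this reading, s2 and s4 have identical fingerprints although their features appear in a different order.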
IMPORTANCE OF FEATURING
• Featuring defines the unit to measure similarity.
• Most featuring techniques "generate" a power-law distribution:
  • A few features occur very often;
  • At least 50% of all features occur just once;
  • Most features are rare.
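The shape of such a distribution can be inspected by counting how many features occur once, twice, and so on. The following Python sketch does this for a list of features; the toy sentence is invented and far too small to show a real power law, so run it on the features of your own corpus to check the claims above.

```python
from collections import Counter


def frequency_spectrum(features):
    """Map each frequency to the number of distinct features with that frequency."""
    return Counter(Counter(features).values())


def hapax_share(features):
    """Fraction of distinct features that occur exactly once."""
    counts = Counter(features)
    if not counts:
        return 0.0
    once = sum(1 for c in counts.values() if c == 1)
    return once / len(counts)


# Invented toy input: words as features of a single sentence.
tokens = "the cat sat on the mat and the dog sat on the rug".split()

print(frequency_spectrum(tokens))
print(hapax_share(tokens))
```

Here one feature ("the") is frequent, while five of the eight distinct features occur only once.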
FINITO!
CONTACT
Team: Marco Büchler, Greta Franzini and Emily Franzini.
Visit us: http://www.etrap.eu
contact@etrap.eu
LICENCE
The theme this presentation is based on is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. Changes to the theme are the work of eTRAP.