
TRACER TUTORIAL: TEXT REUSE DETECTION FEATURING



  1. TRACER TUTORIAL: TEXT REUSE DETECTION FEATURING Marco Büchler, Emily Franzini and Greta Franzini

  2. TABLE OF CONTENTS 1. What is featuring? 2. Featuring techniques 3. Hacking 4. Conclusion and revision 2/27

  3. REMINDER: CURRENT APPROACH 3/27

  4. WHAT IS FEATURING?

  5. QUESTION What do you associate with featuring? 5/27

  6. A VISUALISATION OF FEATURING From biometry: 6/27

  7. SOME VOCABULARY 7/27

  8. FEATURING 8/27

  9. FEATURING TECHNIQUES

  10. FEATURING: AN EXAMPLE
      V_1 = {s_1, s_2, s_3, s_4, s_5}
      12 Features
      V_2 = {A, B, ..., J, K}
      s_1: A B C D E
      s_2: A C E F G
      s_3: G F A C D
      s_4: C F A G E
      s_5: D H I J K
      10/27

  11. FEATURING: MATRIX STYLE
      s_1: A B C D E
      s_2: A C E F G
      s_3: G F A C D
      s_4: C F A G E
      s_5: D H I J K

             A C D F G E B H I J K
      s_1    1 1 1 0 0 1 1 0 0 0 0
      s_2    1 1 0 1 1 1 0 0 0 0 0
      s_3    1 1 1 1 1 0 0 0 0 0 0    = M
      s_4    1 1 0 1 1 1 0 0 0 0 0
      s_5    0 0 1 0 0 0 0 1 1 1 1
      11/27
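
The step from the feature lists to the matrix M can be sketched in a few lines of Python. This is only an illustration of the slide's example, not TRACER code; the names `units`, `columns` and `feature_matrix` are made up for this sketch, and the column order is copied from the slide.

```python
# Reuse units and their features, copied from the slide's example.
units = {
    "s1": ["A", "B", "C", "D", "E"],
    "s2": ["A", "C", "E", "F", "G"],
    "s3": ["G", "F", "A", "C", "D"],
    "s4": ["C", "F", "A", "G", "E"],
    "s5": ["D", "H", "I", "J", "K"],
}

# Column order as shown on the slide (an arbitrary but fixed ordering).
columns = list("ACDFGEBHIJK")


def feature_matrix(units, columns):
    """Map each reuse unit to a 0/1 row: 1 if the unit contains the feature."""
    return {
        name: [1 if col in features else 0 for col in columns]
        for name, features in units.items()
    }


M = feature_matrix(units, columns)

print("   ", " ".join(columns))
for name, row in M.items():
    print(name, " ".join(str(cell) for cell in row))
```

Each row has exactly five ones because every reuse unit in the example has five distinct features.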

  12. HACKING: CONFIGURATION 12/27

  13. HACKING

  14. HACKING Tasks:
      • Run on your own texts ...
        1. ... n-gram shingling with n=2, 3
        2. ... words as features
      14/27
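
To get a feel for what the two featuring techniques in these tasks produce, here is a minimal Python sketch. It is not TRACER's implementation: the function names `word_features` and `shingles` are hypothetical, and lowercasing plus whitespace tokenisation is a simplifying assumption.

```python
def word_features(text):
    """Words as features: every whitespace-separated token is one feature."""
    return text.lower().split()


def shingles(text, n):
    """n-gram shingling: every run of n consecutive tokens is one feature."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


sentence = "In the beginning God created the heaven and the earth"

print(word_features(sentence))
print(shingles(sentence, 2))  # bigrams
print(shingles(sentence, 3))  # trigrams
```

A reuse unit with k tokens yields k - n + 1 shingles, so units shorter than n tokens yield no features at all in this sketch.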

  15. HACKING Questions: • Run the aforementioned tasks. Compare the resulting "tail distributions" (in the featuring folder you'll find all this information in e.g. KJV.meta). • Compare the .train files of n-gram shingling with those of hash-breaking and of words as features (use Excel or OpenOffice to open the .train file; sort by columns B and C). 15/27

  16. CONFIGURING THE TRAINING IMPL PARAMETER Hint: The configuration file can be found in: $TRACER_HOME/conf/tracer_conf.xml 16/27

  17. CONFIGURING THE TRAINING IMPL PARAMETER Hint:
      • The configuration file can be found in: $TRACER_HOME/conf/tracer_conf.xml
      • eu.etrap.tracer.featuring.syntactical.shingle.TriGramShinglingTrainingImpl
      • eu.etrap.tracer.featuring.syntactical.shingle.BiGramShinglingTrainingImpl
      • eu.etrap.tracer.featuring.semantic.WordBasedTrainingImpl
      17/27

  18. GAP BETWEEN KNOWLEDGE AND EXPERIENCE 18/27

  19. CONCLUSION AND REVISION

  20. CHECK Question: How does the number of features change with the changing feature size (e.g. bigrams, trigrams)? 20/27
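
One way to explore this question empirically is to count the distinct features (feature types) for different values of n. The sketch below does this on a tiny invented corpus; the sentences, the name `count_feature_types` and the whitespace tokenisation are all assumptions made for illustration, and results on real texts will differ.

```python
def count_feature_types(units, n):
    """Count distinct n-gram features over all reuse units (n=1 means words)."""
    types = set()
    for unit in units:
        tokens = unit.lower().split()
        for i in range(len(tokens) - n + 1):
            types.add(" ".join(tokens[i:i + n]))
    return len(types)


# Invented toy corpus, only meant to make the script runnable.
toy_units = [
    "the cat sat on the mat",
    "the cat lay on the mat",
    "the dog sat on the rug",
]

for n in (1, 2, 3):
    print(f"n={n}: {count_feature_types(toy_units, n)} distinct features")
```

Running the same loop over your own texts gives numbers you can compare with the information in the featuring folder.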

  21. CHECK
             A C D F G E B H I J K
      s_1    1 1 1 0 0 1 1 0 0 0 0
      s_2    1 1 0 1 1 1 0 0 0 0 0
      s_3    1 1 1 1 1 0 0 0 0 0 0    = M
      s_4    1 1 0 1 1 1 0 0 0 0 0
      s_5    0 0 1 0 0 0 0 1 1 1 1
      Question: What is the Digital Fingerprint of a reuse unit?
      21/27

  22. CHECK
             A C D F G E B H I J K
      s_1    1 1 1 0 0 1 1 0 0 0 0
      s_2    1 1 0 1 1 1 0 0 0 0 0
      s_3    1 1 1 1 1 0 0 0 0 0 0    = M
      s_4    1 1 0 1 1 1 0 0 0 0 0
      s_5    0 0 1 0 0 0 0 1 1 1 1
      Question: How does preprocessing influence F?
      22/27

  23. CHECK
             A C D F G E B H I J K
      s_1    1 1 1 0 0 1 1 0 0 0 0
      s_2    1 1 0 1 1 1 0 0 0 0 0
      s_3    1 1 1 1 1 0 0 0 0 0 0    = M
      s_4    1 1 0 1 1 1 0 0 0 0 0
      s_5    0 0 1 0 0 0 0 1 1 1 1
      Question: How can you compute the feature frequency?
      23/27
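
One plausible reading of "feature frequency" is the number of reuse units that contain a feature, which corresponds to the column sums of M. The Python sketch below works under that assumption; the variable names are invented for this example and the matrix is copied from the slide.

```python
# Columns and rows of M as shown on the slide.
columns = list("ACDFGEBHIJK")
M = [
    [1, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0],  # s1
    [1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0],  # s2
    [1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0],  # s3
    [1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0],  # s4
    [0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1],  # s5
]

# zip(*M) transposes the matrix, so each tuple is one column of M.
feature_frequency = {
    feature: sum(column) for feature, column in zip(columns, zip(*M))
}

for feature, freq in feature_frequency.items():
    print(feature, freq)
```

The frequencies add up to 25, the total number of ones in M (five reuse units with five features each).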

  24. IMPORTANCE OF FEATURING
      • Featuring defines the unit to measure similarity.
      • Most featuring techniques "generate" a power-law distribution:
        • A few features occur very often;
        • At least 50% of all features occur just once;
        • Most features are rare.
      24/27
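
To check claims like these on your own data, you can tabulate how many features occur once, twice, and so on. The sketch below does so for the five-unit example from the earlier slides; `frequency_spectrum` and `hapax_share` are names invented here, and the toy example is far too small to show a real power-law shape.

```python
from collections import Counter

# Feature lists of the five reuse units from the earlier example.
units = [
    ["A", "B", "C", "D", "E"],
    ["A", "C", "E", "F", "G"],
    ["G", "F", "A", "C", "D"],
    ["C", "F", "A", "G", "E"],
    ["D", "H", "I", "J", "K"],
]

# How often does each feature occur across all units?
feature_counts = Counter(feature for unit in units for feature in unit)

# How many features share each frequency (frequency -> number of features)?
frequency_spectrum = Counter(feature_counts.values())

# Share of features that occur exactly once.
hapax_share = frequency_spectrum[1] / len(feature_counts)

for freq in sorted(frequency_spectrum):
    print(f"{frequency_spectrum[freq]} feature(s) occur {freq} time(s)")
print(f"share of features occurring once: {hapax_share:.2f}")
```

On a real corpus the same tabulation can be compared against the tail distribution that the slides refer to.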

  25. FINITO! 25/27

  26. CONTACT Team: Marco Büchler, Greta Franzini and Emily Franzini. Visit us: http://www.etrap.eu contact@etrap.eu 26/27

  27. LICENCE The theme this presentation is based on is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. Changes to the theme are the work of eTRAP. 27/27
