
A Pipeline for Scalable Text Reuse Analysis, Milad Alshomary



  1. A Pipeline for Scalable Text Reuse Analysis Milad Alshomary Bauhaus Universität 05.07.2018 Milad Alshomary Pipeline for TR extraction 05.07.2018 1

  2. Overview ● Motivation ● A Pipeline for Scalable Text Reuse Extraction ● Application on Wikipedia ● Application on Wikipedia and Common Crawl ● Conclusion

  3. Motivation Text Reuse (TR) ● Quoting ● Verbatim ● Paraphrasing ● Translation ● Summarization

  4. Motivation TR Detection Applications ● Plagiarism detection ● METER project (Measuring Text Reuse)


  7. Motivation Wikipedia vs The World ● Digital Encyclopedia ● Collaborative environment ● Giant public source of information ● Free to use


  11. Motivation Wikipedia vs The World Quality Flaws Scientific community

  12. Motivation Wikipedia vs The World - Web pages = Wikipedia text + advertisements

  13. Motivation Research Questions ➔ What kinds of text reuse occur within Wikipedia? ➔ How much of the web is a copy of Wikipedia content? ➔ How much revenue does this content generate?


  16. A Pipeline for Scalable Text Reuse Extraction

  17. A Pipeline for Scalable Text Reuse Extraction Text Reuse Pipeline ➔ Input: Two datasets (D1, D2) ➔ Output: Text reuse cases


  21. A Pipeline for Scalable Text Reuse Extraction Text Reuse Pipeline ● Text Preprocessing ➔ Content extraction ➔ Chunking ➔ Feature extraction ● Candidate Elimination ➔ Pairwise scan ➔ Text reuse heuristics ● Text Alignment ➔ Detailed scan of text reuse ➔ Picapica framework
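
The three stages can be sketched end to end on toy data. This is only an illustration under assumptions that are not in the slides: the chunk size, the Jaccard word-overlap heuristic, and all function names are hypothetical, and `difflib` stands in for the Picapica framework, whose API is not shown in the talk.

```python
from difflib import SequenceMatcher
from itertools import product


def chunk(text, size=50):
    """Preprocessing: split a document into chunks of `size` words (size is an assumption)."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def word_set(chunk_text):
    """Feature extraction: a set of lowercased words (a deliberately simple feature)."""
    return set(chunk_text.lower().split())


def is_candidate(a, b, threshold=0.3):
    """Candidate elimination: keep a pair only if its Jaccard word overlap reaches the threshold."""
    fa, fb = word_set(a), word_set(b)
    if not fa or not fb:
        return False
    return len(fa & fb) / len(fa | fb) >= threshold


def align(a, b, min_words=5):
    """Text alignment stand-in: longest common run of words via difflib (not Picapica)."""
    wa, wb = a.split(), b.split()
    matcher = SequenceMatcher(None, wa, wb, autojunk=False)
    m = matcher.find_longest_match(0, len(wa), 0, len(wb))
    return " ".join(wa[m.a:m.a + m.size]) if m.size >= min_words else None


def text_reuse_cases(d1, d2):
    """Run the three stages over two datasets given as {doc_id: text} dicts."""
    cases = []
    for (id1, t1), (id2, t2) in product(d1.items(), d2.items()):
        for c1, c2 in product(chunk(t1), chunk(t2)):
            if is_candidate(c1, c2):
                shared = align(c1, c2)
                if shared:
                    cases.append((id1, id2, shared))
    return cases
```

The nested loop over all document pairs is exactly the pairwise scan that does not scale, which is why the cheap `is_candidate` check runs before the more expensive alignment step.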

  22. A Pipeline for Scalable Text Reuse Extraction Candidate Elimination Keys for scaling up: ➔ Cluster computing ➔ Heuristics-based candidate elimination algorithms


  24. A Pipeline for Scalable Text Reuse Extraction Candidate Elimination Each pair of documents, one from D1 = {d11, d12, ..., d1n} and one from D2 = {d21, d22, ..., d2n}, is scored by a candidacy function: candidacy(d11, d21) → [0, 1]. For a candidacy function we proposed the following methods: - Cosine similarity of TF-IDF (semantic) - Paragraph embedding (semantic) - Stopword n-grams (structure) - Weighted average of stopword n-grams and paragraph embedding (semantic + structure)
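
A minimal sketch of such a candidacy function, using only the standard library. Several things here are assumptions rather than details from the talk: the tiny stopword list, the trigram size, the smoothed IDF formula, the 0.5 weight, and the function names. The weighted variant on the slide combines stopword n-grams with a paragraph embedding; since no embedding model is specified, TF-IDF cosine similarity is swapped in as the semantic part.

```python
import math
from collections import Counter

# Assumed, deliberately tiny stopword list for illustration only.
STOPWORDS = {"the", "of", "a", "in", "and", "to", "is", "was", "by", "on"}


def tfidf_vectors(docs):
    """Return one {term: tf-idf weight} dict per document, using a smoothed idf."""
    tokenized = [doc.lower().split() for doc in docs]
    n = len(tokenized)
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({t: c * (math.log((1 + n) / (1 + df[t])) + 1) for t, c in tf.items()})
    return vectors


def cosine(u, v):
    """Cosine similarity of two sparse vectors, clamped to [0, 1]."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(sum(w * w for w in v.values()))
    return min(1.0, dot / norm) if norm else 0.0


def stopword_ngrams(text, n=3):
    """Structural feature: n-grams over the stopword-only version of the text."""
    sw = [w for w in text.lower().split() if w in STOPWORDS]
    return {tuple(sw[i:i + n]) for i in range(len(sw) - n + 1)}


def stopword_ngram_similarity(a, b, n=3):
    """Jaccard overlap of the two stopword n-gram sets."""
    ga, gb = stopword_ngrams(a, n), stopword_ngrams(b, n)
    return len(ga & gb) / len(ga | gb) if ga | gb else 0.0


def candidacy(a, b, vec_a, vec_b, weight=0.5):
    """Weighted average of a structural and a semantic score, in [0, 1]."""
    return weight * stopword_ngram_similarity(a, b) + (1 - weight) * cosine(vec_a, vec_b)
```

Because both component scores lie in [0, 1] and the weights sum to 1, the combined score stays in the range the slide gives for candidacy.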

  27. A Pipeline for Scalable Text Reuse Extraction Candidate Elimination Generate TR sample from Wikipedia: - Sample 1k documents from Wikipedia - Use the Picapica framework for text alignment to find TR cases. TR sample: - 232 documents - ~90% have < 10 alignments (TR cases)

  28. A Pipeline for Scalable Text Reuse Extraction Candidate Elimination Evaluation of the “candidacy” function: - For each document in the TR sample: sort all Wikipedia articles according to the proposed “candidacy” score. - Precision/Recall at thresholds of [1, 101,..,100k] - A True Positive (TP) is a pair of documents that have TR.

  29. A Pipeline for Scalable Text Reuse Extraction Candidate Elimination Evaluation of the “candidacy” function: [Figure: precision (p1, p2) and recall (r1, r2) for the TR sample at thresholds T1, T2, T3]
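
The evaluation above can be sketched as precision and recall at rank cutoffs. The function name and the data layout (a ranked list of article ids plus a set of known TR partners for one sample document) are assumptions made for illustration.

```python
def precision_recall_at(ranked_ids, relevant_ids, thresholds):
    """For each cutoff T, score the top-T ranked articles against the known TR partners.

    ranked_ids:   article ids sorted by descending candidacy score
    relevant_ids: ids of articles that truly share text reuse with the query document
    thresholds:   rank cutoffs T at which to evaluate
    """
    relevant = set(relevant_ids)
    results = {}
    for t in thresholds:
        top = ranked_ids[:t]
        tp = sum(1 for doc_id in top if doc_id in relevant)  # true positives within the top T
        precision = tp / len(top) if top else 0.0
        recall = tp / len(relevant) if relevant else 0.0
        results[t] = (precision, recall)
    return results
```

Raising the cutoff can only keep recall the same or increase it, while precision typically drops, which is the trade-off the thresholds T1, T2, T3 are meant to expose.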

  30. A Pipeline for Scalable Text Reuse Extraction Candidate Elimination Semantic hashing function: - Hashes documents into binary hashes. - Similar documents get similar or identical binary hashes (e.g. 011001 and 011001).

  31. A Pipeline for Scalable Text Reuse Extraction Candidate Elimination Semantic hashing function: - Hash all documents in D1 and D2. - Build an inverted index. - Hash each document’s chunks. - Apply the candidacy function only to documents that intersect in at least one hash. [Figure: inverted index over hashes such as 001001, 011001, 001000]

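
The filtering step can be sketched with an inverted index from hash to documents. The sketch assumes the chunk hashes are already computed and passed in as bit strings; the function names and the `{doc_id: [hash, ...]}` layout are hypothetical.

```python
from collections import defaultdict


def build_inverted_index(chunk_hashes):
    """Map each binary hash to the set of documents that contain a chunk with that hash.

    chunk_hashes is {doc_id: [hash_of_chunk_1, hash_of_chunk_2, ...]}.
    """
    index = defaultdict(set)
    for doc_id, hashes in chunk_hashes.items():
        for h in hashes:
            index[h].add(doc_id)
    return index


def candidate_pairs(d1_hashes, d2_hashes):
    """Return (d1_doc, d2_doc) pairs that share at least one chunk hash."""
    index = build_inverted_index(d2_hashes)
    pairs = set()
    for doc_id, hashes in d1_hashes.items():
        for h in hashes:
            # Only documents filed under the same hash are ever compared.
            for other in index.get(h, ()):
                pairs.add((doc_id, other))
    return pairs
```

Only the returned pairs would then be passed to the candidacy function, so the cost depends on the number of hash collisions rather than on the full cross product of D1 and D2.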

  35. A Pipeline for Scalable Text Reuse Extraction Candidate Elimination Proposed semantic hashing methods: - Random Projection (data-independent) - Variational Deep Semantic Hashing (data-dependent)
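
A minimal sketch of the data-independent option, random projection hashing: each bit records on which side of a random hyperplane the document vector falls. It assumes documents are already represented as fixed-length numeric vectors (how they are vectorised is not specified here); the function names, the Gaussian hyperplanes, and the seed are illustrative choices.

```python
import random


def make_random_projection_hasher(dim, bits, seed=0):
    """Create a hash function that maps a `dim`-dimensional vector to a `bits`-long bit string."""
    rng = random.Random(seed)
    # One random hyperplane (normal vector) per output bit.
    hyperplanes = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(bits)]

    def hash_vector(vec):
        # Bit is 1 if the vector lies on the non-negative side of the hyperplane.
        return "".join(
            "1" if sum(h_i * x_i for h_i, x_i in zip(h, vec)) >= 0 else "0"
            for h in hyperplanes
        )

    return hash_vector


def hamming(a, b):
    """Number of positions at which two equal-length bit strings differ."""
    return sum(x != y for x, y in zip(a, b))
```

Vectors separated by a small angle fall on the same side of most hyperplanes, so their hashes differ in few bits. Variational Deep Semantic Hashing instead learns the hash function from the data and is not sketched here.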
