A Pipeline for Scalable Text Reuse Analysis Milad Alshomary Bauhaus Universität 05.07.2018 Milad Alshomary Pipeline for TR extraction 05.07.2018 1
Overview
● Motivation
● A Pipeline for Scalable Text Reuse Extraction
● Application on Wikipedia
● Application on Wikipedia and Common Crawl
● Conclusion
Motivation
Text Reuse (TR)
● Quoting
● Verbatim
● Paraphrasing
● Translation
● Summarization
Motivation
TR Detection Applications
● Plagiarism detection
● METER project (Measuring Text Reuse)
Motivation
Wikipedia vs The World
● Digital Encyclopedia
● Collaborative environment
● Giant public source of information
● Free to use
Motivation
Wikipedia vs The World
● Quality Flaws
● Scientific community
Motivation
Wikipedia vs The World
- Web pages = Wikipedia text + advertisements
Motivation
Research Questions
➔ What kinds of text reuse occur within Wikipedia?
➔ How much of the web is a copy of Wikipedia content?
➔ How much revenue does this content generate?
A Pipeline for Scalable Text Reuse Extraction
A Pipeline for Scalable Text Reuse Extraction
Text Reuse Pipeline
➔ Input: Two datasets (D1, D2)
➔ Output: Text reuse cases
A Pipeline for Scalable Text Reuse Extraction
Text Reuse Pipeline
Text Preprocessing
➔ Content extraction
➔ Chunking
➔ Feature extraction
Candidate Elimination
➔ Pairwise scan
➔ Text Reuse heuristics
Text Alignment
➔ Detailed scan of text reuse
➔ Picapica framework
A Pipeline for Scalable Text Reuse Extraction
Candidate Elimination
Pipeline stages: Text Preprocessing, Candidate Elimination, Text Alignment
Keys for scaling-up:
➔ Cluster computing
➔ Heuristics-based candidate elimination algorithms
A Pipeline for Scalable Text Reuse Extraction
Candidate Elimination
candidacy(d11, d21) → [0, 1], for documents d11, d12, ..., d1n in D1 and d21, d22, ..., d2n in D2
For the candidacy function we proposed the following methods:
- Cosine similarity of TF-IDF (semantic)
- Paragraph embedding (semantic)
- Stopwords N-grams (structure)
- Weighted average of Stopwords N-grams and Paragraph embedding (semantic + structure)
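Sketch (not from the slides): one way the first candidacy method, cosine similarity of TF-IDF vectors, could look. The function names, the smoothed IDF formula, the whitespace tokenization, and the toy documents are all assumptions made for illustration. The talk does not state which TF-IDF variant was used.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Build sparse TF-IDF vectors (dicts) for a list of tokenized documents."""
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # Smoothed IDF (an assumption) keeps every weight strictly positive.
        vectors.append({t: c * (math.log((1 + n_docs) / (1 + df[t])) + 1.0)
                        for t, c in tf.items()})
    return vectors


def candidacy(u, v):
    """Cosine similarity of two sparse vectors with non-negative weights."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical toy documents, tokenized on whitespace.
docs = [
    "the quick brown fox jumps over the lazy dog".split(),
    "the quick brown fox leaps over a lazy dog".split(),
    "stock markets fell sharply on monday".split(),
]
vecs = tfidf_vectors(docs)
sim_close = candidacy(vecs[0], vecs[1])  # near-duplicate pair
sim_far = candidacy(vecs[0], vecs[2])    # pair with no shared terms
```

Because all weights are non-negative, the score stays in [0, 1] up to floating-point rounding, matching the range of the candidacy function on the slide. A pair scoring near 0 would be dropped before the more expensive text alignment stage.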
A Pipeline for Scalable Text Reuse Extraction
Candidate Elimination
Generate TR Sample from Wikipedia:
- Sample 1k documents from Wikipedia
- Use the Picapica framework to find TR cases
Wikipedia ➔ Sample 1k documents ➔ Document Sample ➔ Text alignment using Picapica framework ➔ TR sample
TR sample:
- 232 documents
- ~90% have < 10 alignments (TR cases)
A Pipeline for Scalable Text Reuse Extraction
Candidate Elimination
Evaluation of the “candidacy” function:
- For each document in the TR sample:
  - Sort all Wikipedia articles according to the proposed “candidacy”.
  - Precision/Recall on thresholds of [1, 101,..,100k]
- A True Positive (TP) is a pair of documents that have TR.
[Figure: sorted article list cut at thresholds T1, T2, T3]
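Sketch (not from the slides): the evaluation above can be read as precision and recall at rank cutoffs. This assumes each threshold means "keep the top k articles of the sorted list", which the slide does not spell out. The function name and the toy ids are hypothetical.

```python
def precision_recall_at(ranked, relevant, thresholds):
    """Precision and recall when keeping only the top-k ranked candidates.

    ranked: candidate ids sorted by decreasing candidacy score.
    relevant: set of ids that truly share text reuse with the query document.
    """
    results = {}
    for k in thresholds:
        top_k = ranked[:k]
        # True positives: kept candidates that really have TR with the query.
        tp = sum(1 for doc_id in top_k if doc_id in relevant)
        precision = tp / len(top_k) if top_k else 0.0
        recall = tp / len(relevant) if relevant else 0.0
        results[k] = (precision, recall)
    return results


# Hypothetical ranking for one query document from the TR sample.
ranked = ["a", "b", "c", "d", "e", "f"]
relevant = {"a", "d"}
pr = precision_recall_at(ranked, relevant, thresholds=[1, 3, 6])
```

Averaging these per-document values over the TR sample gives one precision/recall pair per threshold, which is one plausible way to compare the proposed candidacy functions.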
A Pipeline for Scalable Text Reuse Extraction
Candidate Elimination
Evaluation of the “candidacy” function:
[Figure: TR sample with values p1, p2 and r1, r2 at thresholds T1, T2, T3]
A Pipeline for Scalable Text Reuse Extraction
Candidate Elimination
Semantic hashing function:
- Hashes documents into binary hashes.
- Similar documents get similar or identical binary hashes.
Example: two similar documents both hash to 011001.
A Pipeline for Scalable Text Reuse Extraction
Candidate Elimination
Semantic hashing function:
- Hash all documents.
- Build an inverted index.
- Hash each document’s chunks.
- Apply the candidacy function only to documents that intersect in at least one hash.
[Figure: documents of D1 and D2 mapped into an inverted index over hashes such as 001001, 011001, 001000]
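Sketch (not from the slides): how an inverted index over chunk hashes can limit the candidacy function to pairs that share at least one hash. The function names and the toy hash values are assumptions. The chunk hashes are taken as given here, whichever hashing method produced them.

```python
from collections import defaultdict


def build_inverted_index(doc_hashes):
    """Map each binary hash to the set of documents with a chunk that has it."""
    index = defaultdict(set)
    for doc_id, hashes in doc_hashes.items():
        for h in hashes:
            index[h].add(doc_id)
    return index


def candidate_pairs(d1_hashes, d2_hashes):
    """Return (d1_id, d2_id) pairs whose chunks share at least one hash."""
    index = build_inverted_index(d2_hashes)
    pairs = set()
    for d1_id, hashes in d1_hashes.items():
        for h in hashes:
            # Only documents indexed under the same hash become candidates.
            for d2_id in index.get(h, ()):
                pairs.add((d1_id, d2_id))
    return pairs


# Hypothetical chunk hashes for two small datasets.
d1 = {"d11": ["011001", "001000"], "d12": ["111111"]}
d2 = {"d21": ["011001"], "d22": ["001001"]}
pairs = candidate_pairs(d1, d2)
```

Only the surviving pairs would then be scored with the candidacy function, instead of all |D1| × |D2| combinations. Exact hash matching is the simplest reading of "intersect in at least one hash"; matching within a small Hamming distance would be a different design.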
A Pipeline for Scalable Text Reuse Extraction
Candidate Elimination
Proposed semantic hashing methods (for documents di, dj):
- Random Projection (data independent)
- Variational Deep Semantic Hashing (data dependent)
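Sketch (not from the slides): a common form of random projection hashing, where each bit is the sign of a dot product with a random Gaussian hyperplane. The dimensions, bit count, seed, and input vectors are assumptions. The talk does not give the exact variant or parameters.

```python
import random


def make_hyperplanes(dim, n_bits, seed=0):
    """Draw n_bits random Gaussian hyperplanes, independent of the data."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]


def random_projection_hash(vector, hyperplanes):
    """One bit per hyperplane: 1 if the vector lies on its non-negative side."""
    bits = []
    for plane in hyperplanes:
        dot = sum(p * x for p, x in zip(plane, vector))
        bits.append("1" if dot >= 0.0 else "0")
    return "".join(bits)


def hamming(a, b):
    """Number of differing bit positions between two equal-length hashes."""
    return sum(x != y for x, y in zip(a, b))


# Hypothetical 4-dimensional document vectors.
planes = make_hyperplanes(dim=4, n_bits=16, seed=42)
h_a = random_projection_hash([1.0, 0.5, 0.0, 0.2], planes)
h_scaled = random_projection_hash([2.0, 1.0, 0.0, 0.4], planes)
h_opposite = random_projection_hash([-1.0, -0.5, 0.0, -0.2], planes)
```

Vectors pointing in the same direction receive the same hash and opposite vectors differ in every bit. For vectors in between, the expected Hamming distance grows with the angle between them, which is why similar documents tend to get similar hashes.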