
New Issues in Near-duplicate Detection
Martin Potthast and Benno Stein
Bauhaus University Weimar, Web Technology and Information Systems
Outline: Introduction, Taxonomy of Algorithms, Algorithms, Evaluation Corpus, Summary
GfKl’07, Mar. 7th, 2007


  1. Title slide: New Issues in Near-duplicate Detection. Martin Potthast and Benno Stein, Bauhaus University Weimar, Web Technology and Information Systems. GfKl’07, Mar. 7th, 2007.

  2. Motivation
     About 30% of the Web is redundant. [Fetterly 03, Broder 06]
     Content redundancy occurs in various forms [Zobel 06]:
     ❑ Mirrors
     ❑ Crawl artifacts, such as the same text with a different date or a different advertisement, available through multiple URLs
     ❑ Versions created for different delivery mechanisms (HTML, PDF, etc.)
     ❑ Annotated and unannotated copies of the same document
     ❑ Policies and procedures for the same purpose in different legislatures
     ❑ “Boilerplate” text such as license agreements or disclaimers
     ❑ Shared context such as summaries of other material or lists of links
     ❑ Syndicated news articles delivered in different venues
     ❑ Revisions and versions
     ❑ Reuse and republication of text (legitimate and otherwise)

  3. Motivation (continued)
     The forms of redundancy listed above yield nearly exact copies and modified copies with high similarity.
     ➜ Near-duplicate documents.

  4. Motivation
     Contributions of near-duplicate detection to real-world tasks:
     ❑ Index size reduction
     ❑ Search result cleaning
     ❑ Web crawl prioritization
     ❑ Plagiarism analysis
     Our contributions to near-duplicate detection:
     ❑ Classification of near-duplicate detection algorithms
     ❑ Presentation of a new tailored corpus for evaluation
     ❑ Comparison of current algorithms (including so far unconsidered hashing technologies)

  5. Formalization
     Consider a set of documents $D$. Given a document $d_q$: find all documents $D_q \subset D$ with a high similarity to $d_q$.
     ➜ Naive approach: compare $d_q$ with each $d \in D$.
     In detail: construct document models for $D$ and $d_q$, obtaining $\mathbf{D}$ and $\mathbf{d}_q$, and employ a similarity function $\varphi : \mathbf{D} \times \mathbf{D} \to [0, 1]$.
     ❑ Near-duplicate detection algorithms rely on purposefully constructed document models, called fingerprints.
     ❑ A fingerprint is a set of $k$ natural numbers, which are computed on the basis of document extracts.
     ❑ Two documents are considered duplicates if their fingerprints share at least $k_d$ numbers, $k_d < k$.
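The duplicate criterion and the naive search on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names, the document ids, and the toy fingerprints are hypothetical.

```python
def is_near_duplicate(fp_a, fp_b, k_d):
    """Two fingerprints count as duplicates if they share at least k_d numbers."""
    return len(fp_a & fp_b) >= k_d


def find_near_duplicates(query_fp, collection, k_d):
    """Naive approach: compare the query fingerprint with every document.

    `collection` maps a (hypothetical) document id to its fingerprint,
    which is a set of natural numbers.
    """
    return [doc_id for doc_id, fp in collection.items()
            if is_near_duplicate(query_fp, fp, k_d)]


# Toy fingerprints with k = 3 numbers each (made-up values).
collection = {
    "d1": {351427, 125497, 77},
    "d2": {351427, 125497, 99},
    "d3": {1, 2, 3},
}
print(find_near_duplicates({351427, 125497, 42}, collection, k_d=2))
# prints ['d1', 'd2']
```

The naive search costs one set intersection per document in the collection, which is why the choice of fingerprint construction matters at Web scale.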

  6. Taxonomy of Fingerprinting Algorithms
     Fingerprinting methods: chunking-based fingerprinting and similarity hashing.
     Chunking: $k$ chunks are selected from a document $d$.
     $d$ ➜ (σ) chunks $c_1, c_2$ ➜ hashcodes $p_1 = h(c_1) = 351427$, $p_2 = h(c_2) = 125497$ ➜ fingerprint $F_d = \{p_1, p_2\} = \{351427, 125497\}$
     Chunks are also called n-grams or shingles.
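The chunking pipeline (document, chunk selection σ, hashcodes, fingerprint) might look as follows in Python. This is only a sketch under stated assumptions: word 3-grams as chunks, "every second chunk" as the selection heuristic σ, and a truncated MD5 digest as the hash function h are all choices made here for illustration, not details given on the slide.

```python
import hashlib


def chunks(text, n=3):
    """Split a document into overlapping word n-grams (chunks, shingles)."""
    words = text.lower().split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def h(chunk):
    """Map a chunk to a natural number (first 4 bytes of its MD5 digest)."""
    digest = hashlib.md5(chunk.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")


def select(all_chunks, step=2):
    """Selection heuristic sigma: keep every step-th chunk (an assumption)."""
    return all_chunks[::step]


def fingerprint(text, n=3, step=2):
    """F_d = { h(c) : c is a selected chunk of d }."""
    return {h(c) for c in select(chunks(text, n), step)}


print(chunks("a b c d e"))       # ['a b c', 'b c d', 'c d e']
print(fingerprint("a b c d e"))  # two hashcodes, one per selected chunk
```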

  7. Taxonomy of Fingerprinting Algorithms
     Fingerprinting methods and their algorithms:
     ❑ Chunking-based fingerprinting
        ❑ All chunks: n-gram model
        ❑ Collection-specific: rare chunks, SPEX, I-Match
        ❑ Local, synchronized: shingling, prefix anchors, hashed breakpoints, winnowing
        ❑ Local, (pseudo-)random: random, n-th chunk
        ❑ Cascading: super-, megashingling
     ❑ Similarity hashing
     Chunking: $k$ chunks are selected from a document $d$.
     Selection heuristics:
     ❑ all
     ❑ based on knowledge about $D$
     ❑ intelligent random choices

  8. Taxonomy of Fingerprinting Algorithms
     Similarity hashing: $k$ particular hash functions $h_\varphi : \mathbf{D} \to U$, $U \subset \mathbb{N}$, with the property
     $h_\varphi(\mathbf{d}) = h_\varphi(\mathbf{d}_q) \Rightarrow \varphi(\mathbf{d}, \mathbf{d}_q) \ge 1 - \varepsilon$ with $\mathbf{d} \in \mathbf{D}$, $0 < \varepsilon \ll 1$,
     are used to generate $k$ hashcodes for a document $d$.

  9. Taxonomy of Fingerprinting Algorithms
     The similarity hashing branch of the taxonomy:
     ❑ Knowledge-based: fuzzy-fingerprinting
     ❑ Randomized: locality-sensitive hashing
     Hash function construction: domain knowledge vs. randomization.

  10. Taxonomy of Fingerprinting Algorithms
     For algorithms in the upper box of the taxonomy diagram, fingerprints have to share more than one number, $k_d > 1$, to be recognized as duplicates.
     For algorithms in the lower box, fingerprints need to share only one number, $k_d = 1$, to be recognized as duplicates.

  11. (Cascaded) Chunking
     (Super-)Shingling (SSh) [Broder 97]
     ➜ n-gram vector space model: ($w_1 w_2 w_3$, $w_2 w_3 w_4$, $w_3 w_4 w_5$, ..., $w_{m-2} w_{m-1} w_m$)
     ➜ hash value computation: (12354, 15695, 59634, ..., 43586)
     ➜ fingerprint (random choice): {12354, 15695, ..., 55476}

  12. (Cascaded) Chunking
     (Super-)Shingling (SSh) [Broder 97]
     Shingling:
     ➜ n-gram vector space model: ($w_1 w_2 w_3$, $w_2 w_3 w_4$, $w_3 w_4 w_5$, ..., $w_{m-2} w_{m-1} w_m$)
     ➜ hash value computation: (12354, 15695, 59634, ..., 43586)
     ➜ fingerprint (random choice): {12354, 15695, ..., 55476}
     Super-shingling (cascade):
     ➜ string representation of the fingerprint: "12354 15695 ... 55476", which is fed into the shingling steps again.
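The cascade can be sketched as two passes of the same shingling routine. The sketch below rests on several assumptions that the slide does not state: the function names are hypothetical, a truncated SHA-1 digest stands in for the hash function, and the "random choice" is replaced by keeping the $k$ smallest hash values, a deterministic stand-in under which similar documents tend to select the same shingles.

```python
import hashlib


def h(text):
    """Map a string to a natural number (first 4 bytes of its SHA-1 digest)."""
    return int.from_bytes(hashlib.sha1(text.encode("utf-8")).digest()[:4], "big")


def shingle_fingerprint(tokens, n, k):
    """Hash all n-grams of `tokens` and keep k of the hash values."""
    grams = [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    hashes = sorted({h(g) for g in grams})
    # Assumption: the k smallest values replace the slide's "random choice".
    return hashes[:k]


def super_shingle_fingerprint(text, n=3, k=8, n_super=2):
    """Shingle the document, then shingle the fingerprint's string form."""
    shingles = shingle_fingerprint(text.lower().split(), n, k)
    # Cascade: the fingerprint "12354 15695 ..." becomes the new token sequence.
    tokens = [str(s) for s in shingles]
    return set(shingle_fingerprint(tokens, n_super, k))


doc = "one two three four five six seven eight nine ten eleven twelve"
print(super_shingle_fingerprint(doc))
```

Because each super-shingle covers several shingles, two documents that agree on a single super-shingle already agree on a run of shingles, which fits the $k_d = 1$ criterion for cascading methods.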

  13. Similarity Hashing
     Fuzzy-Fingerprinting (FF) [Stein 05]
     A priori probabilities of prefix classes in the BNC (British National Corpus) and distribution of prefix classes in the sample
     ➜ normalization and difference computation
     ➜ fuzzification
     ➜ fingerprint {213235632, 157234594}
     All words having the same prefix belong to the same prefix class.
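A toy version of these steps is sketched below. Almost every concrete choice in it is an assumption made for illustration and is not taken from the slide or from [Stein 05]: prefix classes are reduced to the first letter of a word, uniform a priori probabilities stand in for the BNC-derived values, each deviation is mapped to one of three fuzzy labels, and two threshold settings produce the two numbers of the fingerprint.

```python
from collections import Counter
import string

# Assumption: one prefix class per first letter, with uniform a priori
# probabilities standing in for the BNC-derived values.
PREFIX_CLASSES = string.ascii_lowercase
A_PRIORI = {c: 1 / len(PREFIX_CLASSES) for c in PREFIX_CLASSES}


def deviations(text):
    """Normalized prefix-class frequencies minus their a priori probabilities."""
    words = [w for w in text.lower().split() if w[0] in A_PRIORI]
    counts = Counter(w[0] for w in words)
    total = len(words) or 1  # avoid division by zero for empty input
    return [counts[c] / total - A_PRIORI[c] for c in PREFIX_CLASSES]


def fuzzify(delta, threshold):
    """Map a deviation to a label: 0 = below, 1 = as expected, 2 = above."""
    if delta < -threshold:
        return 0
    if delta > threshold:
        return 2
    return 1


def fuzzy_fingerprint(text, thresholds=(0.02, 0.05)):
    """One hashcode per (hypothetical) fuzzification scheme."""
    devs = deviations(text)
    fingerprint = set()
    for t in thresholds:
        code = 0
        for d in devs:
            code = code * 3 + fuzzify(d, t)  # base-3 encoding of the labels
        fingerprint.add(code)
    return fingerprint


print(fuzzy_fingerprint("the quick brown fox jumps over the lazy dog"))
```

Since only prefix-class frequencies enter the computation, reordering the words or making small edits leaves most fuzzy labels unchanged, which is what makes the hashcode tolerant to near-duplicate variation.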

  14. Similarity Hashing
     Locality-Sensitive Hashing (LSH) [Indyk and Motwani 98, Datar et al. 04]
     Vector space with sample document $d$ and random vectors $a_1, a_2, \ldots, a_r$
     ➜ dot product computation $a_i^T \cdot d$
     ➜ real number line
     ➜ fingerprint {213235632}
     The results of the $r$ dot products are summed.
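One hash function of this kind could be sketched as follows. The slide gives only the outline (random vectors, dot products, real number line, sum), so the details here are assumptions: Gaussian random vectors, a random offset per vector, and division of the real number line into buckets of fixed width follow the general scheme of p-stable LSH, and the names `make_lsh` and `bucket_width` are hypothetical.

```python
import math
import random


def make_lsh(dim, r, bucket_width=1.0, seed=0):
    """Build one hash function from r random vectors a_1, ..., a_r."""
    rng = random.Random(seed)
    vectors = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(r)]
    offsets = [rng.uniform(0.0, bucket_width) for _ in range(r)]

    def h(d):
        total = 0
        for a, b in zip(vectors, offsets):
            dot = sum(a_j * d_j for a_j, d_j in zip(a, d))  # a_i^T . d
            # Map the dot product to a bucket index on the real number line.
            total += math.floor((dot + b) / bucket_width)
        return total  # the r results are summed into one hashcode

    return h


h = make_lsh(dim=3, r=4)
print(h([1.0, 0.0, 2.0]))
```

A fingerprint with $k$ numbers would use $k$ such functions built from different seeds. Vectors that lie close together fall into the same buckets with high probability, so a single shared hashcode already indicates a near-duplicate.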

  15. Evaluation Corpus
     Wikipedia snapshot including all revisions
     Existing standard corpora (TREC, Reuters) are not suited for large-scale evaluations of near-duplicate detection algorithms.
     Wikipedia is a rich resource of versioned and revisioned documents.
     Benchmark data:
     ❑ approx. 6 million pages (documents)
     ❑ approx. 80 million revisions
     ❑ XML file of approx. 1 TB
