New Issues in Near-duplicate Detection

Martin Potthast and Benno Stein
Bauhaus University Weimar, Web Technology and Information Systems
GfKl’07, Mar. 7th, 2007

Outline:
❑ Introduction
❑ Taxonomy of Algorithms
❑ Algorithms
❑ Evaluation Corpus
❑ Summary
Motivation

About 30% of the Web is redundant. [Fetterly 03, Broder 06]

Content redundancy occurs in various forms: [Zobel 06]
❑ Mirrors
❑ Crawl artifacts, such as the same text with a different date or a different advertisement, available through multiple URLs
❑ Versions created for different delivery mechanisms (HTML, PDF, etc.)
❑ Annotated and unannotated copies of the same document
❑ Policies and procedures for the same purpose in different legislatures
❑ “Boilerplate” text such as license agreements or disclaimers
❑ Shared context such as summaries of other material or lists of links
❑ Syndicated news articles delivered in different venues
❑ Revisions and versions
❑ Reuse and republication of text (legitimate and otherwise)

Nearly exact copies and modified copies with high similarity.
➜ Near-duplicate documents.
Motivation

Contributions of near-duplicate detection to real-world tasks:
❑ Index size reduction
❑ Search result cleaning
❑ Web crawl prioritization
❑ Plagiarism analysis

Our contributions to near-duplicate detection:
❑ Classification of near-duplicate detection algorithms
❑ Presentation of a new tailored corpus for evaluation
❑ Comparison of current algorithms (including hashing technologies not considered so far)
Formalization

Consider a set of documents $D$. Given a document $d_q$: find all documents $D_q \subset D$ with a high similarity to $d_q$.
➜ Naive approach: compare $d_q$ with each $d \in D$.

In detail: construct document models for $D$ and $d_q$, obtaining $\mathbf{D}$ and $\mathbf{d}_q$. Employ a similarity function $\varphi : \mathbf{D} \times \mathbf{D} \to [0, 1]$.

❑ Near-duplicate detection algorithms rely on purposefully constructed document models, called fingerprints.
❑ A fingerprint is a set of $k$ natural numbers, which are computed on the basis of document extracts.
❑ Two documents are considered duplicates if their fingerprints share at least $k_d$ numbers, $k_d < k$.
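The duplicate criterion above can be sketched in a few lines. This is a minimal illustration, assuming fingerprints are already available as sets of integers; the function names and the dictionary of stored fingerprints are hypothetical and not taken from the slides.

```python
def shared_numbers(fp_a, fp_b):
    """Count the numbers that two fingerprints have in common."""
    return len(set(fp_a) & set(fp_b))


def is_near_duplicate(fp_a, fp_b, k_d):
    """Two documents count as duplicates if they share at least k_d numbers."""
    return shared_numbers(fp_a, fp_b) >= k_d


def find_near_duplicates(query_fp, fingerprints, k_d):
    """Naive scan: compare the query fingerprint with every stored one.

    `fingerprints` maps a document id to its fingerprint (hypothetical layout).
    """
    return [
        doc_id
        for doc_id, fp in fingerprints.items()
        if is_near_duplicate(query_fp, fp, k_d)
    ]
```

With `k_d = 2`, a query fingerprint `{1, 2, 3}` matches a stored fingerprint `{2, 3, 9}` but not `{3, 7, 8}`.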
Taxonomy of Fingerprinting Algorithms

Fingerprinting methods:
❑ Chunking-based fingerprinting
❑ Similarity hashing

Chunking: $k$ chunks are selected from a document $d$.

[Figure, labeled σ: document $d$ ➜ chunks $c_1, c_2$ ➜ hashcodes $p_1 = h(c_1)$, $p_2 = h(c_2)$ (351427 and 125497 in the example) ➜ fingerprint $F_d = \{p_1, p_2\}$ = {351427, 125497}]

Chunks are also called n-grams or shingles.
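The chunking pipeline (chunks, then hashcodes, then fingerprint) might look as follows. This is a sketch under stated assumptions: chunks are word n-grams, the hash function `chunk_hash` is a truncated SHA-1 chosen only for illustration, and all chunks are selected by default. None of these choices are prescribed by the slides.

```python
import hashlib


def chunk_hash(chunk):
    """Hypothetical hash h: first 32 bits of SHA-1, as a natural number."""
    digest = hashlib.sha1(chunk.encode("utf-8")).hexdigest()
    return int(digest[:8], 16)


def word_ngrams(text, n):
    """Split a document into overlapping word n-grams (the chunks)."""
    words = text.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def chunk_fingerprint(text, n=3, select=None):
    """Fingerprint F_d: the set of hashcodes of the selected chunks.

    `select` is an optional selection heuristic; by default all chunks are kept.
    """
    chunks = word_ngrams(text, n)
    if select is not None:
        chunks = select(chunks)
    return {chunk_hash(c) for c in chunks}
```

Passing a different `select` function swaps the selection heuristic without touching the hashing step.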
Taxonomy of Fingerprinting Algorithms

Fingerprinting methods and algorithms:
❑ Chunking-based
   ❑ All chunks: n-gram model
   ❑ Collection-specific: rare chunks, SPEX, I-Match
   ❑ (Pseudo-)random
      ❑ Synchronized: shingling, prefix anchors, hashed breakpoints, winnowing
      ❑ Local: random, n-th chunk
   ❑ Cascading: super-, megashingling
❑ Similarity hashing

Chunking: $k$ chunks are selected from a document $d$.

Selection heuristics:
❑ all
❑ based on knowledge about $D$
❑ intelligent random choices
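The three families of selection heuristics can be illustrated with small stand-in functions. This is a rough sketch, not the named algorithms themselves: `select_rare` assumes a hypothetical table `doc_freq` of collection-wide chunk frequencies as the "knowledge about D", and `select_random` uses a seeded generator so that the choice is repeatable.

```python
import random


def select_all(chunks):
    """Keep every chunk (the n-gram model)."""
    return list(chunks)


def select_rare(chunks, doc_freq, k):
    """Collection-specific: keep the k chunks that are rarest in the collection.

    `doc_freq` maps a chunk to its frequency in D (hypothetical structure);
    unseen chunks are treated as frequency 0.
    """
    return sorted(chunks, key=lambda c: doc_freq.get(c, 0))[:k]


def select_every_nth(chunks, step):
    """Local choice: keep every n-th chunk."""
    return chunks[::step]


def select_random(chunks, k, seed=0):
    """Pseudo-random choice of k chunks, repeatable for a fixed seed."""
    if k >= len(chunks):
        return list(chunks)
    return random.Random(seed).sample(chunks, k)
```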
Taxonomy of Fingerprinting Algorithms

Fingerprinting methods and algorithms:
❑ Similarity hashing
   ❑ Knowledge-based: fuzzy-fingerprinting
   ❑ Randomized: locality-sensitive hashing

Similarity hashing: $k$ particular hash functions $h_\varphi : \mathbf{D} \to U$, $U \subset \mathbb{N}$, with the property

$$h_\varphi(\mathbf{d}) = h_\varphi(\mathbf{d}_q) \;\Rightarrow\; \varphi(\mathbf{d}, \mathbf{d}_q) \ge 1 - \varepsilon, \quad \mathbf{d} \in \mathbf{D},\ 0 < \varepsilon \ll 1,$$

are used to generate $k$ hashcodes for a document $d$.

Hash function construction: domain knowledge vs. randomization.
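Because equal hashcodes imply high similarity, candidates for a query can be looked up by hash collision instead of comparing against every document. The class below is a minimal sketch of that idea; the name `FingerprintIndex` and its layout are assumptions for illustration, not part of the presented algorithms.

```python
from collections import defaultdict


class FingerprintIndex:
    """Maps each hashcode to the ids of the documents that produced it."""

    def __init__(self):
        self.buckets = defaultdict(set)

    def add(self, doc_id, fingerprint):
        """Store a document under every hashcode of its fingerprint."""
        for code in fingerprint:
            self.buckets[code].add(doc_id)

    def candidates(self, fingerprint):
        """Return all documents sharing at least one hashcode with the query."""
        found = set()
        for code in fingerprint:
            found |= self.buckets.get(code, set())
        return found
```

A lookup touches only the buckets of the query's own hashcodes, so its cost does not grow with the number of unrelated documents in the index.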
Taxonomy of Fingerprinting Algorithms

[Figure: the taxonomy of fingerprinting methods, divided into an upper and a lower box]

❑ For algorithms in the upper box, fingerprints have to share more than one number, $k_d > 1$, to be recognized as duplicates.
❑ For algorithms in the lower box, fingerprints need to share only one number, $k_d = 1$, to be recognized as duplicates.
(Cascaded) Chunking

(Super-)Shingling (SSh) [Broder 97]

Shingling:
❑ n-gram vector space model: $(w_1 w_2 w_3,\ w_2 w_3 w_4,\ w_3 w_4 w_5,\ \ldots,\ w_{m-2} w_{m-1} w_m)$
➜ Hash value computation: (12354, 15695, 59634, ..., 43586)
➜ Fingerprint (random choice): {12354, 15695, ..., 55476}

Super-shingling (cascade):
➜ String representation of the fingerprint: "12354 15695 ... 55476", which is shingled again.
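The shingling cascade can be sketched as below. Several details are assumptions made only for this illustration: the hash is a truncated MD5, the slide's "random choice" is approximated by keeping the `k` smallest hash values (a common deterministic stand-in), and the parameter names `n`, `k`, and `m` are hypothetical.

```python
import hashlib


def h(text, bits=32):
    """Hypothetical hash function: MD5 reduced to `bits` bits."""
    digest = hashlib.md5(text.encode("utf-8")).hexdigest()
    return int(digest, 16) % (1 << bits)


def shingles(tokens, n):
    """All overlapping n-grams over a token sequence."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def shingle_fingerprint(text, n=3, k=8):
    """Hash every word n-gram and keep the k smallest values, sorted."""
    hashes = {h(s) for s in shingles(text.split(), n)}
    return sorted(hashes)[:k]


def supershingle_fingerprint(text, n=3, k=8, m=2):
    """Cascade: write the fingerprint as a string of numbers and shingle it again."""
    tokens = [str(value) for value in shingle_fingerprint(text, n, k)]
    return {h(s) for s in shingles(tokens, m)}
```

Each supershingle covers `m` consecutive shingle hashes, so a single shared supershingle already indicates several shared shingles.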
Similarity Hashing

Fuzzy-Fingerprinting (FF) [Stein 05]

[Figure: a priori probabilities of prefix classes in the BNC and distribution of prefix classes in the sample ➜ normalization and difference computation ➜ fuzzification ➜ fingerprint {213235632, 157234594}]

All words having the same prefix belong to the same prefix class.
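The steps in the figure can be sketched roughly as follows. This is a heavily simplified illustration under several assumptions that are not from the slides: prefix classes are taken to be first letters, the a priori probabilities are a uniform placeholder instead of BNC statistics, the thresholds of the two fuzzification schemes are invented, and the fuzzified deviations are encoded as digits of one number.

```python
import string

PREFIX_CLASSES = string.ascii_lowercase  # assumption: one class per first letter
# Placeholder for the a priori probabilities; the slides take these from the BNC.
EXPECTED = {c: 1.0 / len(PREFIX_CLASSES) for c in PREFIX_CLASSES}


def prefix_distribution(text):
    """Relative frequency of each prefix class in the document."""
    counts = {c: 0 for c in PREFIX_CLASSES}
    total = 0
    for word in text.lower().split():
        if word[0] in counts:
            counts[word[0]] += 1
            total += 1
    if total == 0:
        return {c: 0.0 for c in PREFIX_CLASSES}
    return {c: counts[c] / total for c in PREFIX_CLASSES}


def fuzzify(deviation, thresholds):
    """Index of the first interval whose upper bound covers |deviation|."""
    size = abs(deviation)
    for index, bound in enumerate(thresholds):
        if size <= bound:
            return index
    return len(thresholds)


def fuzzy_hash(text, thresholds):
    """Encode the fuzzified deviations as digits of a single natural number."""
    dist = prefix_distribution(text)
    base = len(thresholds) + 1
    code = 0
    for position, c in enumerate(PREFIX_CLASSES):
        deviation = dist[c] - EXPECTED[c]
        code += fuzzify(deviation, thresholds) * base ** position
    return code


def fuzzy_fingerprint(text, schemes=((0.02, 0.08), (0.04, 0.12))):
    """One hashcode per fuzzification scheme (the thresholds are invented)."""
    return {fuzzy_hash(text, thresholds) for thresholds in schemes}
```

Small changes to a document shift the prefix class distribution only slightly, so most deviations stay in the same fuzzy interval; that is the intuition behind mapping similar documents to the same hashcode.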
Similarity Hashing

Locality-Sensitive Hashing (LSH) [Indyk and Motwani 98, Datar et al. 04]

[Figure: vector space with sample document $\mathbf{d}$ and random vectors $a_1, a_2, \ldots, a_r$ ➜ dot product computation $a_i^T \cdot \mathbf{d}$ ➜ real number line ➜ fingerprint {213235632}]

The results of the $r$ dot products are summed.
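A minimal sketch of this scheme is given below. The slide states only that random vectors are used and that the results of the dot products are summed; drawing the vectors from a Gaussian and quantizing each projection with a bucket `width` before summing are assumptions made for this illustration, as are all names.

```python
import math
import random


def make_lsh(dim, r, width=4.0, seed=0):
    """Build one hash function from r random vectors of dimension `dim`."""
    rng = random.Random(seed)
    vectors = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(r)]

    def lsh(d):
        """Project d onto each random vector, quantize, and sum the results."""
        total = 0
        for a in vectors:
            dot = sum(a_i * d_i for a_i, d_i in zip(a, d))
            total += math.floor(dot / width)  # position on the real number line
        return total

    return lsh
```

Two functions built with the same `seed` are identical, so a document vector always receives the same hashcode; several functions with different seeds would yield a fingerprint with more than one number.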
Evaluation Corpus

Wikipedia Snapshot including all Revisions

Existing standard corpora (TREC, Reuters) are not suited for large-scale evaluations of near-duplicate detection algorithms.

Wikipedia is a rich resource of versioned and revisioned documents.

Benchmark data:
❑ approx. 6 million pages (documents)
❑ approx. 80 million revisions
❑ XML file of approx. 1 TB