New Issues in Near-duplicate Detection
Martin Potthast and Benno Stein
Bauhaus University Weimar, Web Technology and Information Systems

Outline: Introduction, Taxonomy of Algorithms, Algorithms, Evaluation Corpus, Summary

GfKl’07, Mar. 7th, 2007, Stein/Potthast

Motivation

About 30% of the Web is redundant. [Fetterly 03, Broder 06]

Content redundancy occurs in various forms [Zobel 06]:
❑ Mirrors
❑ Crawl artifacts, such as the same text with a different date or a different advertisement, available through multiple URLs
❑ Versions created for different delivery mechanisms (HTML, PDF, etc.)
❑ Annotated and unannotated copies of the same document
❑ Policies and procedures for the same purpose in different legislatures
❑ “Boilerplate” text such as license agreements or disclaimers
❑ Shared context such as summaries of other material or lists of links
❑ Syndicated news articles delivered in different venues
❑ Revisions and versions
❑ Reuse and republication of text (legitimate and otherwise)

Nearly exact copies and modified copies with high similarity.
➜ Near-duplicate documents.

Motivation

Contributions of near-duplicate detection to real-world tasks:
❑ Index size reduction
❑ Search result cleaning
❑ Web crawl prioritization
❑ Plagiarism analysis

Our contributions to near-duplicate detection:
❑ Classification of near-duplicate detection algorithms
❑ Presentation of a new tailored corpus for evaluation
❑ Comparison of current algorithms (including so far unconsidered hashing technologies)

Formalization

Consider a set of documents $D$. Given a document $d_q$: find all documents $D_q \subset D$ with a high similarity to $d_q$.
➜ Naive approach: compare $d_q$ with each $d \in D$.

In detail: construct document models for $D$ and $d_q$, obtaining $\mathbf{D}$ and $\mathbf{d}_q$. Employ a similarity function $\varphi : \mathbf{D} \times \mathbf{D} \to [0, 1]$.

❑ Near-duplicate detection algorithms rely on purposefully constructed document models, called fingerprints.
❑ A fingerprint is a set of $k$ natural numbers, which are computed on the basis of document extracts.
❑ Two documents are considered as duplicates if their fingerprints share at least $k_d$, $k_d < k$, numbers.
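
The duplicate criterion on fingerprints can be sketched as follows. This is a minimal illustration, not the authors' implementation: the names `is_near_duplicate` and `naive_search` and the dictionary layout of the collection are assumptions made for the example.

```python
from typing import Dict, FrozenSet, List

# A fingerprint is modeled as a set of natural numbers.
Fingerprint = FrozenSet[int]


def is_near_duplicate(f_d: Fingerprint, f_q: Fingerprint, k_d: int) -> bool:
    """True if the two fingerprints share at least k_d numbers."""
    return len(f_d & f_q) >= k_d


def naive_search(f_q: Fingerprint,
                 collection: Dict[str, Fingerprint],
                 k_d: int) -> List[str]:
    """Naive approach: compare the query fingerprint with every document."""
    return [doc_id for doc_id, f_d in collection.items()
            if is_near_duplicate(f_d, f_q, k_d)]


if __name__ == "__main__":
    # Hypothetical fingerprints with k = 4 numbers each.
    collection = {
        "a": frozenset({1, 2, 3, 4}),
        "b": frozenset({1, 9, 8, 7}),
        "c": frozenset({5, 6, 7, 8}),
    }
    print(naive_search(frozenset({1, 2, 3, 9}), collection, k_d=2))
```

The naive search is linear in the collection size; in practice the fingerprint numbers would more likely serve as keys of an inverted index or hash table, so that candidates are found by lookup.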

Taxonomy of Fingerprinting Algorithms

Fingerprinting methods
  Chunking-based fingerprinting
  Similarity hashing

Chunking: $k$ chunks are selected from a document $d$.

[Figure: $d$ ➜ ($\sigma$) chunks $c_1, c_2$ ➜ hashcodes $p_1 = h(c_1)$, $p_2 = h(c_2)$, e.g. 351427 and 125497 ➜ fingerprint $F_d = \{p_1, p_2\}$, e.g. {351427, 125497}]

Chunks are also called n-grams or shingles.
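
The pipeline document ➜ chunks ➜ hashcodes ➜ fingerprint might look like this in code. It is a sketch under assumptions: chunks are word n-grams, `h` is a truncated SHA-256 (the slides do not name a hash function), and the selection $\sigma$ keeps the $k$ smallest hashcodes, which is only one of several possible heuristics.

```python
import hashlib
from typing import List, Set


def chunks(text: str, n: int = 3) -> List[str]:
    """All word n-grams (shingles) of the text, in document order."""
    words = text.lower().split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def h(chunk: str) -> int:
    """Map a chunk to a natural number (assumed hash: first 4 bytes of SHA-256)."""
    digest = hashlib.sha256(chunk.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")


def fingerprint(text: str, n: int = 3, k: int = 2) -> Set[int]:
    """F_d: hashcodes of k selected chunks (assumed selection: k smallest codes)."""
    codes = sorted({h(c) for c in chunks(text, n)})
    return set(codes[:k])


if __name__ == "__main__":
    print(chunks("a b c d e"))
    print(fingerprint("a b c d e", n=3, k=2))
```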

Taxonomy of Fingerprinting Algorithms

Fingerprinting methods and algorithms
  Chunking-based fingerprinting
    All chunks: n-gram model
    Collection-specific: rare chunks, SPEX, I-Match
    Local
      Synchronized: shingling, prefix anchors, hashed breakpoints, winnowing
      (Pseudo-)random: random, n-th chunk
    Cascading: super-, megashingling
  Similarity hashing

Chunking: $k$ chunks are selected from a document $d$.

Selection heuristics:
❑ all
❑ based on knowledge about $D$
❑ intelligent random choices
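
The three kinds of selection heuristics can be contrasted in a few lines. The functions below are simplified stand-ins written for illustration, not the published algorithms (SPEX, I-Match, winnowing etc. are considerably more involved); `doc_freq`, a map from chunk to its document frequency in $D$, is a hypothetical input.

```python
import random
from typing import Dict, List


def select_all(chunks: List[str]) -> List[str]:
    """'All': keep every chunk (n-gram model)."""
    return list(chunks)


def select_rare(chunks: List[str], doc_freq: Dict[str, int], k: int) -> List[str]:
    """'Knowledge about D': keep the k chunks that are rarest in the collection."""
    return sorted(set(chunks), key=lambda c: (doc_freq.get(c, 0), c))[:k]


def select_every_nth(chunks: List[str], n: int) -> List[str]:
    """Local choice: keep every n-th chunk."""
    return chunks[::n]


def select_pseudo_random(chunks: List[str], k: int, seed: int = 0) -> List[str]:
    """Local choice: keep k pseudo-randomly chosen chunks, in document order."""
    if k >= len(chunks):
        return list(chunks)
    rng = random.Random(seed)
    positions = sorted(rng.sample(range(len(chunks)), k))
    return [chunks[i] for i in positions]


if __name__ == "__main__":
    cs = ["a b", "b c", "c d", "d e", "e f", "f g"]
    print(select_every_nth(cs, 2))
    print(select_rare(cs, {"a b": 5, "b c": 1, "c d": 3}, 2))
    print(select_pseudo_random(cs, 3))
```

A fixed seed makes the pseudo-random choice repeatable for one document, but position-based sampling is not stable under insertions; the synchronized methods in the taxonomy address that by choosing chunks based on their content.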

Taxonomy of Fingerprinting Algorithms

Fingerprinting methods and algorithms
  Similarity hashing
    Knowledge-based: fuzzy-fingerprinting
    Randomized: locality-sensitive hashing

Similarity Hashing: $k$ particular hash functions $h_\varphi : \mathbf{D} \to U$, $U \subset \mathbb{N}$, with the property

$$h_\varphi(\mathbf{d}) = h_\varphi(\mathbf{d}_q) \;\Rightarrow\; \varphi(\mathbf{d}, \mathbf{d}_q) \ge 1 - \varepsilon \quad \text{with } \mathbf{d} \in \mathbf{D},\ 0 < \varepsilon \ll 1$$

are used to generate $k$ hashcodes for a document $d$.

Hash function construction: domain knowledge vs. randomization.

Taxonomy of Fingerprinting Algorithms

For algorithms in the upper box of the taxonomy (chunking-based fingerprinting), fingerprints have to share more than one number, $k_d > 1$, to be recognized as duplicates.

For algorithms in the lower box (similarity hashing), fingerprints need to share only one number, $k_d = 1$, to be recognized as duplicates.

(Cascaded) Chunking

(Super-)Shingling (SSh) [Broder 97]

Shingling:
➜ n-gram vector space model: $(w_1 w_2 w_3,\ w_2 w_3 w_4,\ w_3 w_4 w_5,\ \ldots,\ w_{m-2} w_{m-1} w_m)$
➜ Hash value computation: (12354, 15695, 59634, ..., 43586)
➜ Fingerprint (random choice): {12354, 15695, ..., 55476}

Super-Shingling:
➜ String representation: "12354 15695 ... 55476"
➜ Cascade: the string representation is passed through the shingling steps again.
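
A compact sketch of the cascade, under stated assumptions: the hash is a truncated SHA-256, and the "random choice" of the slide is replaced by a deterministic pick of the $k$ smallest hash values so that the example is repeatable. Function names and parameters are illustrative, not taken from [Broder 97].

```python
import hashlib
from typing import List, Set


def shingles(tokens: List[str], n: int) -> List[str]:
    """All n-grams over a token sequence."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def hash_value(s: str) -> int:
    """Assumed hash: first 4 bytes of SHA-256 as an integer."""
    return int.from_bytes(hashlib.sha256(s.encode("utf-8")).digest()[:4], "big")


def shingle_fingerprint(text: str, n: int = 3, k: int = 4) -> List[int]:
    """Shingling: hash all word n-grams, keep k of them (here: the k smallest)."""
    codes = {hash_value(s) for s in shingles(text.split(), n)}
    return sorted(codes)[:k]


def super_shingle_fingerprint(text: str, n: int = 3, k: int = 4,
                              n_super: int = 2) -> Set[int]:
    """Super-shingling: shingle the string representation of the fingerprint."""
    as_string = " ".join(str(c) for c in shingle_fingerprint(text, n, k))
    return {hash_value(s) for s in shingles(as_string.split(), n_super)}


if __name__ == "__main__":
    text = "the quick brown fox jumps over the lazy dog near the river"
    print(shingle_fingerprint(text))
    print(super_shingle_fingerprint(text))
```

Because each super-shingle covers several shingles at once, a single shared super-shingle already indicates high overlap between the underlying shingle fingerprints.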

Similarity Hashing

Fuzzy-Fingerprinting (FF) [Stein 05]

➜ A priori probabilities of prefix classes in BNC; distribution of prefix classes in sample
➜ Normalization and difference computation
➜ Fuzzification
➜ Fingerprint {213235632, 157234594}

All words having the same prefix belong to the same prefix class.
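
The four steps can be mimicked with a toy example. Everything specific here is an assumption made for illustration: prefix classes are reduced to first letters, the a priori probabilities are a uniform placeholder rather than BNC statistics, and the thresholds of the two fuzzification schemes are arbitrary. It is not the procedure from [Stein 05], only its general shape.

```python
from collections import Counter
from typing import List, Sequence, Set

# Assumption: one prefix class per first letter.
PREFIX_CLASSES = "abcdefghijklmnopqrstuvwxyz"
# Placeholder for the a priori probabilities (real values would come from the BNC).
EXPECTED = {c: 1.0 / len(PREFIX_CLASSES) for c in PREFIX_CLASSES}


def deviation_vector(text: str) -> List[float]:
    """Normalized prefix class distribution of the sample minus the expectation."""
    words = [w for w in text.lower().split() if w[0] in EXPECTED]
    counts = Counter(w[0] for w in words)
    total = max(len(words), 1)
    return [counts[c] / total - EXPECTED[c] for c in PREFIX_CLASSES]


def fuzzify(deviations: Sequence[float], thresholds: Sequence[float]) -> int:
    """Map each deviation to a coarse level and encode the levels as one number."""
    base = len(thresholds) + 1
    code = 0
    for dev in deviations:
        level = sum(abs(dev) >= t for t in thresholds)
        code = code * base + level
    return code


def fuzzy_fingerprint(text: str,
                      schemes: Sequence[Sequence[float]] = ((0.02, 0.10),
                                                            (0.05, 0.20))) -> Set[int]:
    """One hashcode per fuzzification scheme."""
    deviations = deviation_vector(text)
    return {fuzzify(deviations, scheme) for scheme in schemes}


if __name__ == "__main__":
    print(fuzzy_fingerprint("near duplicate detection with fuzzy fingerprints"))
```

Since only the coarse deviation levels enter the hashcode, small edits to a document tend to leave it unchanged, which is the behavior similarity hashing aims for.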

Similarity Hashing

Locality-Sensitive Hashing (LSH) [Indyk and Motwani 98, Datar et al. 04]

Vector space with sample document $\mathbf{d}$ and random vectors $a_1, a_2, \ldots, a_r$
➜ Dot product computation $a_i^T \cdot \mathbf{d}$
➜ Real number line
➜ Fingerprint {213235632}

The results of the $r$ dot products are summed.
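
A minimal sketch of this scheme, assuming Gaussian random vectors and a fixed bucket width for cutting the real number line into intervals. The slide itself does not specify how the summed value is turned into a hashcode, so the `bucket_width` quantization is an assumption borrowed from the general idea of p-stable LSH.

```python
import math
import random
from typing import List, Sequence


def random_vectors(r: int, dim: int, seed: int = 0) -> List[List[float]]:
    """r random vectors a_1..a_r with standard normal components."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(r)]


def lsh_hashcode(d: Sequence[float],
                 vectors: Sequence[Sequence[float]],
                 bucket_width: float = 1.0) -> int:
    """Sum the r dot products a_i^T d and map the sum to an interval index."""
    total = sum(sum(a_j * d_j for a_j, d_j in zip(a, d)) for a in vectors)
    return math.floor(total / bucket_width)


if __name__ == "__main__":
    vectors = random_vectors(r=3, dim=4)
    d = [0.2, 0.0, 0.5, 0.1]          # hypothetical document vector
    d_similar = [0.2, 0.0, 0.5, 0.11]  # slightly modified copy
    print(lsh_hashcode(d, vectors), lsh_hashcode(d_similar, vectors))
```

Vectors that are close to each other usually land in the same interval, but not always: a pair straddling an interval boundary gets different hashcodes, which is why several such hash functions are typically combined.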

Evaluation Corpus

Wikipedia Snapshot including all Revisions

Existing standard corpora (TREC, Reuters) are not suited for large-scale near-duplicate detection algorithm evaluations.

Wikipedia is a rich resource of versioned and revisioned documents.

Benchmark data:
❑ approx. 6 million pages (documents)
❑ approx. 80 million revisions
❑ XML file of approx. 1 TB
