Applying Hash-based Indexing in Text-based Information Retrieval

Benno Stein and Martin Potthast
Bauhaus University Weimar, Web-Technology and Information Systems
DIR’07, Mar. 29th, 2007

Outline: Introduction, Hash-based Indexing Methods, Comparative Study, Σ

Text-based Information Retrieval (TIR): Motivation

Consider a set of documents D.

Term query: given a set of query terms, find all documents D′ ⊂ D containing the query terms.
➜ Implemented by well-known web search engines.
➜ Best practice: Index D using an inverted file.
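To make the inverted-file idea concrete, here is a minimal sketch in Python. It is not taken from the slides: the whitespace tokenizer, the function names, and the three sample documents are assumptions chosen for illustration only.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the set of ids of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():  # naive whitespace tokenizer (assumption)
            index[term].add(doc_id)
    return index

def term_query(index, terms):
    """Return the ids of all documents that contain every query term."""
    postings = [index.get(t.lower(), set()) for t in terms]
    return set.intersection(*postings) if postings else set()

# Hypothetical sample collection D.
docs = {
    1: "hash based indexing for text retrieval",
    2: "inverted file indexing for web search",
    3: "nearest neighbour search with hashing",
}
index = build_inverted_index(docs)
result = term_query(index, ["indexing", "for"])  # documents 1 and 2
```

A term query touches only the posting sets of the query terms, so its cost does not grow with the documents that contain none of them.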

Text-based Information Retrieval (TIR): Motivation

Consider a set of documents D.

Document query: given a document d, find all documents D′ ⊂ D with a high similarity to d.
➜ Use cases: plagiarism analysis, query by example
➜ Naive approach: Compare d with each d′ ∈ D.

In detail:
❑ Construct document models for D and d, obtaining the models D and d.
❑ Employ a similarity function ϕ : D × D → [0, 1].

Is it possible to be faster than the naive approach?
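The naive approach can be sketched as a linear scan. The slides do not fix a document model or a similarity function, so this sketch assumes term-frequency vectors as the model and cosine similarity as ϕ; all names are hypothetical.

```python
import math
from collections import Counter

def model(text):
    """Document model (assumption): a term-frequency vector."""
    return Counter(text.lower().split())

def cosine(u, v):
    """Similarity function phi with values in [0, 1] for non-negative vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def document_query(d, docs, threshold):
    """Compare d with every document in docs: O(|D|) similarity computations."""
    d_model = model(d)
    return [doc_id for doc_id, text in docs.items()
            if cosine(d_model, model(text)) >= threshold]

# Hypothetical sample collection.
docs = {1: "a b c", 2: "a b d", 3: "x y z"}
similar = document_query("a b c", docs, threshold=0.6)
```

Every query pays for one similarity computation per document, which is the cost that hash-based indexing tries to avoid.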

Background: Nearest Neighbour Search

Given a set D of m-dimensional points and a point d: Find the point d′ ∈ D which is nearest to d.

Finding d′ cannot be done better than in O(|D|) time if m exceeds 10. [Weber et al. 1998]

In our case: 1,000 ≪ m < 1,000,000

Background: Approximate Nearest Neighbour Search

Given a set D of m-dimensional points and a point d: Find some points D′ ⊂ D from a certain ε-neighbourhood of d.

[Figure: ε-neighbourhood]

Finding D′ can be done in O(1) time with high probability by means of hashing. [Indyk and Motwani 1998]

The dimensionality m does not affect the runtime of their algorithm.

Text-based Information Retrieval (TIR): Nearest Neighbour Search

Retrieval tasks of index-based retrieval and their use cases:

Grouping
❑ Categorization: focused search, efficient search (cluster hypothesis)
❑ Near-duplicate detection: preparation of search results
Similarity search
❑ Partial document similarity: plagiarism analysis
❑ Complete document similarity: query by example
Classification
❑ directory maintenance

Approximate retrieval results are often acceptable.

Similarity Hashing: Introduction

With standard hash functions collisions occur accidentally. In similarity hashing collisions shall occur purposefully, where the purpose is “high similarity”.

Given a similarity function ϕ, a hash function h_ϕ : D → U with U ⊂ N resembles ϕ if it has the following property [Stein 2005]:

h_ϕ(d) = h_ϕ(d′) ⇒ ϕ(d, d′) ≥ 1 − ε, with d, d′ ∈ D, 0 < ε ≪ 1

Similarity Hashing: Index Construction

Given a similarity hash function h_ϕ, a hash index µ_h : D → P(D), where P(D) is the power set of D, is constructed using
❑ a hash table T
❑ a standard hash function h : U → {1, ..., |T|}

To index a set of documents D given their models D,
❑ compute for each d ∈ D its hash value h_ϕ(d)
❑ store a reference to d in T at storage position h(h_ϕ(d))

To search for documents similar to d given its model d,
❑ return the bucket in T at storage position h(h_ϕ(d))
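The indexing and search steps above can be sketched as follows. The similarity hash used here is a deliberately crude stand-in (documents with the same vocabulary collide) and is not FF or LSH; `toy_similarity_hash`, `HashIndex`, and the table size are hypothetical names and values.

```python
import zlib

def toy_similarity_hash(text):
    """Hypothetical stand-in for h_phi: documents with equal vocabulary collide."""
    vocabulary = sorted(set(text.lower().split()))
    return zlib.crc32(" ".join(vocabulary).encode("utf-8"))

class HashIndex:
    def __init__(self, table_size=1024):
        # Hash table T: one bucket (list of document references) per position.
        self.table = [[] for _ in range(table_size)]

    def _position(self, u):
        # Standard hash function h: maps a value from U to a table position.
        return u % len(self.table)

    def add(self, doc_id, text):
        """Store a reference to the document at position h(h_phi(d))."""
        self.table[self._position(toy_similarity_hash(text))].append(doc_id)

    def search(self, text):
        """Return the bucket at position h(h_phi(d))."""
        return list(self.table[self._position(toy_similarity_hash(text))])

index = HashIndex()
index.add(1, "hash based indexing")
index.add(2, "indexing based hash hash")   # same vocabulary as document 1
index.add(3, "inverted file")
candidates = index.search("based hash indexing")
```

A search costs one hash computation and one table lookup, independent of |D|. Because h may map different values of h_ϕ to the same position, a bucket can also hold accidental collisions; a practical index would store h_ϕ(d) alongside each reference and filter on it.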

Similarity Hash Functions: Fuzzy-Fingerprinting (FF) [Stein 2005]

[Figure: the fuzzy-fingerprinting pipeline]
➜ A priori probabilities of prefix classes in BNC; distribution of prefix classes in sample
➜ Normalization and difference computation
➜ Fuzzification
➜ Fingerprint {213235632, 157234594}

All words having the same prefix belong to the same prefix class.
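The pipeline above might be sketched as below. This is a simplified illustration, not the authors' implementation: the four prefix classes (grouped by first letter), the uniform expected probabilities (not the BNC values), the three-interval fuzzification, and the base-3 encoding are all assumptions.

```python
# Hypothetical prefix classes: words are grouped by their first letter.
PREFIX_CLASSES = ["abcdef", "ghijkl", "mnopqr", "stuvwxyz"]
# Hypothetical a priori probabilities (uniform; NOT the BNC values).
EXPECTED = [0.25, 0.25, 0.25, 0.25]

def prefix_distribution(text):
    """Normalized distribution of prefix classes in the sample."""
    counts = [0] * len(PREFIX_CLASSES)
    for word in text.lower().split():
        for i, letters in enumerate(PREFIX_CLASSES):
            if word[0] in letters:
                counts[i] += 1
                break
    total = sum(counts)
    return [c / total if total else 0.0 for c in counts]

def fuzzify(deviation, threshold):
    """Map a deviation to one of three intervals: below, near, above expectation."""
    if deviation < -threshold:
        return 0
    if deviation > threshold:
        return 2
    return 1

def fuzzy_fingerprint(text, thresholds=(0.1, 0.2)):
    """One hash value per fuzzification scheme; the set is the fingerprint."""
    deviations = [p - e for p, e in zip(prefix_distribution(text), EXPECTED)]
    fingerprint = set()
    for threshold in thresholds:
        value = 0
        for deviation in deviations:
            value = value * 3 + fuzzify(deviation, threshold)  # base-3 encoding
        fingerprint.add(value)
    return fingerprint

fp1 = fuzzy_fingerprint("apple banana grape melon sun tree")
fp2 = fuzzy_fingerprint("apple cherry grape melon sun tree")   # one word changed
fp3 = fuzzy_fingerprint("apple avocado banana cherry date egg")
```

In this toy setting the first two texts share a fingerprint because replacing one word leaves the prefix-class distribution unchanged, while the third text deviates strongly and hashes elsewhere.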

Similarity Hash Functions: Locality-Sensitive Hashing (LSH) [Indyk and Motwani 1998, Datar et al. 2004]

[Figure: the LSH pipeline]
➜ Vector space with sample document d and random vectors a_1, a_2, ..., a_k
➜ Dot product computation a_i^T · d
➜ Real number line
➜ Fingerprint {213235632}

The results of the k dot products are summed.
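A rough sketch of this scheme follows, assuming Gaussian random vectors and a fixed interval width on the real number line, in the spirit of Datar et al. The parameter names `width` and `seed`, the default values, and the helper functions are hypothetical, and details such as random offsets are omitted.

```python
import math
import random

def random_vectors(k, dim, seed=0):
    """One set of k random vectors a_1..a_k with Gaussian components."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(k)]

def lsh_hash(d, vectors, width=1.0):
    """Project d onto each random vector, map each dot product to an
    interval of the real number line, and sum the k interval indices."""
    total = 0
    for a in vectors:
        projection = sum(a_j * d_j for a_j, d_j in zip(a, d))
        total += math.floor(projection / width)
    return total

def lsh_fingerprint(d, vector_sets, width=1.0):
    """One hash value per set of random vectors; the set is the fingerprint."""
    return {lsh_hash(d, vectors, width) for vectors in vector_sets}

vectors = random_vectors(k=4, dim=3)
d = [1.0, 0.0, 2.0]
hash_value = lsh_hash(d, vectors)
fingerprint = lsh_fingerprint(d, [random_vectors(4, 3, seed=s) for s in range(2)])
```

Nearby vectors tend to fall into the same intervals and therefore tend to receive the same hash value; a larger `width` makes collisions more likely.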

Similarity Hash Functions: Adjusting Recall and Precision

[Figure: two hash functions h_ϕ and h′_ϕ]

Recall:
(FF) number of fuzzy schemes.
(LSH) number of random vector sets.

A set of hash values per document is called fingerprint.

Precision:
(FF) number of prefix classes or number of intervals per fuzzy scheme.
(LSH) number of random vectors.
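One way to read the recall adjustment is that a document is stored under every hash value of its fingerprint, and a query matches if at least one value is shared. The sketch below illustrates that reading; the class name and the integer hash values are made up for the example.

```python
from collections import defaultdict

class FingerprintIndex:
    """Index documents under every hash value of their fingerprint."""

    def __init__(self):
        self.buckets = defaultdict(set)

    def add(self, doc_id, fingerprint):
        for value in fingerprint:
            self.buckets[value].add(doc_id)

    def search(self, fingerprint):
        """Union of all buckets hit by any hash value of the query fingerprint."""
        result = set()
        for value in fingerprint:
            result |= self.buckets.get(value, set())
        return result

index = FingerprintIndex()
index.add("a", {1, 7})   # hypothetical fingerprints with two hash values each
index.add("b", {2, 7})
index.add("c", {3})
matches = index.search({7})
```

Adding hash values per document gives similar documents more chances to collide, which raises recall. Making each individual hash function finer (more prefix classes, intervals, or random vectors) makes collisions rarer and more meaningful, which raises precision.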

Experimental Setting

Three test collections for three retrieval situations:
1. Web results: 100,000 documents from a focused search.
➜ Documents as Web retrieval systems return them.
2. Plagiarism corpus: 3,000 documents with high similarity.
➜ Documents as they appear in plagiarism analysis.
3. Wikipedia Revision corpus: 6 million documents, 80 million revisions.
➜ Documents as they appear in social software, plagiarism analysis, and the Web.
❑ first revision of each document used as query document d
❑ comparison with each of d’s revisions
❑ comparison with d’s immediate succeeding document

Experimental Setting

[Figure: percentage of similarities (logarithmic axis, 0.0001 to 1) per similarity interval (0 to 1) for the Wikipedia and Web results collections]

Precision and Recall were recorded for similarity thresholds ranging from 0 to 1.
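How precision and recall depend on the similarity threshold can be illustrated with a small sketch. The exact evaluation protocol is not given on the slides; this version assumes that a document pair counts as retrieved when the fingerprints collide and as relevant when its similarity reaches the threshold. The sample pairs and the convention of returning 1.0 for empty sets are made up.

```python
def precision_recall(pairs, threshold):
    """pairs: list of (similarity, collided) tuples, one per document pair.

    retrieved = pairs whose fingerprints collided
    relevant  = pairs whose similarity is at least the threshold
    """
    retrieved = [s for s, collided in pairs if collided]
    relevant = [s for s, _ in pairs if s >= threshold]
    hits = [s for s in retrieved if s >= threshold]
    precision = len(hits) / len(retrieved) if retrieved else 1.0
    recall = len(hits) / len(relevant) if relevant else 1.0
    return precision, recall

# Hypothetical measurements for four document pairs.
pairs = [(0.95, True), (0.9, False), (0.5, True), (0.2, False)]
at_high = precision_recall(pairs, 0.8)   # one of two collisions is relevant
at_low = precision_recall(pairs, 0.4)    # both collisions are relevant
```

Sweeping the threshold from 0 to 1 and plotting both values yields curves of the kind shown in the results.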

Results

[Figure: Recall over Similarity (0 to 1) for FF and LSH on the Wikipedia Revision Corpus]

Results

[Figure: Precision over Similarity (0 to 1) for FF and LSH on the Wikipedia Revision Corpus]

Results

[Figure: Recall and Precision over Similarity (0 to 1) for FF and LSH, with one panel each for Web results and Plagiarism corpus]

Summary

Similarity hashing may contribute to various retrieval tasks.

Comparison of similarity hash functions:
❑ FF outperforms LSH in terms of Precision and Recall.
❑ FF constructs significantly smaller fingerprints.

Conclusions:
➜ Both hash-based indexing methods are applicable to TIR.
➜ The incorporation of domain knowledge significantly increases retrieval performance.

None of the hash-based indexing methods is limited to TIR. The only prerequisite is a reasonable vector representation.
