Applying Hash-based Indexing in Text-based Information Retrieval
Benno Stein and Martin Potthast
Bauhaus University Weimar, Web-Technology and Information Systems
DIR’07, Mar. 29th, 2007
Outline: Introduction, Hash-based Indexing Methods, Comparative Study, Σ
Text-based Information Retrieval (TIR): Motivation
Consider a set of documents D.
Term query: given a set of query terms, find all documents D′ ⊂ D containing the query terms.
➜ Implemented by well-known web search engines.
➜ Best practice: Index D using an inverted file.
Text-based Information Retrieval (TIR): Motivation
Consider a set of documents D.
Document query: given a document d, find all documents D′ ⊂ D with a high similarity to d.
➜ Use cases: plagiarism analysis, query by example
➜ Naive approach: Compare d with each d′ ∈ D.
In detail:
❑ Construct document models for D and d, obtaining the models 𝐃 and 𝐝.
❑ Employ a similarity function ϕ : 𝐃 × 𝐃 → [0, 1].
Is it possible to be faster than the naive approach?
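The naive approach can be sketched as a linear scan. This is only an illustration under assumptions the slides do not state: the document model is taken to be a term-frequency vector, ϕ is taken to be the cosine similarity, and the names `model`, `cosine`, and `naive_query` are hypothetical.

```python
import math
from collections import Counter


def model(text):
    """Hypothetical document model: a term-frequency vector (an assumption)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Assumed similarity function phi with values in [0, 1]."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    # Clamp to 1.0 to absorb floating-point rounding.
    return min(1.0, dot / (norm_a * norm_b))


def naive_query(d, docs, threshold):
    """Compare d with every document in docs: |D| similarity computations."""
    d_model = model(d)
    return [doc for doc in docs if cosine(d_model, model(doc)) >= threshold]
```

Every query touches every document, which is the cost that the hash-based methods in this talk try to avoid.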
Background: Nearest Neighbour Search
Given a set D of m-dimensional points and a point d: Find the point d′ ∈ D which is nearest to d.
Finding d′ cannot be done better than in O(|D|) time if m exceeds 10. [Weber et al. 1998]
In our case: 1,000 ≪ m < 1,000,000
Background: Approximate Nearest Neighbour Search
Given a set D of m-dimensional points and a point d: Find some points D′ ⊂ D from a certain ε-neighbourhood of d.
[Figure: the ε-neighbourhood of d]
Finding D′ can be done in O(1) time with high probability by means of hashing. [Indyk and Motwani 1998]
The dimensionality m does not affect the runtime of their algorithm.
Text-based Information Retrieval (TIR): Nearest Neighbour Search
Retrieval tasks of index-based retrieval and their use cases:
❑ Grouping / Categorization: focused search, efficient search (cluster hypothesis)
❑ Grouping / Near-duplicate detection: preparation of search results
❑ Similarity search / Partial document similarity: plagiarism analysis
❑ Similarity search / Complete document similarity: query by example
❑ Classification: directory maintenance
Approximate retrieval results are often acceptable.
Similarity Hashing: Introduction
With standard hash functions, collisions occur accidentally.
In similarity hashing, collisions shall occur purposefully, where the purpose is “high similarity”.
Given a similarity function ϕ over document models 𝐃, a hash function h_ϕ : 𝐃 → U with U ⊂ N resembles ϕ if it has the following property [Stein 2005]:
h_ϕ(d) = h_ϕ(d′) ⇒ ϕ(d, d′) ≥ 1 − ε, with d, d′ ∈ 𝐃, 0 < ε ≪ 1
Similarity Hashing: Index Construction
Given a similarity hash function h_ϕ, a hash index µ_h : 𝐃 → 𝒟 with 𝒟 = P(D), the power set of D, is constructed using
❑ a hash table T
❑ a standard hash function h : U → {1, ..., |T|}
To index a set of documents D given their models 𝐃,
❑ compute for each model 𝐝 ∈ 𝐃 its hash value h_ϕ(𝐝)
❑ store a reference to d in T at storage position h(h_ϕ(𝐝))
To search for documents similar to d given its model 𝐝,
❑ return the bucket in T at storage position h(h_ϕ(𝐝))
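The indexing and search steps above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the class name `SimilarityHashIndex` is hypothetical, a Python dictionary stands in for the hash table T together with the standard hash function h, and `toy_h_phi` is a made-up similarity hash that simply quantizes vector components.

```python
from collections import defaultdict


def toy_h_phi(model):
    """Hypothetical similarity hash (an assumption, not FF or LSH).

    Quantizes each component of a vector with entries in [0, 1] into
    five levels and encodes the levels as one natural number.
    """
    return sum(int(x * 4) * 5 ** i for i, x in enumerate(model))


class SimilarityHashIndex:
    def __init__(self, h_phi):
        self.h_phi = h_phi
        # The dict plays the role of T; Python's built-in hashing of the
        # integer key plays the role of the standard hash function h.
        self.table = defaultdict(list)

    def add(self, doc_id, model):
        """Store a reference to the document in the bucket of its hash value."""
        self.table[self.h_phi(model)].append(doc_id)

    def query(self, model):
        """Return the bucket that the query model hashes to."""
        return list(self.table.get(self.h_phi(model), []))
```

A query costs one hash computation plus one table lookup, independent of the number of indexed documents.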
Similarity Hash Functions: Fuzzy-Fingerprinting (FF) [Stein 2005]
[Figure: processing pipeline for a sample document]
➜ A priori probabilities of prefix classes in the BNC, and distribution of prefix classes in the sample
➜ Normalization and difference computation
➜ Fuzzification
➜ Fingerprint, e.g. {213235632, 157234594}
All words having the same prefix belong to the same prefix class.
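The pipeline stages named on this slide can be illustrated with a heavily simplified sketch. Everything concrete in it is an assumption: prefix classes are reduced to four first letters, the expected probabilities in `EXPECTED` are made-up values rather than BNC statistics, a fuzzy scheme is modelled as a sorted tuple of deviation thresholds, and the name `fuzzy_fingerprint` is hypothetical.

```python
from collections import Counter

# ASSUMPTION: four toy prefix classes with made-up a priori probabilities.
PREFIX_CLASSES = ["a", "b", "c", "d"]
EXPECTED = {"a": 0.25, "b": 0.25, "c": 0.25, "d": 0.25}


def fuzzy_fingerprint(text, schemes):
    """Return one hash value per fuzzy scheme (a set of natural numbers)."""
    words = [w for w in text.lower().split() if w[:1] in EXPECTED]
    counts = Counter(w[0] for w in words)
    total = sum(counts.values()) or 1
    fingerprint = set()
    for thresholds in schemes:
        base = len(thresholds) + 1
        code = 0
        for prefix in PREFIX_CLASSES:
            # Normalization and difference computation.
            deviation = counts[prefix] / total - EXPECTED[prefix]
            # Fuzzification: index of the interval the deviation falls into.
            level = sum(deviation > t for t in thresholds)
            # Encode all interval indices as one number.
            code = code * base + level
        fingerprint.add(code)
    return fingerprint
```

Two documents with similar prefix class distributions fall into the same intervals and therefore receive the same hash value, which is the intended collision.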
Similarity Hash Functions: Locality-Sensitive Hashing (LSH) [Indyk and Motwani 1998, Datar et al. 2004]
[Figure: processing pipeline for a sample document]
➜ Vector space with sample document d and random vectors a_1, a_2, ..., a_k
➜ Dot product computation a_i^T · d
➜ Real number line
➜ Fingerprint, e.g. {213235632}
The results of the k dot products are summed.
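A minimal sketch of such a hash function is given below. The slide only states that k dot products are computed and summed; the quantization of each dot product into an interval of width `r` with a random offset follows the scheme of Datar et al. and is an assumption here, as are the Gaussian random vectors and the names `make_lsh`, `r`, and `seed`.

```python
import math
import random


def make_lsh(dim, k, r=1.0, seed=0):
    """Build one LSH function from a set of k random vectors."""
    rng = random.Random(seed)
    # ASSUMPTION: Gaussian entries, as used for Euclidean distance.
    vectors = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(k)]
    offsets = [rng.uniform(0.0, r) for _ in range(k)]

    def h_phi(d):
        total = 0
        for a, c in zip(vectors, offsets):
            dot = sum(a_j * d_j for a_j, d_j in zip(a, d))
            # Map the dot product to an interval index on the real line.
            total += math.floor((dot + c) / r)
        return total

    return h_phi
```

Nearby vectors tend to produce dot products in the same intervals, so they tend to receive the same hash value.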
Similarity Hash Functions: Adjusting Recall and Precision
Recall is adjusted via the number of hash functions (h_ϕ, h′_ϕ, ...) applied per document:
❑ (FF) number of fuzzy schemes
❑ (LSH) number of random vector sets
A set of hash values per document is called a fingerprint.
Precision is adjusted via the resolution of each hash function:
❑ (FF) number of prefix classes or number of intervals per fuzzy scheme
❑ (LSH) number of random vectors
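The effect of fingerprints on retrieval can be sketched with an index that stores each document under every hash value of its fingerprint. The slides do not describe how the fingerprint is stored; using one table per hash function and returning the union of the matching buckets is an assumption, and the class name `FingerprintIndex` is hypothetical.

```python
from collections import defaultdict


class FingerprintIndex:
    def __init__(self, hash_functions):
        self.hash_functions = list(hash_functions)
        # ASSUMPTION: one table per hash function.
        self.tables = [defaultdict(set) for _ in self.hash_functions]

    def fingerprint(self, model):
        """One hash value per hash function."""
        return [h(model) for h in self.hash_functions]

    def add(self, doc_id, model):
        for table, value in zip(self.tables, self.fingerprint(model)):
            table[value].add(doc_id)

    def query(self, model):
        """Union of all buckets hit by the query fingerprint."""
        result = set()
        for table, value in zip(self.tables, self.fingerprint(model)):
            result |= table.get(value, set())
        return result
```

Adding a hash function can only enlarge the result set, which tends to raise recall. Making a single hash function finer-grained shrinks its buckets, which tends to raise precision.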
Experimental Setting
Three test collections for three retrieval situations:
1. Web results: 100,000 documents from a focused search.
➜ Documents as Web retrieval systems return them.
2. Plagiarism corpus: 3,000 documents with high similarity.
➜ Documents as they appear in plagiarism analysis.
3. Wikipedia Revision corpus: 6 million documents, 80 million revisions.
➜ Documents as they appear in social software, plagiarism analysis, and the Web.
❑ first revision of each document used as query document d
❑ comparison with each of d’s revisions
❑ comparison with d’s immediate succeeding document
Experimental Setting
[Figure: percentage of similarities (axis from 0.0001 to 1) over similarity intervals (0 to 1) for the Wikipedia and Web results collections.]
Precision and Recall were recorded for similarity thresholds ranging from 0 to 1.
Results
[Figure: Recall over Similarity (0 to 1) for FF and LSH on the Wikipedia Revision Corpus.]
Results
[Figure: Precision over Similarity (0 to 1) for FF and LSH on the Wikipedia Revision Corpus.]
Results
[Figure: four panels showing Recall (top) and Precision (bottom) over Similarity (0 to 1) for FF and LSH on the Web results and the Plagiarism corpus.]
Summary
Similarity hashing may contribute to various retrieval tasks.
Comparison of similarity hash functions:
❑ FF outperforms LSH in terms of Precision and Recall.
❑ FF constructs significantly smaller fingerprints.
Conclusions:
➜ Both hash-based indexing methods are applicable to TIR.
➜ The incorporation of domain knowledge significantly increases retrieval performance.
None of the hash-based indexing methods is limited to TIR. The only prerequisite is a reasonable vector representation.