PP-Index: Using Permutation Prefixes for Efficient and Scalable Approximate Similarity Search Andrea Esuli andrea.esuli@isti.cnr.it Istituto di Scienza e Tecnologie dell’Informazione “A. Faedo” Consiglio Nazionale delle Ricerche Via Giuseppe Moruzzi, 1 — 56124 Pisa, Italy ISTI:Science seminar, May 12, 2009 Andrea Esuli (ISTI-CNR) PP-Index ISTI:Science 1 / 48
Outline
1. Introduction
2. The PP-Index
3. Experiments
4. Demo
5. Conclusions
1. Introduction
Subsections: Similarity search; Permutation based methods; Local similarity hashing methods.
Similarity search

The similarity search model involves:
a collection of objects $D$, belonging to a domain $O$;
a query object $q \in O$;
a distance function $d : O \times O \to \mathbb{R}^+$.

The goal is to sort the objects in $D$ by their distance from $q$ and to return the objects closest to $q$, which are considered the most similar.

Typically only the top-$k$ ranked objects are returned ($k$-NN query), or those within a maximum distance $r$ (range query). Choosing a meaningful value for $r$ is often difficult. $k$-NN queries are usually preferred, especially in end-user applications, also because they give direct control over the size of the result set.
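The two query types can be illustrated with a minimal sketch that scans the whole collection. The function names (`l2`, `knn_query`, `range_query`) and the sample points are assumptions made for illustration; they do not come from the slides.

```python
import heapq
import math

def l2(a, b):
    """Euclidean (L2) distance between two equal-length point tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_query(collection, q, k, dist=l2):
    """Return the k objects of the collection closest to q, nearest first."""
    return heapq.nsmallest(k, collection, key=lambda o: dist(q, o))

def range_query(collection, q, r, dist=l2):
    """Return every object within distance r of q, nearest first."""
    hits = [o for o in collection if dist(q, o) <= r]
    return sorted(hits, key=lambda o: dist(q, o))

# Hypothetical collection D in R^2 and query q (illustrative values only).
D = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (3.0, 3.0), (5.0, 1.0)]
q = (0.5, 0.0)

print(knn_query(D, q, k=2))       # result size is fixed by k
print(range_query(D, q, r=2.5))   # result size depends on the choice of r
```

The sketch shows why $k$-NN queries give direct control over the result size, while the size of a range query result depends entirely on how $r$ relates to the data.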
Similarity search

Example ($\mathbb{R}^2$, $L_2$):
[Figure 1: Range query.]
[Figure 2: $k$-NN query ($k = 5$).]
Approximate similarity search

Exhaustive search: for all $o_i \in D$, compute the distance $d(q, o_i)$ while keeping track of which objects satisfy the query. It does not scale to large collections.

Exact methods: they return the same results as exhaustive search, but use data structures that exploit the properties of the similarity space at hand (e.g., vector spaces, metric spaces) to reduce the number of objects of $D$ that are compared with the query. They are usually efficient, but still not efficient enough for huge collections.

Approximate methods: they accept that the results may contain errors (e.g., $d(q, o_1) < d(q, o_2)$, yet $o_2$ is in the results and $o_1$ is not) in exchange for efficiency. Approximation is acceptable, e.g., when $d$ is itself an approximation of a complex, human-perceived concept of similarity. Approximate methods scale to large collections. They are typically derived from "relaxed" exact methods. Natively approximate proposals include the local similarity hashing (LSH) index and the permutation-based index; the PP-Index takes inspiration from both.
Approximate similarity search

Approximation quality: What have we missed? What have we included? How much have we saved?

[Figure 3: Approximate result for a $k$-NN query ($k = 5$).]
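One possible way to turn the three questions into numbers is sketched below. The slide does not define specific measures, so the function names and the choice of measures (recall against the exact result, wrongly included objects, fraction of distance computations avoided) are assumptions.

```python
def recall(exact, approx):
    """What have we missed? Fraction of the exact result kept by the approximate one."""
    exact_set = set(exact)
    return len(exact_set & set(approx)) / len(exact_set)

def wrongly_included(exact, approx):
    """What have we included? Objects returned that are not in the exact result."""
    exact_set = set(exact)
    return [o for o in approx if o not in exact_set]

def saved_fraction(collection_size, distances_computed):
    """How much have we saved? Share of distance computations avoided
    with respect to an exhaustive scan of the collection."""
    return 1.0 - distances_computed / collection_size

# Hypothetical object identifiers and counts (illustrative values only).
exact_result = [1, 2, 3, 4, 5]
approx_result = [1, 2, 3, 6, 7]

print(recall(exact_result, approx_result))
print(wrongly_included(exact_result, approx_result))
print(saved_fraction(collection_size=10_000, distances_computed=100))
```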
Permutation based methods

Independently proposed by Amato and Savino [1] and Chavez et al. [2], using different data structures.

The idea: an object is represented by its view of the surrounding world. Intuitively, if two objects "see" the elements of a set of reference objects $R$ in the same order of (increasing) distance, they are likely to be close to each other.

Example: where am I likely to live if I see the main European cities in the following order? Rome, Milan, Bern, Marseilles, Munich, Luxembourg, Bonn, Vienna, Belgrade, Brussels, Barcelona, Paris, Berlin, Amsterdam, London, Copenhagen, Madrid, Istanbul, Dublin, Athens, Oslo, Stockholm, Lisbon, Helsinki.

[1] G. Amato and P. Savino, Approximate similarity search in metric spaces using inverted files, INFOSCALE 2008, pages 1-10.
[2] E. Chavez, K. Figueroa, and G. Navarro, Effective proximity retrieval by ordering permutations, IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(9):1647-1658, 2008.
Permutation based methods

The method:

A set of reference objects $R = \{r_0, \ldots, r_{|R|-1}\} \subset O$ is defined (e.g., by randomly selecting $|R|$ objects from $D$).

Every object $o_i \in D$ is then represented by a permutation $\Pi_{o_i}$ of $\langle 0, \ldots, |R|-1 \rangle$, i.e., the list of the identifiers of the reference objects, sorted by the distance of the corresponding reference objects from $o_i$.

The search process mainly consists of computing $\Pi_q$ and estimating the true distance $d(q, o_i)$ with a permutation-based distance $d'(\Pi_q, \Pi_{o_i})$, e.g., Spearman's footrule distance.

Amato and Savino have shown that using only the prefix $\Pi_{o_i}^{l}$ of length $l$ of the permutation $\Pi_{o_i}$ (e.g., $l = 100$ when $|R| = 500$) improves both efficiency and effectiveness.

The PP-Index adopts a permutation-based data representation model, using very short prefixes (e.g., $l = 6$ when $|R| = 1000$). Unlike previous approaches, the permutation prefixes are used only to quickly find a small set of candidate objects from $D$ for inclusion in the results, not to estimate their relative order.
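The representation described above can be sketched in a few lines. The helper names (`permutation`, `prefix`, `spearman_footrule`), the four reference points, and the sample objects are illustrative assumptions; the footrule is computed here as the sum, over reference identifiers, of the absolute difference of their positions in the two full permutations.

```python
import math

def l2(a, b):
    """Euclidean (L2) distance between two equal-length point tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def permutation(obj, refs, dist=l2):
    """Identifiers of the reference objects, sorted by increasing distance from obj."""
    return sorted(range(len(refs)), key=lambda i: dist(obj, refs[i]))

def prefix(perm, l):
    """First l identifiers of a permutation, as a hashable tuple."""
    return tuple(perm[:l])

def spearman_footrule(perm_a, perm_b):
    """Sum of absolute position differences between two full permutations."""
    pos_b = {ref_id: pos for pos, ref_id in enumerate(perm_b)}
    return sum(abs(pos - pos_b[ref_id]) for pos, ref_id in enumerate(perm_a))

# Hypothetical reference objects and data objects in R^2.
R = [(0, 0), (10, 0), (0, 10), (10, 10)]
o1, o2, far = (1, 2), (2, 1), (9, 8)

p1, p2, p_far = (permutation(o, R) for o in (o1, o2, far))
print(p1, p2, p_far)                  # [0, 2, 1, 3] [0, 1, 2, 3] [3, 1, 2, 0]
print(spearman_footrule(p1, p2))      # 2: nearby objects see R in a similar order
print(spearman_footrule(p1, p_far))   # 8: distant objects see R very differently
print(prefix(p1, 2), prefix(p2, 2))   # (0, 2) (0, 1)
```

Note that the two nearby objects share only the first element of their prefixes in this toy setting; with very short prefixes, the prefix acts as a coarse region label rather than as a distance estimate, which matches the candidate-selection role described above.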
Permutation based methods

[Figure 4: Regions of the 2-dimensional space identified by 6 randomly selected reference points, using the Euclidean distance, with full-length permutations (left) or permutation prefixes of length 3 (right).]
Local similarity hashing methods

A family $H$ of hash functions $h : O \to U$ is called $(r, \epsilon, p_1, p_2)$-sensitive, with $r, \epsilon > 0$ and $p_1 > p_2 > 0$, if for any $p, q \in O$:
if $d(p, q) \le r$ then $P[h(p) = h(q)] \ge p_1$;
if $d(p, q) > r(1 + \epsilon)$ then $P[h(p) = h(q)] \le p_2$;
for any function $h$ randomly selected from $H$.

Intuitively: two objects collide with a (high) probability $x_1 \ge p_1$ if they are within distance $r$ of each other, and with a (low) probability $x_2 \le p_2$ if they are farther apart than $r(1 + \epsilon)$.

LSH-Index [3]: $j$ randomly chosen functions $h_i \in H$ define a hash function $g(x) = (h_1(x), h_2(x), \ldots, h_j(x))$, i.e., the bad collision probability is significantly lowered, to at most $p_2^j$. $t$ different hash tables are built, based on randomly generated functions $g_1, \ldots, g_t$, in order to increase the good collision probability.

[3] P. Indyk and R. Motwani, Approximate nearest neighbors: towards removing the curse of dimensionality, STOC 1998, pages 604-613.
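The effect of $j$ and $t$ can be checked numerically. The sketch below assumes that the $h_i$ are drawn independently, so that a pair colliding with probability $p$ under one function collides with probability $p^j$ under $g$, and with probability $1 - (1 - p^j)^t$ in at least one of $t$ tables. The function names and the numeric values of $p_1$, $p_2$, $j$, and $t$ are illustrative assumptions.

```python
def collide_under_g(p, j):
    """Probability that two objects agree on all j concatenated hash functions."""
    return p ** j

def collide_in_any_table(p, j, t):
    """Probability that two objects share a bucket in at least one of t tables."""
    return 1.0 - (1.0 - p ** j) ** t

# Hypothetical sensitivity parameters and index settings.
p1, p2 = 0.9, 0.5
j, t = 10, 20

print(collide_under_g(p1, j), collide_under_g(p2, j))
print(collide_in_any_table(p1, j, t), collide_in_any_table(p2, j, t))
```

With these values, a single table lowers the bad collision probability a great deal but also lowers the good one; using several tables brings the good collision probability back close to 1 while the bad one stays small.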
Local similarity hashing methods

It is hard to tune the LSH-Index (i.e., the length of the hash keys) to obtain good effectiveness, because the appropriate hash length depends on the data distribution.

LSH-Forest [4]:
Use of variable-length hash keys.
Long hash keys are indexed in a prefix tree (LSH-Tree).
At search time the key length is varied in order to retrieve a given number of candidate objects.
Candidate objects are retrieved sequentially from a data storage on disk.
Multiple LSH-Trees, i.e., a forest, are used to improve effectiveness.

The PP-Index uses a similar data structure.

[4] M. Bawa, T. Condie, and P. Ganesan, LSH-Forest: self-tuning indexes for similarity search, WWW 2005, pages 651-660.
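The variable-length lookup can be sketched with a small in-memory prefix tree. This is only an illustration of the idea, not the LSH-Forest or PP-Index implementation: the class name `PrefixTree`, its methods, and the sample keys are assumptions, and the sequential on-disk storage of candidates is not modelled.

```python
class PrefixTree:
    """Toy prefix tree over fixed-length keys (tuples of symbols)."""

    def __init__(self):
        self.children = {}   # symbol -> child node
        self.objects = []    # object ids whose full key ends at this node

    def insert(self, key, obj_id):
        """Index obj_id under its full-length key."""
        node = self
        for symbol in key:
            node = node.children.setdefault(symbol, PrefixTree())
        node.objects.append(obj_id)

    def subtree_objects(self):
        """All object ids stored at or below this node."""
        found = list(self.objects)
        for symbol in sorted(self.children):
            found.extend(self.children[symbol].subtree_objects())
        return found

    def candidates(self, key, min_candidates):
        """Start from the longest indexed prefix of key and shorten it
        until the matching subtree holds at least min_candidates objects."""
        path = [self]
        node = self
        for symbol in key:
            node = node.children.get(symbol)
            if node is None:
                break
            path.append(node)
        for node in reversed(path):
            found = node.subtree_objects()
            if len(found) >= min_candidates:
                return found
        return self.subtree_objects()  # fewer objects than requested overall

# Hypothetical keys (they could be hash keys or permutation prefixes).
tree = PrefixTree()
for key, obj_id in [((1, 3, 2), "a"), ((1, 3, 0), "b"),
                    ((1, 4, 3), "c"), ((2, 3, 0), "d")]:
    tree.insert(key, obj_id)

print(tree.candidates((1, 3, 2), min_candidates=1))  # full key is enough
print(tree.candidates((1, 3, 2), min_candidates=2))  # falls back to prefix (1, 3)
print(tree.candidates((1, 3, 2), min_candidates=3))  # falls back to prefix (1,)
```

The key length used at search time is thus chosen per query, driven by how many candidates are requested, rather than fixed when the index is built.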
2. The PP-Index
Subsections: Data structures; Algorithms.