
SIMILARITY SEARCH: The Metric Space Approach



  1. SIMILARITY SEARCH The Metric Space Approach Pavel Zezula, Giuseppe Amato, Vlastislav Dohnal, Michal Batko

  2. Table of Contents
     Part I: Metric searching in a nutshell
       - Foundations of metric space searching
       - Survey of existing approaches
     Part II: Metric searching in large collections
       - Centralized index structures
       - Approximate similarity search
       - Parallel and distributed indexes
     P. Zezula, G. Amato, V. Dohnal, M. Batko: Similarity Search: The Metric Space Approach Part I, Chapter 1

  3. Approximate similarity search
     - Approximate similarity search overcomes problems of exact similarity search using traditional access methods
       - Moderate improvement of performance with respect to sequential scan
       - Dimensionality curse
     - Exact similarity search returns mathematically precise result sets
       - Similarity is subjective, so in some cases approximate result sets also satisfy the user
     - Approximate similarity search processes a query faster at the price of imprecision in the returned result sets
       - Useful, for instance, in interactive systems, where similarity search is an iterative process and temporary results are used to create a new query
       - Improvements of up to two orders of magnitude

  4. Approximate similarity search
     - Approximation strategies
       - Relaxed pruning conditions: data regions overlapping the query region can be discarded, depending on the specific strategy
       - Early termination of the search algorithm: the search algorithm might stop before all regions have been accessed

  5. Approximate Similarity Search
     1. relative error approximation (pruning condition): range and k-NN search queries
     2. good fraction approximation
     3. small chance improvement approximation
     4. proximity-based approximation
     5. PAC nearest neighbor searching
     6. performance trials

  6. Relative error approximation
     - Let $o^N$ be the nearest neighbour of $q$. If
       $\frac{d(o^A, q)}{d(o^N, q)} \le 1 + \varepsilon$
       then $o^A$ is the $(1+\varepsilon)$-approximate nearest neighbor of $q$
     - This can be generalized to the $k$-th nearest neighbor:
       $\frac{d(o_k^A, q)}{d(o_k^N, q)} \le 1 + \varepsilon$
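
The definition on this slide can be checked mechanically once the exact and approximate distances are known. The following is a minimal sketch, not code from the slides; the function names `is_eps_approximate` and `is_eps_approximate_kth` are hypothetical, and the inequality is tested in its multiplied-out form so that an exact distance of zero does not cause a division error.

```python
def is_eps_approximate(d_approx, d_exact, eps):
    """True if d(o^A, q) <= (1 + eps) * d(o^N, q)."""
    return d_approx <= (1.0 + eps) * d_exact


def is_eps_approximate_kth(approx_dists, exact_dists, k, eps):
    """Apply the same test to the k-th nearest neighbor (k is 1-based).

    Both lists hold distances from the query; they are sorted here so the
    k-th smallest distance of each result is compared.
    """
    d_a = sorted(approx_dists)[k - 1]
    d_n = sorted(exact_dists)[k - 1]
    return is_eps_approximate(d_a, d_n, eps)
```

For example, with `eps = 0.2` an approximate neighbor at distance 1.1 is acceptable when the true nearest neighbor lies at distance 1.0, while one at distance 1.3 is not.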

  7. Relative error approximation
     - Exact pruning strategy: a data region with centre $p$ and radius $r_p$ is pruned for a query region with centre $q$ and radius $r_q$ when
       $d(q, p) > r_q + r_p$
     [Figure: data region (centre p, radius r_p) and query region (centre q, radius r_q).]

  8. Relative error approximation
     - Approximate pruning strategy: the query radius is reduced by the factor $1+\varepsilon$, so the data region is pruned when
       $d(q, p) > \frac{r_q}{1+\varepsilon} + r_p$
     [Figure: data region (centre p, radius r_p) and query region (centre q, radius r_q), with the reduced query radius r_q/(1+ε) marked.]
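
The exact and approximate pruning tests differ only in the query radius that is used. A minimal sketch of both, assuming ball regions described by a centre and a radius; the function names are hypothetical and not taken from the slides:

```python
def can_prune_exact(d_qp, r_q, r_p):
    """Exact test: the two ball regions do not intersect."""
    return d_qp > r_q + r_p


def can_prune_relaxed(d_qp, r_q, r_p, eps):
    """Relaxed test: the query radius is shrunk to r_q / (1 + eps).

    With eps = 0 this is identical to the exact test; larger eps prunes
    regions that overlap only the outer shell of the query region.
    """
    return d_qp > r_q / (1.0 + eps) + r_p
```

With `d_qp = 9`, `r_q = 6` and `r_p = 4` the regions overlap, so the exact test keeps the data region, whereas the relaxed test with `eps = 0.5` compares 9 against 6/1.5 + 4 = 8 and prunes it.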

  9. Approximate Similarity Search
     1. relative error approximation (pruning condition): range and k-NN search queries
     2. good fraction approximation (stop condition): k-NN search queries
     3. small chance improvement approximation
     4. proximity-based approximation
     5. PAC nearest neighbor searching
     6. performance trials

  10. Good fraction approximation
      - The k-NN algorithm determines the final result by reducing the distances of the current result set
      - When the current result set belongs to a specific fraction of the objects closest to the query, the approximate algorithm stops
      - Example: stop when the current result set belongs to the 10% of the objects closest to the query

  11. Good fraction approximation
      - For this strategy we use the distance distribution defined as
        $F_q(x) = \Pr\{d(o, q) \le x\}$
      - The distance distribution $F_q(x)$ gives the probability that the distance of a random object $o$ from $q$ is at most $x$
      - In probabilistic terms, $F_q(x)$ is therefore the fraction of the database made up of the objects whose distance from $q$ is at most $x$

  12. Good fraction approximation
      [Figure: $F_q(x)$ plotted for distances x from 0 to 10000. The value of $F_q$ at $d(q, o_k)$ is the fraction of the data set whose distances from q are smaller than $d(q, o_k)$.]

  13. Good fraction approximation
      - When $F_q(d(o_k, q)) < \rho$, all objects of the current result set belong to the fraction $\rho$ of the dataset closest to $q$
      [Figure: query object q and the current k-th result o_k.]

  14. Good fraction approximation
      - $F_q(x)$ is difficult to handle, since it would have to be computed for every possible query
      - It was proven that the overall distance distribution $F(x)$, defined as
        $F(x) = \Pr\{d(o_1, o_2) \le x\}$
        can be used in practice instead of $F_q(x)$, since the two have statistically the same behaviour
      - $F(x)$ can be easily estimated as a discrete function and easily maintained in main memory
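
One way to keep a discrete estimate of $F(x)$ in main memory is to store a sorted sample of pairwise distances and read the distribution off it. The sketch below illustrates that idea together with the good fraction stop test; it is an assumption about how the estimate could be built, not the authors' procedure, and the names `sample_distances`, `overall_distribution` and `good_fraction_stop` are hypothetical.

```python
import bisect
import random


def sample_distances(objects, dist, n_pairs=1000, seed=0):
    """Sorted distances of n_pairs randomly drawn object pairs."""
    rng = random.Random(seed)
    return sorted(dist(*rng.sample(objects, 2)) for _ in range(n_pairs))


def overall_distribution(sorted_sample, x):
    """Empirical F(x): fraction of sampled distances that are <= x."""
    return bisect.bisect_right(sorted_sample, x) / len(sorted_sample)


def good_fraction_stop(sorted_sample, d_k, rho):
    """Stop when the current k-th distance d_k lies within the closest
    fraction rho of the data set, using F(x) in place of F_q(x)."""
    return overall_distribution(sorted_sample, d_k) < rho
```

A k-NN search would call `good_fraction_stop` each time the distance of its current k-th result changes, for example with `rho = 0.1` to mirror the 10% example on slide 10.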

  15. Approximate Similarity Search
      1. relative error approximation (pruning condition): range and k-NN search queries
      2. good fraction approximation (stop condition): k-NN search queries
      3. small chance improvement approximation (stop condition): k-NN search queries
      4. proximity-based approximation
      5. PAC nearest neighbor searching
      6. performance trials

  16. Small chance improvement approximation
      - The M-Tree's k-NN algorithm determines the final result by improving the current result set
      - At each step of the algorithm the temporary result is improved and the distance of the k-th element decreases
      - When the improvement of the temporary result set slows down, the algorithm can stop

  17. Small chance improvement approximation
      [Figure: $f(x) = d(q, o_k^A)$, the distance of the current k-th result from the query, plotted against the iteration x. Axes: Distance (0.2 to 0.38), Iteration (0 to 1500).]

  18. Small chance improvement approximation
      - The function $f(x)$ is not known a priori
      - A regression curve $\varphi(x)$, which approximates $f(x)$, is computed using the least squares method while the algorithm proceeds
      - Through the derivative of $\varphi(x)$ it is possible to decide when the algorithm has to stop

  19. Small chance improvement approximation
      - The regression curve has the following form
        $\varphi(x) = c_1 \varphi_1(x) + c_2$
        where $c_1$ and $c_2$ are such that
        $\sum_{i=0}^{j} \left( c_1 \varphi_1(i) + c_2 - f(i) \right)^2$
        is minimum
      - We have used both $\varphi_1(x) = \ln(x)$ and $\varphi_1(x) = 1/x$
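
Because the curve is linear in $c_1$ and $c_2$, the least squares fit has a closed form. The sketch below fits either basis function and evaluates the derivative used for the stop decision. It rests on several assumptions that are not in the slides: iterations are numbered from 1 so that ln(x) and 1/x are defined, at least two observations are available, the stop rule compares the absolute slope with a hypothetical `threshold`, and all function names are illustrative.

```python
import math


def fit_regression(f_values, phi1):
    """Least squares fit of phi(x) = c1 * phi1(x) + c2.

    f_values[i - 1] is the observed k-th distance at iteration i = 1, 2, ...
    """
    xs = [phi1(i) for i in range(1, len(f_values) + 1)]
    n = len(f_values)
    mean_x = sum(xs) / n
    mean_y = sum(f_values) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, f_values))
    c1 = sxy / sxx
    c2 = mean_y - c1 * mean_x
    return c1, c2


def slope_logarithmic(c1, x):
    """Derivative of c1 * ln(x) + c2."""
    return c1 / x


def slope_hyperbolic(c1, x):
    """Derivative of c1 / x + c2."""
    return -c1 / (x * x)


def small_chance_stop(c1, x, slope, threshold):
    """Stop when the fitted curve has become nearly flat at iteration x."""
    return abs(slope(c1, x)) < threshold
```

In use, the fit would be refreshed as iterations accumulate, for example `c1, c2 = fit_regression(history, lambda x: 1.0 / x)` followed by `small_chance_stop(c1, len(history), slope_hyperbolic, threshold)`.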

  20. Regression curves
      [Figure: Distance (0.2 to 0.4) against Iteration (0 to 1500), showing the observed distance together with the hyperbolic and logarithmic regression curves.]

  21. Approximate Similarity Search
      1. relative error approximation (pruning condition): range and k-NN search queries
      2. good fraction approximation (stop condition): k-NN search queries
      3. small chance improvement approximation (stop condition): k-NN search queries
      4. proximity-based approximation (pruning condition): range and k-NN search queries
      5. PAC nearest neighbor searching
      6. performance trials

  22. Proximity-based approximation
      - Regions whose probability of containing qualifying objects is below a certain threshold are pruned even if they overlap the query region
      - Proximity between two regions is defined as the probability that a randomly chosen object appears in both regions
      - This resulted in a performance increase of two orders of magnitude for both range queries and nearest neighbour queries
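
The definition of proximity suggests a simple way to approximate it: count how often objects from a data sample fall inside both regions. The sketch below does this for ball regions and applies the threshold test. The slides do not say how proximity is actually computed, so the sampling approach, the ball-region representation and the names `estimate_proximity` and `prune_by_proximity` are all assumptions made for illustration.

```python
def estimate_proximity(sample, dist, p1, r1, p2, r2):
    """Fraction of sampled objects lying in both ball regions.

    Region i is the set of objects o with dist(o, p_i) <= r_i.
    """
    in_both = sum(
        1 for o in sample if dist(o, p1) <= r1 and dist(o, p2) <= r2
    )
    return in_both / len(sample)


def prune_by_proximity(proximity, threshold):
    """Discard a data region whose proximity to the query region is too low,
    even though the two regions may overlap."""
    return proximity < threshold
```

For the integers 0 to 9 with absolute difference as the distance, the regions (centre 2, radius 2) and (centre 5, radius 1) share only the object 4, so the estimated proximity is 0.1 and a threshold of 0.2 would prune the region.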

  23. Proximity-based approximation
      [Figure: query region q and data regions R1, R2 and R3.]
