
SIMILARITY SEARCH: The Metric Space Approach



  1. SIMILARITY SEARCH The Metric Space Approach Pavel Zezula, Giuseppe Amato, Vlastislav Dohnal, Michal Batko

  2. Table of Contents
     Part I: Metric searching in a nutshell
       - Foundations of metric space searching
       - Survey of existing approaches
     Part II: Metric searching in large collections
       - Centralized index structures
       - Approximate similarity search
       - Parallel and distributed indexes
     P. Zezula, G. Amato, V. Dohnal, M. Batko: Similarity Search: The Metric Space Approach, Part I, Chapter 2

  3. Survey of existing approaches
     1. ball partitioning methods
     2. generalized hyper-plane partitioning approaches
     3. exploiting pre-computed distances
     4. hybrid indexing approaches
     5. approximated techniques

  4. Survey of existing approaches
     1. ball partitioning methods
        1. Burkhard-Keller Tree
        2. Fixed Queries Tree
        3. Fixed Queries Array
        4. Vantage Point Tree
           1. Multi-Way Vantage Point Tree
        5. Excluded Middle Vantage Point Forest
     2. generalized hyper-plane partitioning approaches
     3. exploiting pre-computed distances
     4. hybrid indexing approaches
     5. approximated techniques

  5. Burkhard-Keller Tree (BKT) [BK73]
     - Applicable to discrete distance functions only.
     - Recursively divides a given dataset X.
     - Choose an arbitrary point p_j ∈ X and form subsets X_i = { o ∈ X | d(o, p_j) = i } for each distance i ≥ 0.
     - For each X_i, create a sub-tree of p_j; empty subsets are ignored.
     [Figure: pivot p_j with the subsets X_2, X_3, X_4 and the corresponding tree]

  6. BKT: Range Query
     Given a query R(q, r):
     - traverse the tree starting from the root
     - in each internal node p_j, do:
       - if d(q, p_j) ≤ r, report p_j on output
       - if max{ d(q, p_j) - r, 0 } ≤ i ≤ d(q, p_j) + r, enter child i
     [Figure: a range query R(q, r) over pivots p_1, p_2, p_3 and the branches it enters]
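The construction and range-query rules of the two BKT slides can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation: the nested-tuple node layout, the choice of the first object as the "arbitrary" pivot, and the helper names `bkt_build`, `bkt_range` and `hamming` are all assumptions made for the example.

```python
from collections import defaultdict


def hamming(a, b):
    """Discrete metric on equal-length strings: number of differing positions."""
    return sum(x != y for x, y in zip(a, b))


def bkt_build(objects, dist):
    """Build a BK-tree as nested (pivot, {distance: subtree}) tuples."""
    if not objects:
        return None
    pivot, rest = objects[0], objects[1:]  # assumed pivot choice: first object
    buckets = defaultdict(list)
    for o in rest:
        buckets[dist(o, pivot)].append(o)  # X_i = {o | d(o, p_j) = i}
    # Empty subsets never show up in `buckets`, so they are ignored implicitly.
    return (pivot, {i: bkt_build(xs, dist) for i, xs in buckets.items()})


def bkt_range(node, q, r, dist, out=None):
    """Collect every object o with d(q, o) <= r."""
    out = [] if out is None else out
    if node is None:
        return out
    pivot, children = node
    d = dist(q, pivot)
    if d <= r:
        out.append(pivot)
    for i, child in children.items():
        # Enter child i only if max{d - r, 0} <= i <= d + r.
        if max(d - r, 0) <= i <= d + r:
            bkt_range(child, q, r, dist, out)
    return out


words = ["book", "boon", "cook", "cake", "cape", "bake"]
tree = bkt_build(words, hamming)
hits = sorted(bkt_range(tree, "took", 1, hamming))
print(hits)
```

With the Hamming distance, the query `R("took", 1)` only descends into the branch at distance 1 from the root pivot; the branches at distances 3 and 4 are pruned without further distance computations.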

  7. Fixed Queries Tree (FQT)
     - modification of BKT
     - each level has a single pivot
     - all objects stored in leaves
     - during search, distance computations are saved
     - usually more branches are accessed, but every node on a level shares the same single distance computation
     [Figure: an FQT with pivot p_1 on the first level and pivot p_2 on the second, queried with R(q, r)]
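The effect of sharing one pivot per level can be illustrated with a short sketch. It assumes a fixed pivot list supplied by the caller and a hypothetical leaf capacity `bucket`; neither detail comes from the slides. The query caches d(q, pivot) per level, so visiting more branches on a level does not cost additional pivot distance computations.

```python
from collections import defaultdict


def fqt_build(objects, pivots, dist, level=0, bucket=2):
    """Build an FQT: one pivot per level, all objects stored in leaves."""
    if level == len(pivots) or len(objects) <= bucket:
        return ("leaf", list(objects))
    groups = defaultdict(list)
    for o in objects:
        groups[dist(o, pivots[level])].append(o)
    children = {i: fqt_build(g, pivots, dist, level + 1, bucket)
                for i, g in groups.items()}
    return ("node", children)


def fqt_range(tree, q, r, pivots, dist):
    """Range query R(q, r); d(q, pivot) is computed at most once per level."""
    query_dist = {}  # level -> d(q, pivots[level])
    out = []

    def visit(node, level):
        kind, payload = node
        if kind == "leaf":
            out.extend(o for o in payload if dist(q, o) <= r)
            return
        if level not in query_dist:
            query_dist[level] = dist(q, pivots[level])
        d = query_dist[level]
        for i, child in payload.items():
            if max(d - r, 0) <= i <= d + r:
                visit(child, level + 1)

    visit(tree, 0)
    return out


def abs_diff(a, b):
    """Discrete metric on integers."""
    return abs(a - b)


data = list(range(0, 40, 3))
fqt = fqt_build(data, pivots=[0, 50], dist=abs_diff)
fqt_hits = sorted(fqt_range(fqt, 10, 4, [0, 50], abs_diff))
print(fqt_hits)
```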

  8. Fixed-Height FQT (FHFQT)
     - extension of FQT
     - all leaf nodes at the same level
     - increased filtering using more routing objects
     - extended tree depth does not typically introduce further computations
     [Figure: the same objects indexed by an FQT and by an FHFQT, with pivots p_1 and p_2]

  9. Fixed Queries Array (FQA)
     - based on FHFQT
     - an h-level tree is transformed to an array of paths
     - every leaf node is represented with a path from the root node
     - each path is encoded as h values of distance
     - a search algorithm turns to a binary search in array intervals
     [Figure: an FHFQT with pivots p_1 and p_2 and the equivalent array of distance paths]
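One way to read "an array of paths searched by binary search" is sketched below. It is an illustrative assumption, not the published FQA layout: each object is stored with its tuple of h integer pivot distances, the array is sorted lexicographically, and each tree level is replaced by `bisect_left` calls on the interval that shares the current path prefix. The names `fqa_build` and `fqa_range` are hypothetical.

```python
from bisect import bisect_left


def fqa_build(objects, pivots, dist):
    """Encode each object as its path of h distances and sort the paths."""
    rows = [(tuple(dist(o, p) for p in pivots), o) for o in objects]
    rows.sort(key=lambda row: row[0])
    return rows


def fqa_range(rows, q, r, pivots, dist):
    """Range query R(q, r) using binary search in array intervals.

    Assumes integer-valued (discrete) distances.
    """
    keys = [key for key, _ in rows]
    qd = [dist(q, p) for p in pivots]
    out = []

    def search(lo, hi, prefix):
        level = len(prefix)
        if level == len(pivots):
            out.extend(o for _, o in rows[lo:hi] if dist(q, o) <= r)
            return
        a, b = max(qd[level] - r, 0), qd[level] + r
        # Sub-interval whose distance at this level lies in [a, b].
        start = bisect_left(keys, prefix + (a,), lo, hi)
        stop = bisect_left(keys, prefix + (b + 1,), lo, hi)
        while start < stop:
            v = keys[start][level]
            end = bisect_left(keys, prefix + (v + 1,), start, stop)
            search(start, end, prefix + (v,))  # one "child" of the implicit tree
            start = end

    search(0, len(rows), ())
    return out


def abs_diff(a, b):
    return abs(a - b)


data = list(range(0, 40, 3))
fqa = fqa_build(data, [0, 20], abs_diff)
fqa_hits = sorted(fqa_range(fqa, 10, 4, [0, 20], abs_diff))
print(fqa_hits)
```

The array needs no child pointers; the tree structure is implied by the sort order of the paths.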

  10. Vantage Point Tree (VPT)
      - uses ball partitioning
      - recursively divides given data set X
      - choose vantage point p ∈ X, compute the median m of the distances d(x, p)
      - S_1 = { x ∈ X - {p} | d(x, p) ≤ m }
      - S_2 = { x ∈ X - {p} | d(x, p) ≥ m }
      - the equality sign ensures balancing
      [Figure: ball partitioning around p_1 (radius m_1) and p_2 (radius m_2), with the resulting tree over S_1,1 and S_1,2]

  11. VPT (cont.)
      - One or more objects can be accommodated in leaves.
      - VP tree is a balanced binary tree.
      - Static structure.
      - Pivots p_1, p_2 and p_3 belong to the database!
      - In the following, we assume just one object in a leaf.
      [Figure: VP tree with pivots p_1, p_2, p_3 (medians m_1, m_2, m_3) and leaf objects o_1 to o_12]

  12. VPT: Range Search
      Given a query R(q, r):
      - traverse the tree starting from its root
      - in each internal node (p_i, m_i), do:
        - if d(q, p_i) ≤ r, report p_i on output
        - if d(q, p_i) - r ≤ m_i, search the left sub-tree (a, b)
        - if d(q, p_i) + r ≥ m_i, search the right sub-tree (b)
      [Figure: (a) the query ball lies inside the ball of radius m_i around p_i; (b) the query ball crosses its boundary]
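A minimal sketch of ball partitioning and the range search rules above, under a few stated assumptions: the first object serves as the vantage point, nodes are plain `(pivot, median, left, right)` tuples, and the Manhattan distance `l1` on grid points stands in for an arbitrary metric. Splitting the distance-sorted list in the middle gives S_1 with d ≤ m and S_2 with d ≥ m, which is how the equality sign keeps the tree balanced.

```python
def l1(a, b):
    """Manhattan distance, an integer-valued metric on 2-D grid points."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def vpt_build(objects, dist):
    """Build a VP tree as nested (pivot, median, left, right) tuples."""
    if not objects:
        return None
    p, rest = objects[0], objects[1:]  # assumed vantage point: first object
    if not rest:
        return (p, 0, None, None)
    rest = sorted(rest, key=lambda x: dist(x, p))
    half = len(rest) // 2
    m = dist(rest[half], p)  # median distance
    # rest[:half] has d <= m, rest[half:] has d >= m.
    return (p, m, vpt_build(rest[:half], dist), vpt_build(rest[half:], dist))


def vpt_range(node, q, r, dist, out=None):
    """Collect every object o with d(q, o) <= r."""
    out = [] if out is None else out
    if node is None:
        return out
    p, m, left, right = node
    d = dist(q, p)
    if d <= r:
        out.append(p)
    if d - r <= m:  # query ball reaches inside the median ball
        vpt_range(left, q, r, dist, out)
    if d + r >= m:  # query ball reaches outside the median ball
        vpt_range(right, q, r, dist, out)
    return out


points = [(x, y) for x in range(6) for y in range(6)]
vp_tree = vpt_build(points, l1)
found = sorted(vpt_range(vp_tree, (2, 3), 2, l1))
print(len(found), found)
```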

  13. VPT: k-NN Search
      Given a query NN(q):
      - initialization: d_NN = d_max, NN = nil
      - traverse the tree starting from its root
      - in each internal node (p_i, m_i), do:
        - if d(q, p_i) ≤ d_NN, set d_NN = d(q, p_i), NN = p_i
        - if d(q, p_i) - d_NN ≤ m_i, search the left sub-tree
        - if d(q, p_i) + d_NN ≥ m_i, search the right sub-tree
      - k-NN search only requires the arrays d_NN[k] and NN[k].
      - The arrays are kept ordered with respect to the distance to q.
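The k-NN variant can be sketched by replacing the single pair (d_NN, NN) with one ordered list of at most k `(distance, object)` pairs; the distance of the k-th entry then plays the role of the shrinking query radius. The tree layout, the `l1` metric and the function names are assumptions carried over for illustration, and the sketch visits sub-trees in a fixed left-then-right order rather than using any smarter heuristic.

```python
import math
from bisect import insort


def l1(a, b):
    """Manhattan distance, an integer-valued metric on 2-D grid points."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def vpt_build(objects, dist):
    """Build a VP tree as nested (pivot, median, left, right) tuples."""
    if not objects:
        return None
    p, rest = objects[0], objects[1:]
    if not rest:
        return (p, 0, None, None)
    rest = sorted(rest, key=lambda x: dist(x, p))
    half = len(rest) // 2
    m = dist(rest[half], p)
    return (p, m, vpt_build(rest[:half], dist), vpt_build(rest[half:], dist))


def vpt_knn(root, q, k, dist):
    """Return the k nearest neighbours of q as an ordered list of (d, object)."""
    best = []  # kept ordered by distance to q, never longer than k

    def radius():
        # d_NN: distance of the current k-th neighbour, "d_max" until k are known.
        return best[-1][0] if len(best) == k else math.inf

    def visit(node):
        if node is None:
            return
        p, m, left, right = node
        d = dist(q, p)
        if d <= radius():
            insort(best, (d, p))
            del best[k:]
        if d - radius() <= m:
            visit(left)
        if d + radius() >= m:
            visit(right)

    visit(root)
    return best


points = [(x, y) for x in range(6) for y in range(6)]
knn_tree = vpt_build(points, l1)
neighbours = vpt_knn(knn_tree, (2, 3), 5, l1)
print(neighbours)
```

Because ties at the k-th distance can be broken either way, comparing the returned distances (rather than the objects) against a brute-force scan is the safer check.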

  14. Multi-Way Vantage Point Tree
      - inherits all principles from VPT, but partitioning is modified
      - m-ary balanced tree
      - applies multi-way ball partitioning
      [Figure: pivot p_1 with cut-off radii m_1, m_2, m_3 splitting the data into S_1,1, S_1,2, S_1,3, S_1,4]
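Multi-way ball partitioning can be illustrated as a single split step: sort the objects by their distance to the pivot and cut the ordering into m groups of near-equal size, remembering the distance at each cut as a cut-off radius. The function name and the equal-count splitting rule are assumptions for this sketch; the slide itself only states that the tree is m-ary and balanced.

```python
def multiway_ball_partition(objects, pivot, dist, m):
    """Split objects into m groups of near-equal size by distance to the pivot.

    Returns (parts, cutoffs), where cutoffs holds the m - 1 boundary radii.
    """
    ranked = sorted(objects, key=lambda x: dist(x, pivot))
    n = len(ranked)
    bounds = [i * n // m for i in range(m + 1)]
    parts = [ranked[bounds[i]:bounds[i + 1]] for i in range(m)]
    cutoffs = [dist(ranked[bounds[i]], pivot)
               for i in range(1, m) if bounds[i] < n]
    return parts, cutoffs


def abs_diff(a, b):
    return abs(a - b)


parts, cutoffs = multiway_ball_partition(list(range(1, 13)), 0, abs_diff, 4)
print(parts)
print(cutoffs)
```

Applying this step recursively to each group, with a fresh pivot per group, would yield the m-ary tree.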

  15. Vantage Point Forest (VPF)
      - a forest of binary trees
      - uses excluded middle partitioning
      - middle area is excluded from the process of tree building
      [Figure: pivot p_i with median radius m_i and an excluded band of width 2ρ around it]

  16. VPF (cont.)
      - given data set X is recursively divided and a binary tree is built
      - excluded middle areas are used for building another binary tree
      [Figure: the first tree built on X with pivots p_1, p_2, p_3; the excluded sets M_1 + M_2 + M_3 form the data set of the next tree with pivots p'_1, p'_2, p'_3]

  17. VPF: Range Search
      Given a query R(q, r), with ρ denoting half the width of the excluded middle band:
      - start with the first tree
      - traverse the tree starting from its root
      - in each internal node (p_i, m_i), do:
        - if d(q, p_i) ≤ r, report p_i
        - if d(q, p_i) - r ≤ m_i - ρ, search the left sub-tree;
          if also d(q, p_i) + r ≥ m_i - ρ, search the next tree as well
        - if d(q, p_i) + r ≥ m_i + ρ, search the right sub-tree;
          if also d(q, p_i) - r ≤ m_i + ρ, search the next tree as well
        - if d(q, p_i) - r ≥ m_i - ρ and d(q, p_i) + r ≤ m_i + ρ, search only the next tree

  18. VPF: Range Search (cont.)
      - Query intersects all partitions: search both sub-trees and search the next tree.
      - Query collides only with the exclusion: search just the next tree.
      [Figure: the two cases, each showing p_i, m_i, the excluded band of width 2ρ, and the query ball around q]
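The forest construction and its range search can be sketched as follows. The assumptions are spelled out because the slides leave them open: `rho` is half the width of the excluded band, the first object of each data set is taken as the pivot, each tree collects its excluded objects into the data set of the next tree, and a tree search returns a flag telling whether the query touched any excluded band (which is when the next tree has to be searched).

```python
def vpf_build_tree(objects, dist, rho, excluded):
    """Build one tree; objects falling into an excluded band go to `excluded`."""
    if not objects:
        return None
    p, rest = objects[0], objects[1:]
    if not rest:
        return (p, None, None, None)
    ds = sorted(dist(x, p) for x in rest)
    m = ds[len(ds) // 2]
    left = [x for x in rest if dist(x, p) <= m - rho]
    right = [x for x in rest if dist(x, p) > m + rho]
    excluded.extend(x for x in rest if m - rho < dist(x, p) <= m + rho)
    return (p, m,
            vpf_build_tree(left, dist, rho, excluded),
            vpf_build_tree(right, dist, rho, excluded))


def vpf_build(objects, dist, rho):
    """Build the forest: each tree indexes what the previous one excluded."""
    forest = []
    while objects:
        excluded = []
        forest.append(vpf_build_tree(objects, dist, rho, excluded))
        objects = excluded
    return forest


def vpf_search_tree(node, q, r, dist, rho, out):
    """Search one tree; return True if the next tree must be searched too."""
    if node is None:
        return False
    p, m, left, right = node
    d = dist(q, p)
    if d <= r:
        out.append(p)
    if m is None:
        return False
    # The query ball overlaps the excluded band around m.
    need_next = d + r >= m - rho and d - r <= m + rho
    if d - r <= m - rho:
        need_next = vpf_search_tree(left, q, r, dist, rho, out) or need_next
    if d + r >= m + rho:
        need_next = vpf_search_tree(right, q, r, dist, rho, out) or need_next
    return need_next


def vpf_range(forest, q, r, dist, rho):
    out = []
    for tree in forest:
        if not vpf_search_tree(tree, q, r, dist, rho, out):
            break  # no excluded band was touched: later trees cannot match
    return out


def abs_diff(a, b):
    return abs(a - b)


numbers = [(7 * i) % 61 for i in range(40)]  # 40 distinct integers
forest = vpf_build(numbers, abs_diff, rho=2)
vpf_hits = sorted(vpf_range(forest, 30, 5, abs_diff, rho=2))
print(len(forest), vpf_hits)
```

A query whose radius is at most ρ can never fall on both sides of a band at once, so within one tree it follows a single path; the price is the extra trees.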

  19. Survey of existing approaches
      1. ball partitioning methods
      2. generalized hyper-plane partitioning approaches
         1. Bisector Tree
         2. Generalized Hyper-plane Tree
      3. exploiting pre-computed distances
      4. hybrid indexing approaches
      5. approximated techniques

  20. Bisector Tree (BT)
      - Applies generalized hyper-plane partitioning.
      - Recursively divides a given dataset X.
      - Choose two arbitrary points p_1, p_2 ∈ X.
      - Form subsets from the remaining objects:
        S_1 = { o ∈ X | d(o, p_1) ≤ d(o, p_2) }
        S_2 = { o ∈ X | d(o, p_1) > d(o, p_2) }
      - Covering radii r_1^c and r_2^c are established.
      - The balls can intersect!
      [Figure: pivots p_1 and p_2 with covering radii r_1^c and r_2^c]

  21. BT: Range Query
      Given a query R(q, r):
      - traverse the tree starting from its root
      - in each internal node <p_i, p_j>, do for each x ∈ {i, j}:
        - if d(q, p_x) ≤ r, report p_x on output
        - if d(q, p_x) - r ≤ r_x^c, enter the child of p_x
      [Figure: query ball R(q, r) against the covering balls of p_i (radius r_i^c) and p_j (radius r_j^c)]
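A compact sketch of the bisector tree and its range query, assuming the first two objects of each data set act as the "arbitrary" pivots and that a node is a list of `(pivot, covering_radius, child)` entries. The names `bt_build` and `bt_range` and the `l1` metric are illustrative choices, not part of the slides.

```python
def l1(a, b):
    """Manhattan distance, an integer-valued metric on 2-D grid points."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def bt_build(objects, dist):
    """Build a bisector tree; each node lists (pivot, covering radius, child)."""
    if not objects:
        return None
    if len(objects) == 1:
        return [(objects[0], 0, None)]
    p1, p2, rest = objects[0], objects[1], objects[2:]
    s1 = [o for o in rest if dist(o, p1) <= dist(o, p2)]
    s2 = [o for o in rest if dist(o, p1) > dist(o, p2)]
    # Covering radius: distance from the pivot to its farthest assigned object.
    r1 = max((dist(o, p1) for o in s1), default=0)
    r2 = max((dist(o, p2) for o in s2), default=0)
    return [(p1, r1, bt_build(s1, dist)), (p2, r2, bt_build(s2, dist))]


def bt_range(node, q, r, dist, out=None):
    """Collect every object o with d(q, o) <= r."""
    out = [] if out is None else out
    if node is None:
        return out
    for p, covering, child in node:
        d = dist(q, p)
        if d <= r:
            out.append(p)
        if d - r <= covering:  # query ball intersects the covering ball of p
            bt_range(child, q, r, dist, out)
    return out


points = [(x, y) for x in range(6) for y in range(6)]
bisector_tree = bt_build(points, l1)
bt_hits = sorted(bt_range(bisector_tree, (4, 1), 2, l1))
print(bt_hits)
```

Since the two covering balls of a node may intersect, a query can be forced into both children even when it lies clearly on one side of the bisecting hyper-plane.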

  22. Monotonous Bisector Tree (MBT)
      - A variant of Bisector Tree.
      - Child nodes inherit one pivot from the parent.
      - For convenience, no covering radii are shown.
      [Figure: the same data partitioned by a Bisector Tree (pivots p_1 to p_6) and by a Monotonous Bisector Tree (pivots p_1 to p_4)]

  23. MBT (cont.)
      - Fewer pivots are used, which means fewer distance evaluations during query processing and more objects in leaves.
      [Figure: tree structures of the Bisector Tree (pivots p_1 to p_6) and the Monotonous Bisector Tree (pivots p_1 to p_4, with p_1 and p_2 inherited by the children)]

  24. Voronoi Tree
      - Extension of Bisector Tree.
      - Uses more pivots in each internal node.
      - Usually three pivots.
      [Figure: three pivots p_1, p_2, p_3 with covering radii r_1^c, r_2^c, r_3^c]
