SIMILARITY SEARCH: The Metric Space Approach
Pavel Zezula, Giuseppe Amato, Vlastislav Dohnal, Michal Batko

Table of Contents
Part I: Metric searching in a nutshell
- Foundations of metric space searching
- Survey of existing approaches
Part II: Metric searching in large collections
- Centralized index structures
- Approximate similarity search
- Parallel and distributed indexes
P. Zezula, G. Amato, V. Dohnal, M. Batko: Similarity Search: The Metric Space Approach Part I, Chapter 2

Survey of existing approaches
1. ball partitioning methods
2. generalized hyper-plane partitioning approaches
3. exploiting pre-computed distances
4. hybrid indexing approaches
5. approximated techniques

Survey of existing approaches
1. ball partitioning methods
   1. Burkhard-Keller Tree
   2. Fixed Queries Tree
   3. Fixed Queries Array
   4. Vantage Point Tree
      1. Multi-Way Vantage Point Tree
   5. Excluded Middle Vantage Point Forest
2. generalized hyper-plane partitioning approaches
3. exploiting pre-computed distances
4. hybrid indexing approaches
5. approximated techniques

Burkhard-Keller Tree (BKT) [BK73]
- Applicable to discrete distance functions only
- Recursively divides a given dataset X
- Choose an arbitrary point p_j ∈ X and form the subsets X_i = { o ∈ X : d(o, p_j) = i } for each distance i ≥ 0
- For each X_i, create a sub-tree of p_j
- Empty subsets are ignored

BKT: Range Query
Given a query R(q, r):
- traverse the tree starting from the root
- in each internal node p_j, do:
  - if d(q, p_j) ≤ r, report p_j on output
  - if max{ d(q, p_j) - r, 0 } ≤ i ≤ d(q, p_j) + r, enter child i

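The construction and range query above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the names `BKNode`, `bkt_build`, `bkt_range` and `hamming` are hypothetical, the first object of each subset is taken as the arbitrary pivot, and the Hamming distance on equal-length strings stands in for a discrete metric.

```python
class BKNode:
    """One BK-tree node: a pivot plus one child per occurring distance value."""
    def __init__(self, pivot):
        self.pivot = pivot
        self.children = {}  # distance i -> sub-tree built on X_i


def bkt_build(objects, dist):
    """Recursively build a BK-tree; the first object is the (arbitrary) pivot."""
    if not objects:
        return None
    pivot, rest = objects[0], objects[1:]
    node = BKNode(pivot)
    subsets = {}
    for o in rest:
        # X_i collects the objects at distance exactly i from the pivot.
        subsets.setdefault(dist(o, pivot), []).append(o)
    for i, subset in subsets.items():  # empty subsets never appear
        node.children[i] = bkt_build(subset, dist)
    return node


def bkt_range(node, q, r, dist, out=None):
    """Collect all objects o with dist(q, o) <= r."""
    if out is None:
        out = []
    if node is None:
        return out
    d = dist(q, node.pivot)
    if d <= r:
        out.append(node.pivot)
    for i, child in node.children.items():
        # Enter child i only if max(d - r, 0) <= i <= d + r.
        if max(d - r, 0) <= i <= d + r:
            bkt_range(child, q, r, dist, out)
    return out


def hamming(a, b):
    """Hamming distance for equal-length strings (a discrete metric)."""
    return sum(x != y for x, y in zip(a, b))


words = ["book", "back", "bock", "cook", "cake", "boon"]
tree = bkt_build(words, hamming)
print(sorted(bkt_range(tree, "book", 1, hamming)))
# -> ['bock', 'book', 'boon', 'cook']
```
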
Fixed Queries Tree (FQT)
- modification of BKT
- each level has a single pivot
- all objects are stored in leaves
- distance computations are saved during search: usually more branches are accessed, but each level needs only one distance computation

Fixed-Height FQT (FHFQT)
- extension of FQT
- all leaf nodes are at the same level
- increased filtering using more routing objects
- extended tree depth does not typically introduce further computations
[Figure: an FQT and the corresponding FHFQT]

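A minimal sketch of a fixed-height FQT follows, assuming the pivots are supplied by the caller (one per level) and that leaf candidates are verified with a real distance computation. The names `fhfqt_build` and `fhfqt_range`, the nested-dictionary layout and the Hamming distance are illustrative assumptions, not part of the original slides.

```python
def fhfqt_build(objects, pivots, dist, level=0):
    """Nested dicts keyed by distance to the level's pivot; leaves are buckets."""
    if level == len(pivots):
        return list(objects)  # all leaves sit at depth len(pivots)
    groups = {}
    for o in objects:
        groups.setdefault(dist(o, pivots[level]), []).append(o)
    return {i: fhfqt_build(g, pivots, dist, level + 1) for i, g in groups.items()}


def fhfqt_range(tree, pivots, q, r, dist):
    """Range query R(q, r); only one query-pivot distance is computed per level."""
    qd = [dist(q, p) for p in pivots]  # shared by every node of a level
    result = []
    stack = [(tree, 0)]
    while stack:
        node, level = stack.pop()
        if level == len(pivots):
            # Leaf bucket: filtering is not exact, so verify each candidate.
            result.extend(o for o in node if dist(q, o) <= r)
            continue
        for i, child in node.items():
            if abs(qd[level] - i) <= r:  # same test as in the BKT
                stack.append((child, level + 1))
    return result


def hamming(a, b):
    """Hamming distance for equal-length strings (a discrete metric)."""
    return sum(x != y for x, y in zip(a, b))


words = ["book", "back", "bock", "cook", "cake", "boon"]
pivots = ["book", "cake"]  # assumed choice: one pivot per level
tree = fhfqt_build(words, pivots, hamming)
print(sorted(fhfqt_range(tree, pivots, "cock", 1, hamming)))
```

Because every node of a level shares the same pivot, visiting more branches costs no additional query-pivot distance computations; only the final verification in the leaves does.
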
Fixed Queries Array (FQA)
- based on FHFQT
- an h-level tree is transformed into an array of paths
- every leaf node is represented by its path from the root node
- each path is encoded as h distance values
- the search algorithm becomes a binary search in array intervals

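The array-of-paths idea can be sketched as follows, assuming integer distances and an integer query radius. The names `fqa_build` and `fqa_range` are hypothetical; paths are stored as tuples sorted lexicographically, and the standard-library `bisect_left` narrows the array interval level by level.

```python
from bisect import bisect_left


def fqa_build(objects, pivots, dist):
    """Encode each object as its path of h distances and sort the paths."""
    rows = sorted(((tuple(dist(o, p) for p in pivots), o) for o in objects),
                  key=lambda row: row[0])
    keys = [k for k, _ in rows]   # sorted paths
    objs = [o for _, o in rows]   # objects in the same order
    return keys, objs


def fqa_range(keys, objs, pivots, q, r, dist):
    """Range query R(q, r) via binary search in array intervals."""
    qd = [dist(q, p) for p in pivots]
    result = []

    def descend(level, prefix, lo, hi):
        if lo >= hi:
            return
        if level == len(pivots):
            # Candidates that survived all levels still need verification.
            result.extend(o for o in objs[lo:hi] if dist(q, o) <= r)
            return
        for v in range(max(qd[level] - r, 0), qd[level] + r + 1):
            # Interval of paths starting with prefix + (v,).
            a = bisect_left(keys, prefix + (v,), lo, hi)
            b = bisect_left(keys, prefix + (v + 1,), lo, hi)
            descend(level + 1, prefix + (v,), a, b)

    descend(0, (), 0, len(keys))
    return result


def hamming(a, b):
    """Hamming distance for equal-length strings (a discrete metric)."""
    return sum(x != y for x, y in zip(a, b))


words = ["book", "back", "bock", "cook", "cake", "boon"]
pivots = ["book", "cake"]  # assumed choice of h = 2 pivots
keys, objs = fqa_build(words, pivots, hamming)
print(sorted(fqa_range(keys, objs, pivots, "cock", 1, hamming)))
```
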
Vantage Point Tree (VPT)
- uses ball partitioning
- recursively divides a given data set X
- choose a vantage point p ∈ X and compute the median m of the distances from p to the other objects
- S_1 = { x ∈ X - {p} | d(x, p) ≤ m }
- S_2 = { x ∈ X - {p} | d(x, p) ≥ m }
- the equality sign in both definitions ensures balancing: objects at distance exactly m may go to either subset

VPT (cont.)
- One or more objects can be accommodated in leaves.
- The VP tree is a balanced binary tree.
- Static structure
- Pivots p_1, p_2 and p_3 belong to the database!
- In the following, we assume just one object in a leaf.
[Figure: a VP tree with pivots p_1, p_2, p_3, medians m_1, m_2, m_3 and objects o_1 to o_12 in the leaves]

VPT: Range Search
Given a query R(q, r):
- traverse the tree starting from its root
- in each internal node (p_i, m_i), do:
  - if d(q, p_i) ≤ r, report p_i on output
  - if d(q, p_i) - r ≤ m_i, search the left sub-tree (cases a and b in the figure)
  - if d(q, p_i) + r ≥ m_i, search the right sub-tree (case b)

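A minimal sketch of VPT construction and range search, under stated assumptions: the first object of each subset is used as the vantage point, one object is kept per node, and objects are split at the middle of the distance-sorted list so that the left part satisfies d ≤ m and the right part d ≥ m. The names `VPNode`, `vpt_build` and `vpt_range` are hypothetical.

```python
class VPNode:
    """A VP-tree node: pivot, median distance m, and two sub-trees."""
    def __init__(self, pivot, m=None, left=None, right=None):
        self.pivot, self.m, self.left, self.right = pivot, m, left, right


def vpt_build(objects, dist):
    """Ball partitioning around the first object, split at the median."""
    if not objects:
        return None
    p, rest = objects[0], objects[1:]
    if not rest:
        return VPNode(p)
    rest = sorted(rest, key=lambda o: dist(o, p))
    half = (len(rest) + 1) // 2
    m = dist(rest[half - 1], p)  # left part: d <= m, right part: d >= m
    return VPNode(p, m, vpt_build(rest[:half], dist), vpt_build(rest[half:], dist))


def vpt_range(node, q, r, dist, out=None):
    """Collect all objects o with dist(q, o) <= r."""
    if out is None:
        out = []
    if node is None:
        return out
    d = dist(q, node.pivot)
    if d <= r:
        out.append(node.pivot)
    if node.m is not None:
        if d - r <= node.m:   # query ball reaches inside the median ball
            vpt_range(node.left, q, r, dist, out)
        if d + r >= node.m:   # query ball reaches outside the median ball
            vpt_range(node.right, q, r, dist, out)
    return out


points = [5, 1, 9, 3, 7, 2, 8]
metric = lambda a, b: abs(a - b)
root = vpt_build(points, metric)
print(sorted(vpt_range(root, 4, 1.5, metric)))
# -> [3, 5]
```
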
VPT: k-NN Search
Given a query NN(q):
- initialization: d_NN = d_max, NN = nil
- traverse the tree starting from its root
- in each internal node (p_i, m_i), do:
  - if d(q, p_i) ≤ d_NN, set d_NN = d(q, p_i), NN = p_i
  - if d(q, p_i) - d_NN ≤ m_i, search the left sub-tree
  - if d(q, p_i) + d_NN ≥ m_i, search the right sub-tree
- k-NN search only requires the arrays d_NN[k] and NN[k]
- The arrays are kept ordered with respect to the distance to q.

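The k-NN variant can be sketched with a single ordered list standing in for the arrays d_NN[k] and NN[k]. This is an illustrative sketch only: `vpt_build` and `vpt_knn` are hypothetical names, nodes are plain tuples, the first object of each subset is the vantage point, and `math.inf` plays the role of d_max.

```python
import math


def vpt_build(objects, dist):
    """Build a VP-tree as nested tuples (pivot, m, left, right)."""
    if not objects:
        return None
    p = objects[0]
    rest = sorted(objects[1:], key=lambda o: dist(o, p))
    if not rest:
        return (p, None, None, None)
    half = (len(rest) + 1) // 2
    m = dist(rest[half - 1], p)  # left part: d <= m, right part: d >= m
    return (p, m, vpt_build(rest[:half], dist), vpt_build(rest[half:], dist))


def vpt_knn(root, q, k, dist):
    """Return up to k (distance, object) pairs ordered by distance to q."""
    best = []  # plays the role of the ordered arrays d_NN[k] and NN[k]

    def radius():
        # Current search radius: distance of the k-th neighbour found so far.
        return best[-1][0] if len(best) == k else math.inf

    def visit(node):
        if node is None:
            return
        p, m, left, right = node
        d = dist(q, p)
        if d <= radius():
            best.append((d, p))
            best.sort(key=lambda e: e[0])
            del best[k:]  # keep only the k closest
        if m is None:
            return
        if d - radius() <= m:
            visit(left)
        if d + radius() >= m:  # radius may have shrunk after the left visit
            visit(right)

    visit(root)
    return best


points = [5, 1, 9, 3, 7, 2, 8]
metric = lambda a, b: abs(a - b)
root = vpt_build(points, metric)
print([o for _, o in vpt_knn(root, 4.2, 3, metric)])
# -> [5, 3, 2]
```
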
Multi-Way Vantage Point Tree
- inherits all principles from VPT, but the partitioning is modified
- m-ary balanced tree
- applies multi-way ball partitioning

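One step of multi-way ball partitioning might look like the sketch below, which splits the objects into roughly equal-sized rings around the pivot. The function name `multiway_ball_partition` and the choice of equal-sized slices of the distance-sorted list are assumptions made for illustration.

```python
def multiway_ball_partition(rest, pivot, arity, dist):
    """Split `rest` into up to `arity` rings of similar size around `pivot`.

    Returns (bounds, parts): bounds[j] is the outer radius of parts[j],
    and the last part is unbounded, so len(bounds) == len(parts) - 1.
    """
    if not rest:
        return [], []
    ranked = sorted(rest, key=lambda o: dist(o, pivot))
    size = -(-len(ranked) // arity)  # ceiling division
    parts = [ranked[i:i + size] for i in range(0, len(ranked), size)]
    bounds = [dist(part[-1], pivot) for part in parts[:-1]]
    return bounds, parts


metric = lambda a, b: abs(a - b)
bounds, parts = multiway_ball_partition([4, 8, 1, 6, 3, 7, 2, 5], 0, 4, metric)
print(bounds)  # -> [2, 4, 6]
print(parts)   # -> [[1, 2], [3, 4], [5, 6], [7, 8]]
```

Applying this step recursively to each part, with a new vantage point chosen inside the part, yields the m-ary tree.
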
Vantage Point Forest (VPF)
- a forest of binary trees
- uses excluded middle partitioning
- the middle area, of width 2r around m_i, is excluded from the process of tree building

VPF (cont.)
- the given data set X is recursively divided and a binary tree is built
- excluded middle areas are used for building another binary tree
[Figure: the first tree built on X with excluded middle areas M_1, M_2, M_3; the second tree built on M_1 + M_2 + M_3]

VPF: Range Search
Given a query R(q, r):
- start with the first tree
- traverse the tree starting from its root
- in each internal node (p_i, m_i), do:
  - if d(q, p_i) ≤ r, report p_i
  - if d(q, p_i) - r ≤ m_i - r, search the left sub-tree
    - if d(q, p_i) + r ≥ m_i - r, also search the next tree
  - if d(q, p_i) + r ≥ m_i + r, search the right sub-tree
    - if d(q, p_i) - r ≤ m_i + r, also search the next tree
  - if d(q, p_i) - r ≥ m_i - r and d(q, p_i) + r ≤ m_i + r, search only the next tree

VPF: Range Search (cont.)
- Query intersects all partitions:
  - search both sub-trees
  - search the next tree
- Query collides only with the exclusion:
  - search just the next tree

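The forest construction and its range search can be sketched as below. Several choices here are assumptions rather than the slides' exact formulation: the exclusion half-width is a separate parameter `rho` (the slides use the query radius r for it), the left partition holds d ≤ m - rho, the right partition holds d > m + rho, and the excluded middle holds everything in between. The names `build_tree`, `build_forest` and `forest_range` are hypothetical.

```python
def build_tree(objects, rho, dist, excluded):
    """Excluded middle partitioning; middle objects are appended to `excluded`."""
    if not objects:
        return None
    p, rest = objects[0], objects[1:]
    if not rest:
        return (p, None, None, None)
    ds = sorted(dist(o, p) for o in rest)
    m = ds[(len(ds) - 1) // 2]  # (lower) median distance
    left = [o for o in rest if dist(o, p) <= m - rho]
    right = [o for o in rest if dist(o, p) > m + rho]
    excluded.extend(o for o in rest if m - rho < dist(o, p) <= m + rho)
    return (p, m, build_tree(left, rho, dist, excluded),
            build_tree(right, rho, dist, excluded))


def build_forest(objects, rho, dist):
    """Each tree is built from the objects excluded by the previous one."""
    forest = []
    while objects:
        excluded = []
        forest.append(build_tree(objects, rho, dist, excluded))
        objects = excluded  # strictly smaller: the root pivot is consumed
    return forest


def forest_range(forest, q, r, rho, dist):
    """Range query R(q, r) over the whole forest."""
    out = []

    def visit(node):
        """Search one tree; return True if an exclusion zone was touched."""
        if node is None:
            return False
        p, m, left, right = node
        d = dist(q, p)
        if d <= r:
            out.append(p)
        if m is None:
            return False
        touches = (d + r > m - rho) and (d - r <= m + rho)
        if d - r <= m - rho:
            touches |= visit(left)
        if d + r > m + rho:
            touches |= visit(right)
        return touches

    for tree in forest:
        if not visit(tree):
            break  # no excluded object can qualify, so later trees are skipped
    return out


points = [(i * 37) % 101 for i in range(60)]
metric = lambda a, b: abs(a - b)
forest = build_forest(points, 3, metric)
print(len(forest), sorted(forest_range(forest, 50, 4, 3, metric)))
```

The next tree is searched only when the query ball touches the exclusion zone of some visited node, which mirrors the "search the next tree" cases above.
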
Survey of existing approaches
1. ball partitioning methods
2. generalized hyper-plane partitioning approaches
   1. Bisector Tree
   2. Generalized Hyper-plane Tree
3. exploiting pre-computed distances
4. hybrid indexing approaches
5. approximated techniques

Bisector Tree (BT)
- Applies generalized hyper-plane partitioning
- Recursively divides a given dataset X
- Choose two arbitrary points p_1, p_2 ∈ X
- Form subsets from the remaining objects:
  S_1 = { o ∈ X, d(o, p_1) ≤ d(o, p_2) }
  S_2 = { o ∈ X, d(o, p_1) > d(o, p_2) }
- Covering radii r_1^c and r_2^c are established
- The balls can intersect!

BT: Range Query
Given a query R(q, r):
- traverse the tree starting from its root
- in each internal node <p_i, p_j>, do for each pivot p_x, x ∈ {i, j}:
  - if d(q, p_x) ≤ r, report p_x on output
  - if d(q, p_x) - r ≤ r_x^c, enter the child of p_x

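A minimal sketch of the Bisector Tree and its range query, assuming the first two objects of each subset serve as the arbitrary pivots and that ties go to the first pivot, as in the definition of S_1. The names `bt_build` and `bt_range` and the list-of-tuples node layout are hypothetical.

```python
import math


def bt_build(objects, dist):
    """A node is a list of (pivot, covering_radius, child) entries."""
    if not objects:
        return None
    pivots, rest = objects[:2], objects[2:]
    subsets = [[] for _ in pivots]
    for o in rest:
        ds = [dist(o, p) for p in pivots]
        # index(min) picks the first pivot on ties: d(o, p1) <= d(o, p2).
        subsets[ds.index(min(ds))].append(o)
    return [(p, max((dist(o, p) for o in s), default=0), bt_build(s, dist))
            for p, s in zip(pivots, subsets)]


def bt_range(node, q, r, dist, out=None):
    """Collect all objects o with dist(q, o) <= r."""
    if out is None:
        out = []
    if node is None:
        return out
    for p, cover, child in node:
        d = dist(q, p)
        if d <= r:
            out.append(p)
        if d - r <= cover:  # query ball intersects the covering ball of p
            bt_range(child, q, r, dist, out)
    return out


points = [(x, (x * x * 7) % 13) for x in range(20)]
tree = bt_build(points, math.dist)
print(sorted(bt_range(tree, (5, 5), 3.0, math.dist)))
```

Since the two covering balls can intersect, both children may be entered for the same query.
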
Monotonous Bisector Tree (MBT)
- A variant of Bisector Tree
- Child nodes inherit one pivot from the parent.
- For convenience, no covering radii are shown.
[Figure: a Bisector Tree compared with a Monotonous Bisector Tree]

MBT (cont.)
- Fewer pivots are used, which means fewer distance evaluations during query processing and more objects in leaves.
[Figure: a Bisector Tree with pivots p_1 to p_6 compared with a Monotonous Bisector Tree with pivots p_1 to p_4]

Voronoi Tree
- Extension of Bisector Tree
- Uses more pivots in each internal node
- Usually three pivots