SIMILARITY SEARCH: The Metric Space Approach
Pavel Zezula, Giuseppe Amato, Vlastislav Dohnal, Michal Batko

Table of Contents
Part I: Metric searching in a nutshell
- Foundations of metric space searching
- Survey of existing approaches
Part II: Metric searching in large collections
- Centralized index structures
- Approximate similarity search
- Parallel and distributed indexes
P. Zezula, G. Amato, V. Dohnal, M. Batko: Similarity Search: The Metric Space Approach, Part I, Chapter 2

Survey of existing approaches
1. ball partitioning methods
2. generalized hyper-plane partitioning approaches
3. exploiting pre-computed distances
4. hybrid indexing approaches
5. approximated techniques

Survey of existing approaches
1. ball partitioning methods
   1. Burkhard-Keller Tree
   2. Fixed Queries Tree
   3. Fixed Queries Array
   4. Vantage Point Tree
      1. Multi-Way Vantage Point Tree
   5. Excluded Middle Vantage Point Forest
2. generalized hyper-plane partitioning approaches
3. exploiting pre-computed distances
4. hybrid indexing approaches
5. approximated techniques

Burkhard-Keller Tree (BKT) [BK73]
- Applicable to discrete distance functions only.
- Recursively divides a given dataset $X$:
  - choose an arbitrary point $p_j \in X$ and form the subsets $X_i = \{ o \in X : d(o, p_j) = i \}$ for each distance $i \geq 0$;
  - for each $X_i$, create a sub-tree of $p_j$;
  - empty subsets are ignored.
[Figure: pivot $p_j$ with the subsets $X_2$, $X_3$, $X_4$ and the corresponding tree.]

BKT: Range Query
Given a query $R(q, r)$:
- traverse the tree starting from the root;
- in each internal node $p_j$:
  - if $d(q, p_j) \leq r$, report $p_j$ on output;
  - if $\max\{ d(q, p_j) - r, 0 \} \leq i \leq d(q, p_j) + r$, enter child $i$.
[Figure: query ball $R(q, r)$ over the pivots $p_1$, $p_2$, $p_3$ and the corresponding tree.]

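The construction and range query of the BKT can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the names `BKNode`, `build_bkt`, `bkt_range` and `int_dist` are hypothetical, the "arbitrary" pivot is simply the first object, and the toy data are integers under the discrete metric $|a - b|$.

```python
class BKNode:
    """One BKT node: a pivot plus one child per observed distance value."""

    def __init__(self, pivot):
        self.pivot = pivot
        self.children = {}  # distance i -> BKNode built over X_i


def build_bkt(objects, dist):
    """Recursively build a BKT; the first object serves as the pivot (assumption)."""
    if not objects:
        return None
    pivot, rest = objects[0], objects[1:]
    node = BKNode(pivot)
    subsets = {}  # only non-empty subsets X_i are ever created
    for o in rest:
        subsets.setdefault(dist(o, pivot), []).append(o)
    for i, subset in subsets.items():
        node.children[i] = build_bkt(subset, dist)
    return node


def bkt_range(node, q, r, dist, out=None):
    """Collect all objects o with dist(q, o) <= r."""
    if out is None:
        out = []
    if node is None:
        return out
    d = dist(q, node.pivot)
    if d <= r:
        out.append(node.pivot)
    lo, hi = max(d - r, 0), d + r
    for i, child in node.children.items():
        if lo <= i <= hi:  # child i may contain qualifying objects
            bkt_range(child, q, r, dist, out)
    return out


def int_dist(a, b):
    """Toy discrete metric on integers (assumption for the example)."""
    return abs(a - b)


data = [5, 1, 9, 12, 7, 3, 8]
tree = build_bkt(data, int_dist)
result = sorted(bkt_range(tree, 8, 2, int_dist))
```

The pruning test on `i` is safe because of the triangle inequality: any object at distance `i` from the pivot with `i` outside `[d - r, d + r]` must be farther than `r` from the query.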
Fixed Queries Tree (FQT)
- A modification of BKT.
- Each level has a single pivot.
- All objects are stored in leaves.
- During search, distance computations are saved: usually more branches are accessed, but all nodes on one level share one distance computation.
[Figure: a query $R(q, r)$ and an FQT that uses pivot $p_1$ on the first level and pivot $p_2$ on the second.]

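The saving comes from computing the query-to-pivot distance once per level and reusing it in every node on that level. The sketch below is one possible reading under stated assumptions: the names `build_fqt` and `fqt_range`, the externally supplied pivot list, the `leaf_size` parameter and the dict/list node encoding are all illustrative choices, not part of the slides.

```python
def build_fqt(objects, pivots, dist, level=0, leaf_size=1):
    """Internal node: dict distance -> child. Leaf: list of objects."""
    if level == len(pivots) or len(objects) <= leaf_size:
        return list(objects)
    subsets = {}
    for o in objects:
        # every node on this level partitions by the same pivot
        subsets.setdefault(dist(o, pivots[level]), []).append(o)
    return {i: build_fqt(s, pivots, dist, level + 1, leaf_size)
            for i, s in subsets.items()}


def fqt_range(tree, q, r, pivots, dist):
    """Range query; d(q, pivot) is computed at most once per level."""
    q_dists = {}  # level -> cached distance from q to that level's pivot
    out = []

    def visit(node, level):
        if isinstance(node, list):  # leaf: verify candidates directly
            out.extend(o for o in node if dist(q, o) <= r)
            return
        if level not in q_dists:
            q_dists[level] = dist(q, pivots[level])
        d = q_dists[level]
        for i, child in node.items():
            if max(d - r, 0) <= i <= d + r:
                visit(child, level + 1)

    visit(tree, 0)
    return out


def int_dist(a, b):
    """Toy discrete metric on integers (assumption for the example)."""
    return abs(a - b)


data = [5, 1, 9, 12, 7, 3, 8]
pivots = [0, 10]  # hypothetical pivots, one per level
fqt = build_fqt(data, pivots, int_dist)
result = sorted(fqt_range(fqt, 8, 2, pivots, int_dist))
```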
Fixed-Height FQT (FHFQT)
- An extension of FQT.
- All leaf nodes are at the same level.
- Increased filtering by using more routing objects.
- The extended tree depth does not typically introduce further computations.
[Figure: an FQT and the corresponding FHFQT over the pivots $p_1$ and $p_2$, with a query $R(q, r)$.]

Fixed Queries Array (FQA)
- Based on FHFQT.
- An $h$-level tree is transformed into an array of paths:
  - every leaf node is represented by its path from the root node;
  - each path is encoded as $h$ distance values.
- The search algorithm becomes a binary search in array intervals.
[Figure: an FHFQT over the pivots $p_1$ and $p_2$ and the array of distance paths derived from it.]

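One way to picture the array form: every object is stored with its $h$ pivot distances, the rows are sorted lexicographically, and each tree level becomes a binary search that narrows the current interval. The sketch below assumes integer distances and an integer radius; the names `build_fqa` and `fqa_range` and the two-list layout are hypothetical.

```python
import bisect


def build_fqa(objects, pivots, dist):
    """Return (keys, objs): sorted distance paths and the matching objects."""
    rows = sorted(((tuple(dist(o, p) for p in pivots), o) for o in objects),
                  key=lambda row: row[0])
    return [k for k, _ in rows], [o for _, o in rows]


def fqa_range(keys, objs, q, r, pivots, dist):
    """Range query by repeated binary search inside array intervals."""
    q_d = [dist(q, p) for p in pivots]  # one distance per pivot
    out = []

    def narrow(lo, hi, level, prefix):
        if lo >= hi:
            return
        if level == len(pivots):  # interval plays the role of a leaf
            out.extend(o for o in objs[lo:hi] if dist(q, o) <= r)
            return
        for i in range(max(q_d[level] - r, 0), q_d[level] + r + 1):
            # rows whose path starts with prefix + (i,) are contiguous
            a = bisect.bisect_left(keys, prefix + (i,), lo, hi)
            b = bisect.bisect_left(keys, prefix + (i + 1,), lo, hi)
            narrow(a, b, level + 1, prefix + (i,))

    narrow(0, len(keys), 0, ())
    return out


def int_dist(a, b):
    """Toy discrete metric on integers (assumption for the example)."""
    return abs(a - b)


data = [5, 1, 9, 12, 7, 3, 8]
pivots = [0, 10]  # hypothetical pivots, one per level
keys, objs = build_fqa(data, pivots, int_dist)
result = sorted(fqa_range(keys, objs, 8, 2, pivots, int_dist))
```

The binary search relies on Python's tuple ordering, where a prefix sorts before any longer tuple that extends it.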
Vantage Point Tree (VPT)
- Uses ball partitioning.
- Recursively divides a given data set $X$:
  - choose a vantage point $p \in X$ and compute the median $m$;
  - $S_1 = \{ x \in X - \{p\} : d(x, p) \leq m \}$
  - $S_2 = \{ x \in X - \{p\} : d(x, p) \geq m \}$
  - the equality sign ensures balancing.
[Figure: vantage points $p_1$, $p_2$ with medians $m_1$, $m_2$ and the subsets $S_{1,1}$, $S_{1,2}$.]

VPT (cont.)
- One or more objects can be accommodated in leaves.
- The VP tree is a balanced binary tree.
- Static structure.
- Pivots $p_1$, $p_2$ and $p_3$ belong to the database!
- In the following, we assume just one object in a leaf.
[Figure: a VPT with internal nodes $(p_1, m_1)$, $(p_2, m_2)$, $(p_3, m_3)$ and leaf objects $o_1, \dots, o_{12}$.]

VPT: Range Search
Given a query $R(q, r)$:
- traverse the tree starting from its root;
- in each internal node $(p_i, m_i)$:
  - if $d(q, p_i) \leq r$, report $p_i$ on output;
  - if $d(q, p_i) - r \leq m_i$, search the left sub-tree (cases a, b);
  - if $d(q, p_i) + r \geq m_i$, search the right sub-tree (case b).
[Figure: (a) and (b), two positions of the query ball $R(q, r)$ relative to the ball of radius $m_i$ around $p_i$.]

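The two pruning tests translate directly into code. The following is a minimal sketch under stated assumptions: the names `VPNode`, `build_vpt`, `vpt_range` and `l1` are hypothetical, the first object is taken as the vantage point, every object ends up as the pivot of some node, and the toy metric is the L1 distance on integer points so that all comparisons are exact.

```python
class VPNode:
    def __init__(self, pivot, median=None, left=None, right=None):
        self.pivot, self.median = pivot, median
        self.left, self.right = left, right


def build_vpt(objects, dist):
    """Ball partitioning: split the remaining objects at the median distance."""
    if not objects:
        return None
    pivot = objects[0]  # assumption: the first object is the vantage point
    rest = sorted(objects[1:], key=lambda o: dist(o, pivot))
    if not rest:
        return VPNode(pivot)
    k = (len(rest) + 1) // 2
    median = dist(rest[k - 1], pivot)
    # left holds d <= median, right holds d >= median; ties keep the halves balanced
    return VPNode(pivot, median, build_vpt(rest[:k], dist), build_vpt(rest[k:], dist))


def vpt_range(node, q, r, dist, out=None):
    if out is None:
        out = []
    if node is None:
        return out
    d = dist(q, node.pivot)
    if d <= r:
        out.append(node.pivot)
    if node.median is not None:
        if d - r <= node.median:   # query ball reaches inside the median ball
            vpt_range(node.left, q, r, dist, out)
        if d + r >= node.median:   # query ball reaches outside the median ball
            vpt_range(node.right, q, r, dist, out)
    return out


def l1(a, b):
    """Manhattan distance on integer points (assumption for the example)."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


points = [(0, 0), (1, 5), (2, 2), (4, 1), (5, 5), (6, 0), (7, 3), (3, 4)]
tree = build_vpt(points, l1)
result = sorted(vpt_range(tree, (3, 3), 3, l1))
```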
VPT: k-NN Search
Given a query $NN(q)$:
- initialization: $d_{NN} = d_{max}$, $NN = \mathrm{nil}$;
- traverse the tree starting from its root;
- in each internal node $(p_i, m_i)$:
  - if $d(q, p_i) \leq d_{NN}$, set $d_{NN} = d(q, p_i)$, $NN = p_i$;
  - if $d(q, p_i) - d_{NN} \leq m_i$, search the left sub-tree;
  - if $d(q, p_i) + d_{NN} \geq m_i$, search the right sub-tree.
- k-NN search only requires the arrays $d_{NN}[k]$ and $NN[k]$.
- The arrays are kept ordered with respect to the distance to $q$.

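A possible k-NN variant keeps one ordered list of `(distance, object)` pairs in place of the two arrays and uses the current k-th distance as a shrinking query radius. This is a sketch, not the book's code: `build_vpt`, `vpt_knn`, `l1` and the tuple node layout `(pivot, median, left, right)` are assumptions made for the example.

```python
import math


def build_vpt(objects, dist):
    """Node layout (assumption): (pivot, median, left, right)."""
    if not objects:
        return None
    pivot = objects[0]
    rest = sorted(objects[1:], key=lambda o: dist(o, pivot))
    if not rest:
        return (pivot, None, None, None)
    k = (len(rest) + 1) // 2
    median = dist(rest[k - 1], pivot)
    return (pivot, median, build_vpt(rest[:k], dist), build_vpt(rest[k:], dist))


def vpt_knn(tree, q, k, dist):
    """Return up to k (distance, object) pairs ordered by distance to q."""
    best = []  # kept ordered; never longer than k

    def radius():
        # d_NN: infinite until k candidates are known, then the k-th distance
        return best[-1][0] if len(best) == k else math.inf

    def visit(node):
        if node is None:
            return
        pivot, median, left, right = node
        d = dist(q, pivot)
        if d <= radius():
            best.append((d, pivot))
            best.sort(key=lambda pair: pair[0])
            del best[k:]
        if median is None:
            return
        if d - radius() <= median:
            visit(left)
        if d + radius() >= median:  # radius re-read: it may have shrunk
            visit(right)

    visit(tree)
    return best


def l1(a, b):
    """Manhattan distance on integer points (assumption for the example)."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


points = [(0, 0), (1, 5), (2, 2), (4, 1), (5, 5), (6, 0), (7, 3), (3, 4)]
neighbours = vpt_knn(build_vpt(points, l1), (3, 3), 3, l1)
```

Re-reading the radius before the second test matters: objects found in the left sub-tree can tighten $d_{NN}$ enough to skip the right one.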
Multi-Way Vantage Point Tree
- Inherits all principles from VPT, but the partitioning is modified.
- An $m$-ary balanced tree.
- Applies multi-way ball partitioning.
[Figure: pivot $p_1$ with the boundaries $m_1$, $m_2$, $m_3$ splitting the data into $S_{1,1}$, $S_{1,2}$, $S_{1,3}$, $S_{1,4}$.]

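Multi-way ball partitioning can be illustrated as a split of the distance-ranked objects into $m$ groups of (nearly) equal size. The sketch below is an assumption-laden illustration: it stores a `(lowest, highest)` distance pair per group instead of the shared boundaries $m_1, m_2, \dots$, and the names `multiway_partition` and `groups_to_visit` are hypothetical.

```python
def multiway_partition(objects, pivot, m, dist):
    """Split objects into at most m groups of nearly equal size by distance to pivot."""
    ranked = sorted(objects, key=lambda o: dist(o, pivot))
    size = max(1, -(-len(ranked) // m))  # ceil(len / m), at least 1
    groups = [ranked[i:i + size] for i in range(0, len(ranked), size)]
    # per-group distance interval (an illustrative stand-in for m_1, m_2, ...)
    bounds = [(dist(g[0], pivot), dist(g[-1], pivot)) for g in groups]
    return groups, bounds


def groups_to_visit(bounds, d_q, r):
    """Indices of groups whose distance interval meets [d_q - r, d_q + r]."""
    return [j for j, (lo, hi) in enumerate(bounds)
            if d_q - r <= hi and d_q + r >= lo]


def int_dist(a, b):
    """Toy metric on integers (assumption for the example)."""
    return abs(a - b)


groups, bounds = multiway_partition(list(range(1, 9)), 0, 4, int_dist)
visited = groups_to_visit(bounds, d_q=4, r=1)
```

With the pivot at 0, a query at distance 4 and radius 1 only needs the two groups whose distance intervals overlap $[3, 5]$.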
Vantage Point Forest (VPF)
- A forest of binary trees.
- Uses excluded middle partitioning.
- The middle area is excluded from the process of tree building.
[Figure: the ball of radius $m_i$ around $p_i$ with an excluded middle band of width $2\rho$.]

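Excluded middle partitioning can be written as a three-way split around the median distance. The sketch assumes the band is bounded as $m - \rho < d \leq m + \rho$, which is one common convention rather than something the slide states; the function name `excluded_middle_partition` and the parameter name `rho` are hypothetical.

```python
def excluded_middle_partition(objects, pivot, median, rho, dist):
    """Return (left, right, middle); middle is the excluded band of width 2 * rho."""
    left, right, middle = [], [], []
    for o in objects:
        d = dist(o, pivot)
        if d <= median - rho:
            left.append(o)       # clearly inside the median ball
        elif d > median + rho:
            right.append(o)      # clearly outside the median ball
        else:
            middle.append(o)     # excluded from this tree
    return left, right, middle


def int_dist(a, b):
    """Toy metric on integers (assumption for the example)."""
    return abs(a - b)


left, right, middle = excluded_middle_partition(
    list(range(1, 10)), pivot=0, median=5, rho=1, dist=int_dist)
```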
VPF (cont.)
- The given data set $X$ is recursively divided and a binary tree is built.
- The excluded middle areas are used for building another binary tree.
[Figure: the first tree over $X$ with pivots $p_1$, $p_2$, $p_3$, subsets $S_{1,1}$, $S_{1,2}$, $S_{2,1}$, $S_{2,2}$ and excluded areas $M_1$, $M_2$, $M_3$; the second tree, built over $M_1 + M_2 + M_3$, with pivots $p'_1$, $p'_2$, $p'_3$, subsets $S'_{1,1}$, $S'_{1,2}$, $S'_{2,1}$, $S'_{2,2}$ and excluded areas $M'_1$, $M'_2$, $M'_3$.]

VPF: Range Search
Given a query $R(q, r)$:
- start with the first tree;
- traverse the tree starting from its root;
- in each internal node $(p_i, m_i)$, where $2\rho$ is the width of the excluded middle:
  - if $d(q, p_i) \leq r$, report $p_i$;
  - if $d(q, p_i) - r \leq m_i - \rho$, search the left sub-tree;
    - if also $d(q, p_i) + r \geq m_i - \rho$, search the next tree!
  - if $d(q, p_i) + r \geq m_i + \rho$, search the right sub-tree;
    - if also $d(q, p_i) - r \leq m_i + \rho$, search the next tree!
  - if $d(q, p_i) - r \geq m_i - \rho$ and $d(q, p_i) + r \leq m_i + \rho$, search only the next tree!

VPF: Range Search (cont.)
- Query intersects all partitions: search both sub-trees and search the next tree.
- Query collides only with the exclusion: search just the next tree.
[Figure: the two cases drawn on the ball of radius $m_i$ around $p_i$ with an excluded band of width $2\rho$.]

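Putting the forest construction and the range search together gives the sketch below. It rests on several assumptions that are not in the slides: the names `build_tree`, `build_forest`, `forest_range` and `rho`, the choice of the first object as pivot, the tuple node layout, and integer toy data so that all comparisons are exact.

```python
def build_tree(objects, rho, dist, excluded):
    """Build one binary tree; excluded-middle objects are appended to `excluded`."""
    if not objects:
        return None
    pivot, rest = objects[0], objects[1:]
    if not rest:
        return (pivot, None, None, None)
    ds = sorted(dist(o, pivot) for o in rest)
    median = ds[len(ds) // 2]
    left = [o for o in rest if dist(o, pivot) <= median - rho]
    right = [o for o in rest if dist(o, pivot) > median + rho]
    excluded.extend(o for o in rest
                    if median - rho < dist(o, pivot) <= median + rho)
    return (pivot, median,
            build_tree(left, rho, dist, excluded),
            build_tree(right, rho, dist, excluded))


def build_forest(objects, rho, dist):
    forest = []
    while objects:  # each tree keeps at least its root, so this terminates
        excluded = []
        forest.append(build_tree(objects, rho, dist, excluded))
        objects = excluded
    return forest


def forest_range(forest, q, r, rho, dist):
    out = []
    for tree in forest:
        need_next, stack = False, [tree]
        while stack:
            node = stack.pop()
            if node is None:
                continue
            pivot, median, left, right = node
            d = dist(q, pivot)
            if d <= r:
                out.append(pivot)
            if median is None:
                continue
            if d - r <= median - rho:
                stack.append(left)
            if d + r >= median + rho:
                stack.append(right)
            if d + r >= median - rho and d - r <= median + rho:
                need_next = True  # query touches the excluded band
        if not need_next:
            break  # no visited node sends the query to later trees
    return out


def int_dist(a, b):
    """Toy metric on integers (assumption for the example)."""
    return abs(a - b)


data = [5, 1, 9, 12, 7, 3, 8, 15, 2]
forest = build_forest(data, 1, int_dist)
result = sorted(forest_range(forest, 8, 2, 1, int_dist))
```

Stopping early is sound here because objects excluded below a pruned sub-tree lie in a region the query cannot reach.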
Survey of existing approaches
1. ball partitioning methods
2. generalized hyper-plane partitioning approaches
   1. Bisector Tree
   2. Generalized Hyper-plane Tree
3. exploiting pre-computed distances
4. hybrid indexing approaches
5. approximated techniques

Bisector Tree (BT)
- Applies generalized hyper-plane partitioning.
- Recursively divides a given dataset $X$:
  - choose two arbitrary points $p_1, p_2 \in X$;
  - form subsets from the remaining objects:
    $S_1 = \{ o \in X : d(o, p_1) \leq d(o, p_2) \}$
    $S_2 = \{ o \in X : d(o, p_1) > d(o, p_2) \}$
- Covering radii $r_1^c$ and $r_2^c$ are established.
- The balls can intersect!
[Figure: pivots $p_1$ and $p_2$ with their covering radii $r_1^c$ and $r_2^c$.]

BT: Range Query
Given a query $R(q, r)$:
- traverse the tree starting from its root;
- in each internal node $\langle p_i, p_j \rangle$, for each pivot $p_x$ with $x \in \{i, j\}$:
  - if $d(q, p_x) \leq r$, report $p_x$ on output;
  - if $d(q, p_x) - r \leq r_x^c$, enter the child of $p_x$.
[Figure: query ball $R(q, r)$ and the covering balls of $p_i$ and $p_j$ with radii $r_i^c$ and $r_j^c$.]

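A compact sketch of the Bisector Tree with covering radii follows. It is illustrative only: the names `build_bt`, `bt_range` and `l1`, the dict-based node layout, the `leaf_size` parameter (assumed to be at least 1) and the choice of the first two objects as pivots are all assumptions.

```python
def build_bt(objects, dist, leaf_size=2):
    """Generalized hyper-plane partitioning with covering radii."""
    if len(objects) <= leaf_size:
        return {"leaf": list(objects)}
    p1, p2, rest = objects[0], objects[1], objects[2:]
    s1 = [o for o in rest if dist(o, p1) <= dist(o, p2)]
    s2 = [o for o in rest if dist(o, p1) > dist(o, p2)]
    return {
        "pivots": (p1, p2),
        # covering radius: distance from the pivot to its farthest object
        "radii": (max((dist(o, p1) for o in s1), default=0),
                  max((dist(o, p2) for o in s2), default=0)),
        "children": (build_bt(s1, dist, leaf_size),
                     build_bt(s2, dist, leaf_size)),
    }


def bt_range(node, q, r, dist, out=None):
    if out is None:
        out = []
    if "leaf" in node:
        out.extend(o for o in node["leaf"] if dist(q, o) <= r)
        return out
    for pivot, radius, child in zip(node["pivots"], node["radii"],
                                    node["children"]):
        d = dist(q, pivot)
        if d <= r:
            out.append(pivot)
        if d - r <= radius:  # query ball intersects the covering ball
            bt_range(child, q, r, dist, out)
    return out


def l1(a, b):
    """Manhattan distance on integer points (assumption for the example)."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


points = [(0, 0), (1, 5), (2, 2), (4, 1), (5, 5), (6, 0), (7, 3), (3, 4)]
bt = build_bt(points, l1)
result = sorted(bt_range(bt, (3, 3), 3, l1))
```

Because the covering balls can intersect, both children may have to be entered even for a small query radius.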
Monotonous Bisector Tree (MBT)
- A variant of the Bisector Tree.
- Child nodes inherit one pivot from the parent.
- For convenience, no covering radii are shown.
[Figure: the same data partitioned by a Bisector Tree (pivots $p_1, \dots, p_6$) and by a Monotonous Bisector Tree (pivots $p_1, \dots, p_4$).]

MBT (cont.)
- Fewer pivots are used, which means fewer distance evaluations during query processing and more objects in leaves.
[Figure: a Bisector Tree with root $\langle p_1, p_2 \rangle$ and children $\langle p_3, p_4 \rangle$, $\langle p_5, p_6 \rangle$, next to a Monotonous Bisector Tree with root $\langle p_1, p_2 \rangle$ and children $\langle p_1, p_3 \rangle$, $\langle p_2, p_4 \rangle$.]

Voronoi Tree
- An extension of the Bisector Tree.
- Uses more pivots in each internal node.
- Usually three pivots.
[Figure: pivots $p_1$, $p_2$, $p_3$ with covering radii $r_1^c$, $r_2^c$, $r_3^c$.]
