Analysis of hierarchical metric-tree indexing schemes for similarity search in high-dimensional datasets
Vladimir Pestov, vpest283@uottawa.ca, http://aix1.uottawa.ca/ vpest283
Department of Mathematics and Statistics, University of Ottawa


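  3. Pruning
     • If $B_\varepsilon(\omega) \cap B = \emptyset$, the sub-tree descending from the node $B$ can be pruned; that is, if it can be certified that $\omega \notin B_\varepsilon = \{x \in \Omega : d(x, B) < \varepsilon\}$.
     • Otherwise the search branches out.
     How to “certify” that $B_\varepsilon(\omega) \cap B = \emptyset$?
     [Figure: nodes $A$ and $B$ with their $\varepsilon$-neighbourhoods $A_\varepsilon$, $B_\varepsilon$ and a query point $\omega$ with search radius $\varepsilon$.]
     Metric tree indexing schemes for similarity search. Vladimir Pestov, University of Ottawa. AofA 2008, Maresias, SP, Brazil, p.7/25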

  10. Decision functions
      Let $f\colon \Omega \to \mathbb{R}$ be a 1-Lipschitz function, $|f(x) - f(y)| \le d(x, y)$ for all $x, y \in \Omega$, such that $f|_B \le 0$. Then $f|_{B_\varepsilon} < \varepsilon$; that is, $f(\omega) \ge \varepsilon$ is a certificate that $B_\varepsilon(\omega) \cap B = \emptyset$.
      [Figure: graph of $f$ over $\Omega$, with the set $B$, the levels 0 and $\varepsilon$, and points $x$, $y$ marked.]
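
One concrete decision function, offered as an illustrative assumption rather than something the slides prescribe, is the distance to a pivot minus a radius. If the bin $B$ lies inside the closed ball of radius $r$ around a point $p$, then $f(x) = d(x, p) - r$ is 1-Lipschitz by the triangle inequality and non-positive on $B$. The sketch below uses the Euclidean metric; all function names are hypothetical.

```python
import math


def euclidean(x, y):
    """Euclidean distance between two points given as equal-length tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def make_ball_decision_function(center, radius, dist=euclidean):
    """Return f(x) = dist(x, center) - radius.

    f is 1-Lipschitz (triangle inequality) and f <= 0 on the closed ball
    of the given radius, so it is a valid decision function for any bin
    contained in that ball.
    """
    def f(x):
        return dist(x, center) - radius
    return f


def can_prune(f, query, eps):
    """f(query) >= eps certifies that the open eps-ball around the query misses the bin."""
    return f(query) >= eps


f = make_ball_decision_function(center=(0.0, 0.0), radius=1.0)
print(can_prune(f, (3.0, 0.0), 1.5))  # f = 2.0, so the certificate holds
print(can_prune(f, (2.0, 0.0), 1.5))  # f = 1.0, so no certificate is available
```

A failed certificate does not mean the query ball meets the bin; it only means this particular $f$ cannot rule it out, so the search has to descend.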

  11. Metric trees
      A metric tree for a metric similarity workload $(\Omega, \rho, X)$ consists of:
      • a binary rooted tree $T$,
      • a collection of partially defined 1-Lipschitz functions $f_t\colon B_t \to \mathbb{R}$ for every inner node $t$ (decision functions),
      • a collection of bins $B_t \subseteq \Omega$ for every leaf node $t$, containing pointers to elements $X \cap B_t$,
      such that $B_{\mathrm{root}(T)} = \Omega$ and, for every inner node $t$ with child nodes $t_-$, $t_+$, $B_t \subseteq B_{t_-} \cup B_{t_+}$.
      When processing a range query $B_\varepsilon(\omega)$, $t_-$ [$t_+$] is accessed $\iff$ $f_t(\omega) < \varepsilon$ [resp. $f_t(\omega) > -\varepsilon$].
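
The access rule can be sketched on a small tree. The construction below is an assumption chosen for illustration (a vantage-point style split with $f_t(x) = d(x, p_t) - r_t$, where $r_t$ is the median distance to a pivot $p_t$); the slides define metric trees in general and do not commit to it. The dictionary keys and function names are hypothetical.

```python
import math
import random
from statistics import median


def dist(x, y):
    """Euclidean distance; any metric would do."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def build(points, leaf_size=4):
    """Build a binary metric tree with decision functions f_t(x) = d(x, pivot) - radius."""
    if len(points) <= leaf_size:
        return {"bin": points}
    pivot = points[0]
    radius = median(dist(pivot, p) for p in points)
    minus = [p for p in points if dist(pivot, p) - radius <= 0]  # f_t <= 0
    plus = [p for p in points if dist(pivot, p) - radius > 0]    # f_t > 0
    if not plus:  # degenerate split (e.g. many duplicates): stop here
        return {"bin": points}
    return {"pivot": pivot, "radius": radius,
            "minus": build(minus, leaf_size), "plus": build(plus, leaf_size)}


def range_query(node, query, eps):
    """Return all stored points at distance < eps from the query."""
    if "bin" in node:
        return [p for p in node["bin"] if dist(p, query) < eps]
    f = dist(query, node["pivot"]) - node["radius"]
    result = []
    if f < eps:    # t_- is accessed iff f_t(query) < eps
        result += range_query(node["minus"], query, eps)
    if f > -eps:   # t_+ is accessed iff f_t(query) > -eps
        result += range_query(node["plus"], query, eps)
    return result


rng = random.Random(0)
points = [(rng.random(), rng.random()) for _ in range(200)]
tree = build(points)
hits = range_query(tree, (0.5, 0.5), 0.1)
brute = [p for p in points if dist(p, (0.5, 0.5)) < 0.1]
print(sorted(hits) == sorted(brute))
```

When both conditions hold at a node, the search branches out and visits both children, which is the behaviour the later slides quantify.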

  15. What happens in practice?
      The best indexing schemes for exact similarity search in high-dimensional outer datasets are often (not always!) outperformed by linear scan.
      ∗ ∗ ∗
      The emphasis has shifted towards approximate similarity search: given $\varepsilon > 0$ and $\omega \in \Omega$, return a point that is [with high probability] at a distance $< (1 + \varepsilon)\, d_{NN}(\omega)$ from $\omega$.

  23. The curse of dimensionality conjecture
      Conjecture. Let $X \subseteq \{0, 1\}^d$ be a dataset with $n$ points, where the Hamming cube is equipped with the Hamming ($\ell_1$) distance: $d(x, y) = \sharp\{i : x_i \ne y_i\}$. Suppose $d = n^{o(1)}$, but $d = \omega(\log n)$. Any data structure for exact nearest neighbour search in $X$, with $d^{O(1)}$ query time, must use $n^{\omega(1)}$ space.
      ∗ ∗ ∗
      The cell probe model: $\Omega(d / \log n)$ lower bound (Barkol–Rabani, 2000).
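
For reference, the Hamming distance and the linear-scan baseline for exact nearest neighbour search can be written in a few lines. This is a minimal sketch with hypothetical function names; it uses $O(nd)$ time per query and no extra space beyond the dataset, which is the trade-off the conjecture says cannot be improved to $d^{O(1)}$ query time without superpolynomial space.

```python
def hamming(x, y):
    """Number of coordinates in which two equal-length 0/1 tuples differ."""
    return sum(a != b for a, b in zip(x, y))


def nearest_neighbour(dataset, query):
    """Exact nearest neighbour by linear scan over the whole dataset."""
    return min(dataset, key=lambda x: hamming(x, query))


dataset = [(0, 0, 0, 0), (1, 1, 1, 1), (1, 0, 1, 0)]
print(nearest_neighbour(dataset, (0, 0, 0, 1)))
```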

  27. Concentration of measure
      The phenomenon of concentration of measure on high-dimensional structures (“Geometric LLN”): for a typical “high-dimensional” structure $\Omega$, if $A$ is a subset containing at least half of all points, then the measure of the $\varepsilon$-neighbourhood $A_\varepsilon$ of $A$ is overwhelmingly close to 1 already for small $\varepsilon > 0$.
      [Figure: $\Omega$ with a subset $A$ containing at least half of all points and its $\varepsilon$-neighbourhood $A_\varepsilon$; $\alpha(\Omega, \varepsilon)$ bounds $\mu(\Omega \setminus A_\varepsilon)$ from above.]

  32. Concentration function
      Let $\Omega = (\Omega, d, \mu)$ be a metric space with measure. The concentration function of $\Omega$:
      $$\alpha(\varepsilon) = \begin{cases} \frac{1}{2}, & \text{if } \varepsilon = 0, \\ 1 - \min\left\{ \mu_\sharp(A_\varepsilon) : A \subseteq \Omega,\ \mu_\sharp(A) \ge \frac{1}{2} \right\}, & \text{if } \varepsilon > 0. \end{cases}$$
      For $\Omega = \Sigma^n$, the Hamming cube (normalized distance + uniform measure): $\alpha_{\Sigma^n}(\varepsilon) \le e^{-2\varepsilon^2 n}$.
      Gaussian estimates are typical (Euclidean spheres $S^n$, cubes $I^n$, ...).

  33. Example: the Hamming cube
      [Figure: concentration function $\alpha(\Sigma^{101}, \varepsilon)$ versus Chernoff bound, $n = 101$, for $\varepsilon$ from 0 to 0.2.]
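
The bound $e^{-2\varepsilon^2 n}$ can be compared numerically against a binomial tail. The sketch below rests on an assumption: it uses the upper tail of the normalized Hamming weight, $P(\text{weight}/n \ge 1/2 + \varepsilon)$ under the uniform measure, as a stand-in for the concentration function, and does not attempt to reproduce the exact curve plotted in the slide. Function names are hypothetical.

```python
import math


def upper_tail(n, eps):
    """P(weight/n >= 1/2 + eps) for a uniformly random point of {0,1}^n."""
    # Small tolerance so that exact thresholds are not pushed up by rounding.
    threshold = math.ceil(n * (0.5 + eps) - 1e-12)
    return sum(math.comb(n, k) for k in range(threshold, n + 1)) / 2 ** n


def chernoff_bound(n, eps):
    """The Gaussian-type upper bound exp(-2 * eps^2 * n)."""
    return math.exp(-2 * eps ** 2 * n)


for eps in (0.05, 0.1, 0.15, 0.2):
    print(eps, upper_tail(101, eps), chernoff_bound(101, eps))
```

In each row the tail is no larger than the bound (this is Hoeffding's inequality), and both columns shrink quickly as $\varepsilon$ grows.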

  36. Effects of concentration on branching
      [Figure: a node $C$ with child nodes $A$ and $B$, their $\varepsilon$-neighbourhoods $A_\varepsilon$, $B_\varepsilon$, and a query point $\omega$ with search radius $\varepsilon$; two regions are each labelled $< \alpha(C, \varepsilon)$.]
      For all query points $\omega \in C$ except a set of measure $\le 2\alpha(C, \varepsilon)$, the search algorithm branches out at the node $C$.

  37. Search radius
      $\varepsilon_{NN}(\omega)$ is a 1-Lipschitz function, so it concentrates near the median value, $\varepsilon_M$; $\varepsilon_M \to E_{\mu \otimes \mu}\, d(x, y) = O(1)$.
      Example: 1000 points $\sim [0, 1]^{10}$, the $\ell_2$-$\varepsilon_{NN}$: $E\, d(x, y) = 1.2765$, $\varepsilon_M = 0.69419$.
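
An experiment of this kind can be sketched as follows. The setup is an assumption about how such numbers might be obtained, not a reconstruction of the author's experiment: points are drawn uniformly from $[0,1]^{\dim}$, and the nearest-neighbour distance of each point is measured against the remaining points of the same sample rather than against separate query points. The function name is hypothetical.

```python
import math
import random
from statistics import median


def pairwise_and_nn_stats(n_points, dim, seed=0):
    """Return (mean pairwise l2 distance, median nearest-neighbour distance)."""
    rng = random.Random(seed)
    pts = [[rng.random() for _ in range(dim)] for _ in range(n_points)]
    total, count = 0.0, 0
    nn = [math.inf] * n_points
    for i in range(n_points):
        for j in range(i + 1, n_points):
            d = math.dist(pts[i], pts[j])
            total += d
            count += 1
            nn[i] = min(nn[i], d)  # leave-one-out NN distance of point i
            nn[j] = min(nn[j], d)
    return total / count, median(nn)


# 200 points keep the pure-Python double loop fast; the slide's example uses 1000.
mean_dist, median_nn = pairwise_and_nn_stats(200, 10)
print(mean_dist, median_nn)
```

The median nearest-neighbour distance comes out as a sizeable fraction of the mean pairwise distance, which is the point of the slide: in high dimension the search radius is not small relative to the scale of the space.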

  45. A naive average $O(n)$ lower bound
      Suppose datapoints are distributed according to $\mu \in P(\Omega)$, as well as query points.
      A balanced metric tree of depth $O(\log n)$, with $O(n)$ bins of roughly equal size ($\mu$-measure).
      In 1/2 the cases, $\varepsilon_{NN} \ge \varepsilon_M = O(1)$, the median NN distance.
      For every element $A$ of the level $t$ partition,
      $$\alpha(A, \varepsilon_M) \le 2\mu(A)^{-1} \alpha(\Omega, \varepsilon_M / 2) = O(2^t)\, e^{-O(1)\varepsilon_M^2 d}.$$
      $\Rightarrow$ branching at every node occurs for all $\omega$ except a set of measure at most
      $$\sharp(\text{nodes}) \times 2 \sup_A \alpha(A, \varepsilon) = O(n^2)\, e^{-O(1) d} = o(1),$$
      because $d = \omega(\log n)$ $\Rightarrow$ $e^{-O(1) d}$ is superpolynomially small in $n$.
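
To see why the exceptional set vanishes, one can tabulate $n^2 e^{-cd}$ for a dimension that grows faster than $\log n$. The sketch below is illustrative only: the constant $c = 1$ and the choice $d = (\log n)^2$ are assumptions standing in for the unspecified $O(1)$ constant and for any $d = \omega(\log n)$ with $d = n^{o(1)}$.

```python
import math


def exceptional_mass_bound(n, d, c=1.0):
    """n^2 * exp(-c * d), evaluated as exp(2 log n - c d) for numerical stability."""
    return math.exp(2 * math.log(n) - c * d)


for n in (10 ** 2, 10 ** 4, 10 ** 6):
    d = math.log(n) ** 2  # assumed growth rate: omega(log n) but n^{o(1)}
    print(n, round(d, 1), exceptional_mass_bound(n, d))
```

The polynomial factor $n^2$ is overwhelmed by the exponential factor, so the bound tends to zero as $n$ grows.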

  50. What’s wrong?
      A dataset $X$ is modeled by a sequence of i.i.d. random variables $X_i \sim \mu$.
      Implicit assumption: the empirical measure $\mu_n(A) = |X \cap A| / n \approx \mu(A)$.
      But the scheme is chosen after seeing an instance $X$!
      [Figure: plot with both axes ranging from 0 to 1.]
      How much can be said of concentration in $(\Omega, \mu_n)$?

  54. VC dimension
      Let $\mathcal{A}$ be a family of subsets of $\Omega$ (a concept class). $B \subseteq \Omega$ is shattered by $\mathcal{A}$ if for each $C \subseteq B$ there is $A \in \mathcal{A}$ such that $A \cap B = C$.
      [Figure: sets $A$, $B$ and $C$ inside $\Omega$.]
      The Vapnik–Chervonenkis dimension $\text{VC-dim}(\mathcal{A})$ of $\mathcal{A}$ is the largest cardinality of a set $B \subseteq \Omega$ shattered by $\mathcal{A}$.
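
On a finite domain the definition can be checked by brute force. The sketch below is a minimal illustration with hypothetical function names; the example concept class (all intervals of $\{0, \dots, 4\}$ together with the empty set) is chosen here and is not taken from the slides.

```python
from itertools import combinations


def is_shattered(B, concepts):
    """B is shattered if the traces A ∩ B, over all concepts A, give every subset of B."""
    B = frozenset(B)
    traces = {frozenset(A) & B for A in concepts}
    return len(traces) == 2 ** len(B)


def vc_dimension(domain, concepts):
    """Largest size of a shattered subset of a finite domain (exponential-time search)."""
    best = 0
    for k in range(1, len(domain) + 1):
        if any(is_shattered(B, concepts) for B in combinations(domain, k)):
            best = k
        else:
            break  # subsets of shattered sets are shattered, so larger sizes fail too
    return best


domain = list(range(5))
intervals = [set(range(a, b + 1)) for a in domain for b in range(a, 5)] + [set()]
print(vc_dimension(domain, intervals))  # 2
```

Any two points can be separated in all four ways by intervals, but for three points $x < y < z$ no interval contains $x$ and $z$ without $y$, so the dimension is 2.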

  59. Statistical learning bounds
      Let $\mathcal{A} \subseteq 2^\Omega$ be a concept class of finite VC dimension, $d$. Then for all $\varepsilon, \delta > 0$ and every probability measure $\mu$ on $\Omega$, if $n$ datapoints in $X$ are drawn randomly and independently according to $\mu$, then with confidence $1 - \delta$
      $$\forall A \in \mathcal{A}, \quad \left| \mu(A) - \frac{|X \cap A|}{n} \right| < \varepsilon,$$
      provided $n$ is large enough:
      $$n \ge \frac{128}{\varepsilon^2} \left( d \log\left( \frac{2e^2}{\varepsilon} \log\frac{2e}{\varepsilon} \right) + \log\frac{8}{\delta} \right).$$
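
Uniform convergence of the empirical measure can be observed directly in a simple case. The sketch below is an illustration under assumptions chosen here, not taken from the slides: $\mu$ is the uniform measure on $[0, 1]$ and the concept class is the family of initial segments $[0, t]$, which has VC dimension 1. For this class the supremum of $|\mu(A) - |X \cap A|/n|$ over all concepts can be computed exactly from the sorted sample.

```python
import random


def sup_deviation(sample):
    """sup over t in [0, 1] of | t - #{x in sample : x <= t} / n | for a sample in [0, 1]."""
    xs = sorted(sample)
    n = len(xs)
    # The supremum is attained just before or at one of the sample points.
    return max(max((i + 1) / n - x, x - i / n) for i, x in enumerate(xs))


rng = random.Random(0)
for n in (100, 1000, 10000):
    sample = [rng.random() for _ in range(n)]
    print(n, sup_deviation(sample))
```

The deviation shrinks as $n$ grows, uniformly over the whole class, which is what the bound on this slide guarantees for any class of finite VC dimension.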

  63. Bin access lemma
      Let $\delta > 0$, and let $\gamma$ be a collection of subsets $A \subseteq \Omega$ of measure $\mu(A) \le \alpha(\delta) \le \frac{1}{4}$ each, satisfying $\mu(\cup\gamma) \ge 1/2$. Then the $2\delta$-neighbourhood of every point $\omega \in \Omega$, apart from a set of measure at most $\frac{1}{2}\alpha(\delta)^{1/2}$, meets at least $\left\lceil \frac{1}{2}\alpha(\delta)^{-1/2} \right\rceil$ elements of $\gamma$.
      ∗ ∗ ∗
      If we can now guarantee that the bins are not too large, we get a lower bound on the number of bin accesses.
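
A quick calculation shows what the lemma gives for a specific value of the concentration function. The sketch assumes the two bounds read $\frac{1}{2}\alpha(\delta)^{1/2}$ for the exceptional measure and $\lceil \frac{1}{2}\alpha(\delta)^{-1/2} \rceil$ for the number of bins met; the function name and the sample value of $\alpha(\delta)$ are hypothetical.

```python
import math


def bin_access_bounds(alpha_delta):
    """Return (exceptional measure bound, minimum number of bins met)."""
    if not 0 < alpha_delta <= 0.25:
        raise ValueError("the lemma assumes 0 < alpha(delta) <= 1/4")
    root = math.sqrt(alpha_delta)
    exceptional_measure = 0.5 * root
    min_bins_met = math.ceil(0.5 / root)
    return exceptional_measure, min_bins_met


print(bin_access_bounds(2 ** -8))  # (0.03125, 8)
```

The smaller $\alpha(\delta)$ is, the more bins a typical query must touch and the smaller the set of queries that escape this.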

  68. Bin complexity estimates
      Let $\mathcal{F}$ be a class of 1-Lipschitz functions used for constructing a metric tree of a particular type. Let $\mathcal{A}$ be the concept class of all solution sets to inequalities $f \gtrless a$, $f \in \mathcal{F}$, $a \in \mathbb{R}$. Suppose $p = \text{VC-dim}(\mathcal{A}) < \infty$ (pseudodimension of $\mathcal{F}$ in the sense of Vapnik). Denote by $\mathcal{B}$ the class of all bins of all possible metric trees of depth $\le h$ built using $\mathcal{F}$. Then $\text{VC-dim}(\mathcal{B}) \le 2hp \log(hp) = O(hp \log(hp))$.

  69. Rigorous lower bounds
