Curse of Dimensionality in Pivot-based Indexes

Ilya Volnyansky, Vladimir Pestov
Department of Mathematics and Statistics, University of Ottawa
Ottawa, Ontario, Canada

SISAP 2009, Prague, 29/09/2009
Outline

1. Overview
   - The Setting for Similarity Search
   - Previous Work
2. Our Work
   - Framework
   - Concentration of Measure
   - Statistical Learning Theory
   - Asymptotic Bounds
Similarity Workloads

- Universe Ω: a metric space with metric ρ.
- Dataset X ⊂ Ω, always finite, with metric ρ.
- A range query: given q ∈ Ω and r > 0, find {x ∈ X | ρ(x, q) < r}.

For analysis purposes, we add:
- A measure μ on Ω.
- Treat X as an i.i.d. sample ∼ μ of size n.
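A range query can always be answered by a linear scan costing n distance computations; indexes try to beat this baseline. Below is a minimal sketch (mine, not from the slides), using a Euclidean ρ purely for concreteness:

```python
# A minimal sketch of a range query by linear scan (not from the slides);
# the metric rho is taken Euclidean here purely for concreteness.
import math

def rho(a, b):
    # Euclidean distance; any metric satisfying the triangle inequality works.
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))

def range_query(X, q, r):
    # Return all points of the dataset X within distance r of the query q.
    return [x for x in X if rho(x, q) < r]

X = [(0.0, 0.0), (1.0, 1.0), (3.0, 4.0)]
print(range_query(X, q=(0.5, 0.5), r=1.0))  # [(0.0, 0.0), (1.0, 1.0)]
```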
Curse of dimensionality conjecture

All indexing schemes suffer from the curse of dimensionality:

Conjecture. If d = ω(log n) and d = n^{o(1)}, any sequence of indexes built on a sequence of datasets X_d ⊂ Σ^d allowing similarity search in time polynomial in d must use n^{ω(1)} space.

[Handbook of Discrete and Computational Geometry]

The Hamming cube Σ^d of dimension d: the set of all binary sequences of length d.
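The metric on Σ^d is the Hamming distance, the number of coordinates in which two binary strings differ (a standard fact, not stated on the slide). For reference:

```python
# The Hamming cube metric (standard background, not from the slides):
# the distance between two binary strings of length d is the number of
# coordinates in which they differ.
def hamming(x, y):
    assert len(x) == len(y)
    return sum(a != b for a, b in zip(x, y))

print(hamming("10110", "11100"))  # 2
```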
Fixed dimension

Examples of previous work:
- Let n, the size of X, vary, but keep the space (Ω, ρ, μ) fixed.
- This is the usual "asymptotic" analysis in the CS sense.
- It does not investigate the curse of dimensionality.
Fixed n

- Let the dimension, and hence (Ω, ρ, μ), vary, but keep the size n of X the same, e.g. [Weber 98], [Chávez 01].
- Too small a sample size n makes it easier to index spaces of high dimension d.
- When both d and n vary, the math is more challenging.
Points to keep in mind

- The distinction between X and Ω.
- Both d and n grow.
- Need to make assumptions about the sequence of Ω's (?)
- Need to make assumptions about the indexes.
Gameplan

1. Pick an index type to analyze.
2. Pick a cost model.
3. The sequence of Ω's exhibits concentration of measure; the "intrinsic dimension" grows.
4. Statistical Learning Theory: linking properties of the Ω's and properties of the X's.
5. Conclusion: if all conditions are met, the Curse of Dimensionality will take place.
Main Result

From a sequence of metric spaces with measure (Ω_d, ρ_d, μ_d), d = 1, 2, 3, ..., take i.i.d. samples (datasets) X_d ∼ μ_d. Assume:
- (Ω_d, ρ_d, μ_d) display the concentration of measure;
- the VC dimension of closed balls in (Ω_d, ρ_d) is O(d);
- we build a pivot index using k pivots, where k = o(n_d / d);
- the sample size n_d satisfies d = ω(log n_d) and d = n_d^{o(1)};
- we perform queries whose radius is the nearest-neighbour distance.

Then: fix arbitrarily small ε, η > 0. There exists D such that for all d ⩾ D, the probability that at least half the queries on dataset X_d take less than (1 − ε)n_d time is less than η.
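In symbols, the conclusion can be paraphrased as follows (my rendering, not verbatim from the slides; cost(q) stands for the number of ρ-computations the index spends on query q, and the outer probability is over the random sample X_d):

```latex
% A paraphrase of the theorem's conclusion (mine, not verbatim from the
% slides); cost(q) is the number of rho-computations spent on query q.
\forall \varepsilon, \eta > 0 \ \exists D \ \forall d \ge D:
\qquad
\Pr\bigl[\, \mu_d\{\, q : \mathrm{cost}(q) < (1 - \varepsilon)\, n_d \,\} \ge \tfrac{1}{2} \,\bigr] \;<\; \eta .
```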
Pivot indexing scheme

Build an index:
1. Pick {p_1, ..., p_k} from X.
2. Calculate the n × k array of distances ρ(x, p_i), 1 ⩽ i ⩽ k, x ∈ X.

Perform a query given q and r:
1. Compute ρ_k(q, x) := sup_{1⩽i⩽k} |ρ(q, p_i) − ρ(x, p_i)|.
2. Since ρ(q, x) ⩾ ρ_k(q, x), there is no need to compute ρ(q, x) if ρ_k(q, x) > r.
3. Compute ρ(q, x) otherwise.
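A minimal sketch of this scheme (my code; class and method names are mine, and the pivots are drawn at random for simplicity — the slide does not say how to pick them):

```python
# A minimal sketch of the pivot indexing scheme above; names are mine.
import math
import random

def rho(a, b):
    # Euclidean metric, purely for concreteness; any metric works.
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))

class PivotIndex:
    def __init__(self, X, k):
        self.X = X
        self.pivots = random.sample(X, k)               # build step 1
        # build step 2: the n x k array of distances rho(x, p_i)
        self.table = [[rho(x, p) for p in self.pivots] for x in X]

    def query(self, q, r):
        q_dists = [rho(q, p) for p in self.pivots]      # k distances to the pivots
        result = []
        for x, row in zip(self.X, self.table):
            # rho_k(q, x) = max_i |rho(q, p_i) - rho(x, p_i)| <= rho(q, x)
            rho_k = max(abs(qd - xd) for qd, xd in zip(q_dists, row))
            if rho_k > r:
                continue                                # discarded without computing rho(q, x)
            if rho(q, x) < r:                           # one true distance computation
                result.append(x)
        return result
```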
The cost model

- Only one operation has a cost: computing ρ(q, x).
- Computing ρ_k(q, x) costs k.
- Let C_{q,r,p_1,...,p_k} denote all the discarded points in X: {x ∈ X | ρ_k(q, x) > r}.
- Let n = |X|. Total cost: k + n − |C_{q,r,p_1,...,p_k}|.
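Under this model the query in the sketch above costs exactly k + n − |C|: k distances from q to the pivots, plus one true distance per point that ρ_k fails to discard. A hypothetical instrumentation makes this explicit:

```python
# A hypothetical instrumentation (mine) of the PivotIndex sketch above,
# counting only true rho computations, as the cost model prescribes.
def query_cost(index, q, r):
    cost = len(index.pivots)                  # k: distances from q to the pivots
    q_dists = [rho(q, p) for p in index.pivots]
    discarded = 0                             # |C_{q,r,p_1,...,p_k}|
    for row in index.table:
        rho_k = max(abs(qd - xd) for qd, xd in zip(q_dists, row))
        if rho_k > r:
            discarded += 1
    cost += len(index.X) - discarded          # one rho(q, x) per surviving point
    return cost                               # k + n - |C|
```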
Concentration of Measure

A function f : Ω → ℝ is 1-Lipschitz if

  |f(ω_1) − f(ω_2)| ⩽ ρ(ω_1, ω_2)  for all ω_1, ω_2 ∈ Ω.

Examples: f(x) = x, f(x) = (1/2)x, f(x) = √(x² + 1).

Its median is a number M such that μ{ω | f(ω) ⩽ M} ⩾ 1/2 and μ{ω | f(ω) ⩾ M} ⩾ 1/2.
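A small numerical illustration (mine, not from the talk): on the Hamming cube with the uniform measure, the 1-Lipschitz function f(x) = Hamming weight of x has median d/2, and the fraction of points deviating from it by more than εd shrinks rapidly as d grows:

```python
# Numerical illustration (mine, not from the talk) of concentration on the
# Hamming cube: the 1-Lipschitz function f(x) = Hamming weight concentrates
# around its median d/2 under the uniform measure.
import random

def deviation_fraction(d, eps=0.05, samples=20000):
    # Fraction of uniform points of the d-cube whose Hamming weight
    # deviates from the median d/2 by more than eps*d.
    far = 0
    for _ in range(samples):
        weight = sum(random.getrandbits(1) for _ in range(d))
        if abs(weight - d / 2) > eps * d:
            far += 1
    return far / samples

for d in (10, 100, 1000):
    print(d, deviation_fraction(d))  # the fraction drops steeply with d
```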