Sublinear-time approximation of the cost of a metric k-nearest neighbor graph
Artur Czumaj
DIMAP (Centre for Discrete Mathematics and its Applications), Department of Computer Science, University of Warwick
Joint work with Christian Sohler
k-nearest neighbor graph
The k-nearest neighbor (k-NN) graph is a basic data structure with applications in:
• spectral clustering (and unsupervised learning)
• non-linear dimensionality reduction
• density estimation
• and many more …
Computing a k-NN graph takes Ω(n²) time in the worst case; many data structures for approximate nearest neighbor queries are known for specific distance measures, as well as heuristics.
k-NN graph
A k-NN graph for a set of objects with a distance measure is a directed graph such that:
• every vertex corresponds to exactly one object
• every vertex has exactly k outgoing edges
• the outgoing edges point to its k closest neighbors
Every point v connects to its k nearest neighbors.
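A minimal brute-force sketch of this definition (the function names are mine, illustrative assumptions): for each vertex, keep the k closest other points as the heads of its outgoing edges, at the cost of Θ(n²) distance evaluations.

```python
import heapq

def knn_graph(points, dist, k):
    # Brute force: for every vertex, evaluate the distance to all n-1
    # other points and keep the k smallest, i.e. Theta(n^2) distance
    # evaluations in total, matching the worst-case bound above.
    n = len(points)
    graph = {}
    for v in range(n):
        others = (w for w in range(n) if w != v)
        graph[v] = heapq.nsmallest(k, others,
                                   key=lambda w: dist(points[v], points[w]))
    return graph
```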
Access to the input: distance oracle model
Distance Oracle Model for Metric Spaces:
• Input: an n-point metric space (X, d); we will often view the metric space as a complete edge-weighted graph G = (V, E) with V = {1, …, n} and weights satisfying the triangle inequality
• Query: return the distance between any v and w
• running time ~ number of queries
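A sketch of the model as an interface, with a query counter standing in for running time; the class name and the representation (a full distance matrix behind the oracle) are illustrative assumptions:

```python
class DistanceOracle:
    """Oracle access to an n-point metric on V = {0, ..., n-1}: the only
    way to learn the metric is the dist() query, and the cost measure of
    an algorithm is the number of such queries it makes."""

    def __init__(self, dist_matrix):
        self._d = dist_matrix      # symmetric, triangle inequality assumed
        self.queries = 0           # query counter ~ running time

    def dist(self, v, w):
        self.queries += 1
        return self._d[v][w]
```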
k-NN graph cost approximation
Problem: Given oracle access to an n-point metric space, compute a (1 + ε)-approximation to the cost (denoted cost(X)) of the k-NN graph.
cost(X) denotes the sum of the edge weights.
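To make the target quantity concrete, here is a sketch that computes cost(X) exactly through the oracle, using ~n² queries; this is the baseline the sublinear algorithms have to beat:

```python
def knn_cost_exact(oracle, n, k):
    # cost(X) = sum over all v of the distances from v to its k nearest
    # neighbors, i.e. the total weight of the k-NN graph's edges.
    total = 0.0
    for v in range(n):
        dists = sorted(oracle.dist(v, w) for w in range(n) if w != v)
        total += sum(dists[:k])
    return total
```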
(Simple) lower bounds
Observation: Computing a k-NN graph requires Ω(n²) queries.
Proof: Consider a (1,2)-metric with all distances 2 except for a single random pair with distance 1.
• we must find this edge to compute the k-NN graph
• finding a random edge requires Ω(n²) time
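A sketch of this hard instance as a distance matrix (it can be wrapped in the oracle above); any values in {1, 2} automatically satisfy the triangle inequality:

```python
import random

def needle_instance(n):
    # All distances 2 except one random pair at distance 1.  An exact
    # k-NN algorithm must locate the distance-1 pair, and before querying
    # it, every one of the ~n^2/2 unqueried pairs is equally likely to be it.
    d = [[2.0] * n for _ in range(n)]
    for i in range(n):
        d[i][i] = 0.0
    a, b = random.sample(range(n), 2)
    d[a][b] = d[b][a] = 1.0
    return d
```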
(Simple) lower bounds
Observation: Finding a (2 − ε)-approximate k-NN graph requires Ω(n²/log n) queries.
Proof:
• take a random (1,2)-metric where each distance is chosen independently at random to be 1 with probability Θ(k log n / n)
• whp. every vertex has ≥ k neighbors at distance 1
• a (2 − ε)-approximate k-NN graph must contain Ω(nk) of the distance-1 edges
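A sketch of this second hard instance; the constant c in the probability is an arbitrary illustrative choice, not taken from the slides:

```python
import math
import random

def random_12_metric(n, k, c=3.0):
    # Each distance is 1 independently with probability ~ c*k*log(n)/n,
    # otherwise 2.  Whp. every vertex then has >= k distance-1 neighbors,
    # so any (2 - eps)-approximate k-NN graph must contain Omega(nk)
    # of the distance-1 edges.
    p = min(1.0, c * k * math.log(n) / n)
    d = [[0.0] * n for _ in range(n)]
    for v in range(n):
        for w in range(v + 1, n):
            d[v][w] = d[w][v] = 1.0 if random.random() < p else 2.0
    return d
```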
(Simple) lower bounds
Observation: Approximating the cost of the 1-NN graph requires Ω(n²) queries.
Proof:
• via the lower bound for perfect matching [Badoiu, C., Indyk, Sohler, ICALP'05]
• pick a random perfect matching and set all matching distances to 0 and all other distances to 1
• distinguish this from an instance where one matching edge has value 1
• finding this edge requires Ω(n²) queries
(Simple) lower bounds
All distances are 0 or 1:
• all matching edges, except possibly a single one, have weight 0
• one random matching edge may have weight 1
• all non-matching distances are 1
The 1-NN cost is therefore either 0 or positive, depending on a single edge, and finding that edge requires Ω(n²) queries.
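A sketch of the two matching instances; as on the slide, distance 0 between distinct matched points makes this a pseudometric:

```python
import random

def matching_instance(n, planted=False):
    # Random perfect matching at distance 0, all other pairs at distance 1.
    # With planted=True a single random matching edge is raised to 1, which
    # changes the 1-NN cost, yet locating it needs Omega(n^2) queries.
    assert n % 2 == 0
    perm = list(range(n))
    random.shuffle(perm)
    d = [[1.0] * n for _ in range(n)]
    for i in range(n):
        d[i][i] = 0.0
    pairs = [(perm[2 * i], perm[2 * i + 1]) for i in range(n // 2)]
    for a, b in pairs:
        d[a][b] = d[b][a] = 0.0
    if planted:
        a, b = random.choice(pairs)
        d[a][b] = d[b][a] = 1.0
    return d
```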
Approximating cost(X)
The "simple lower bounds" show that:
• finding a low-cost k-NN graph is hopeless
• estimating cost(X) is hopeless, at least for small k
Can we do anything?
Approximating cost(X)
The "simple lower bounds" show that:
• finding a low-cost k-NN graph is hopeless
• estimating cost(X) is hopeless, at least for small k
A similar situation has been known for some other problems: MST cost, degree estimation, etc.
• Chazelle et al.: the MST cost of a graph can be (1 + ε)-approximated in poly(dW/ε) time, where d is the maximum degree and W the maximum edge weight
• C., Sohler: the MST cost in the metric case can be approximated in Õ(n) time
Approximating cost(X): New results
Theorem 1: A (1 + ε)-approximation of cost(X) can be computed with Õ(n²/(ε²k)) queries.
Theorem 2: Ω(n²/k) queries are necessary to approximate cost(X) within any constant factor.
Approximating cost(X): New results
Theorem 1: A (1 + ε)-approximation of cost(X) can be computed with Õ(n²/(ε²k)) queries.
Theorem 2: Ω(n²/k) queries are necessary to approximate cost(X) within any constant factor.
This is very bad for small k and very good for large k. What can we do for small k?
Approximating cost(X): New results
The lower bound for small k holds when the instances are very "spread out" and "disjoint".
Can we get a faster algorithm when we allow the approximation guarantee to depend on the MST cost?
Approximating cost(X): New results
Theorem 3: With Õ_ε(n·k^{3/2}) queries one can approximate cost(X) with error ε·(cost(X) + MST(X)).
Corollary 4: With Õ_ε(min{n·k^{3/2}, n²/k}) queries one can approximate cost(X) with error ε·(cost(X) + MST(X)).
Theorem 5: Any algorithm that approximates cost(X) with error ε·cost(X) + MST(X) requires Ω(min{n·k^{3/2}/ε, n²/k}) queries.
Approximating cost(X): New results
We have tight bounds for the estimation of cost(X).
When we want a (1 + ε)-approximation:
• Θ̃_ε(n²/k) queries are sufficient and necessary
When happy with an ε·(cost(X) + MST(X)) additive error:
• Θ̃_ε(min{n·k^{3/2}, n²/k}) queries are sufficient/necessary
• it's sublinear for every k, always at most Õ_ε(n^{8/5})
Approximating cost(X): New results
We have tight bounds for the estimation of cost(X).
When we want a (1 + ε)-approximation:
• Θ̃_ε(n²/k) queries are sufficient and necessary
When happy with an ε·(cost(X) + MST(X)) additive error:
• Θ̃_ε(min{n·k^{3/2}, n²/k}) queries are sufficient/necessary
Techniques:
• upper bounds: clever random sampling
• lower bounds: analysis of some clustering inputs (the more complex part)
Approximating cost(X): New results
We have tight bounds for the estimation of cost(X).
When we want a (1 + ε)-approximation:
• Θ̃_ε(n²/k) queries are sufficient and necessary
Efficient for large k. Relies on random sampling of close neighbors and far neighbors in the k-NN graph.
Upper bound for (1 + ε)-approximation
Two "hard" instances:
• a cluster of n − 1 points and a single point far away
• a cluster of n − (k + 1) points and k + 1 points far away from the cluster but close to each other
Our algorithm must be able to distinguish between these two instances (see the sketch below).
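To see why these instances are hard, here is a naive vertex-sampling estimator, emphatically not the paper's algorithm: on the first instance essentially all of cost(X) sits on the single far-away point, which a small sample almost surely misses; this motivates the short/long edge decomposition on the next slides.

```python
import random

def naive_cost_estimate(oracle, n, k, m):
    # Sample m vertices, compute each sampled vertex's exact k-NN
    # contribution (n-1 queries each), and rescale by n/m.  Unbiased,
    # but it fails to concentrate when one rare vertex carries the cost.
    total = 0.0
    for v in random.sample(range(n), m):
        dists = sorted(oracle.dist(v, w) for w in range(n) if w != v)
        total += sum(dists[:k])
    return total * n / m
```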
Upper bound for (1 + ε)-approximation
For each point v, approximate the distance r_v to its (k/2)-th nearest neighbor: O(n/k) queries per point, Õ(n²/k) queries in total.
• short edges in the k-NN graph: edges (v, w) with d(v, w) ≤ 10·r_v
• long edges in the k-NN graph: all other edges
We separately estimate the total length of the short edges and the total length of the long edges.
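A sketch of the r_v estimation by uniform sampling: the m-th smallest distance among s random points has expected rank ~m·n/s among all of v's distances. The sample size and the rank used below are illustrative choices, not necessarily the paper's exact parameters.

```python
import math
import random

def approx_rv(oracle, n, v, k, c=4.0):
    # Estimate r_v, the distance from v to (roughly) its (k/2)-th
    # nearest neighbor, using O((n/k) log n) oracle queries.
    s = min(n - 1, int(c * (n / k) * math.log(n)) + 1)
    sample = random.sample([w for w in range(n) if w != v], s)
    dists = sorted(oracle.dist(v, w) for w in sample)
    m = max(1, (k * s) // (2 * n))   # sampled rank corresponding to ~k/2
    return dists[m - 1]
```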
Upper bound for (1 + ε)-approximation
• short edges in the k-NN graph: edges (v, w) with d(v, w) ≤ 10·r_v
• long edges in the k-NN graph: all other edges
We separately estimate the total length of the short edges and the total length of the long edges by random sampling methods.
Summing the estimates for the short and the long edges gives a (1 + ε)-approximation of cost(X).
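The classification itself is a single oracle query per edge once r_v is available, and the final output is just the sum of the two estimates; the sampling schemes behind the two estimates are not detailed on the slides.

```python
def is_short(oracle, v, w, r_v):
    # Classification from the slide: (v, w) is short iff d(v, w) <= 10 * r_v.
    return oracle.dist(v, w) <= 10 * r_v

def combine(short_total_estimate, long_total_estimate):
    # The final (1 + eps)-approximation of cost(X): sum of the two estimates.
    return short_total_estimate + long_total_estimate
```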
Lower bound for (1 + ε)-approximation
Consider two problem instances, with inner-cluster distances ~0 and inter-cluster distances ~1:
• X: n/(k+1) clusters of size k + 1 each ⇒ cost(X) ~ 0
• Y: n/(k+1) − 1 clusters of size k + 1 each, one cluster of size k, and one cluster of size 1 ⇒ cost(Y) ≫ 0
We prove that Ω(n²/k) queries are required to distinguish between these two problem instances.
Lower bound for (1 + ε)-approximation
Clusters of size k + 1 each: cost(X) ~ 0.
Lower bound for (1 + ε)-approximation
Clusters of size k + 1 each: cost(X) ~ 0.
In the other instance: remove a random point from its cluster ⇒ cost(Y) ≫ 0.
Lower bound for (1 + ε)-approximation
Clusters of size k + 1 each: cost(X) ~ 0. In the other instance a random point is removed from its cluster, so cost(Y) ≫ 0.
To find that single point one needs Ω(n²/k) queries:
• Ω(n) samples to hit it in the first place
• Ω(n/k) further samples per candidate to detect that it has no close neighbors
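A sketch of the two lower-bound instances X and Y, with a small eps standing in for the ~0 inner-cluster distances; they are returned as distance functions that could back the oracle above:

```python
import random

def cluster_instances(n, k, eps=1e-9):
    # Instance X: n/(k+1) clusters of size k+1, inner-cluster distance ~0,
    # inter-cluster distance 1 -> cost(X) ~ 0.  Instance Y: one random
    # point is pulled out into a singleton, so all k of its k-NN edges
    # (and one edge of each former cluster-mate) cost ~1 -> cost(Y) >> 0.
    assert n % (k + 1) == 0
    x_label = [v // (k + 1) for v in range(n)]
    y_label = list(x_label)
    y_label[random.randrange(n)] = -1        # the removed point
    def metric(label):
        return lambda v, w: (0.0 if v == w
                             else eps if label[v] == label[w] else 1.0)
    return metric(x_label), metric(y_label)
```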
Approximating with error ε·(cost(X) + MST(X))
Simplifying assumptions:
• all distances are of the form (1 + ε)^i
• all distances are between 1 and poly(n)
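Both assumptions are essentially without loss of generality; for instance, rounding every distance down to a power of (1 + ε), as sketched below, changes cost(X) by at most a (1 + ε) factor:

```python
import math

def round_down_to_power(d, eps):
    # Snap a distance to the nearest power of (1 + eps) from below;
    # each distance, and hence cost(X), changes by at most a factor (1 + eps).
    i = math.floor(math.log(d) / math.log(1.0 + eps))
    return (1.0 + eps) ** i
```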