Sublinear-time approximation of the cost of a metric k-nearest neighbor graph
Artur Czumaj
DIMAP (Centre for Discrete Mathematics and its Applications)
Department of Computer Science, University of Warwick
Joint work with Christian Sohler



2. k-nearest neighbor graph

The k-nearest neighbor (k-NN) graph is a basic data structure with applications in
– spectral clustering (and unsupervised learning)
– non-linear dimensionality reduction
– density estimation
– and many more …
Computing a k-NN graph takes Ω(n^2) time in the worst case
– many data structures for approximate nearest neighbor queries are known for specific distance measures
– heuristics

3. k-NN graph

A k-NN graph for a set of objects with a distance measure is a directed graph such that
• every vertex corresponds to exactly one object
• every vertex has exactly k outgoing edges
• the outgoing edges point to its k closest neighbors
Every point u connects to its k nearest neighbors
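As an illustration (this code is mine, not from the talk), the definition above can be realized by a brute-force construction over a distance oracle; the names knn_graph and dist are illustrative:

```python
def knn_graph(points, k, dist):
    """Build the k-NN graph by brute force: a dict mapping each vertex u
    to the list of its k closest other vertices.  Uses n*(n-1) oracle
    calls, matching the Omega(n^2) worst case mentioned above."""
    graph = {}
    for u in range(len(points)):
        others = [v for v in range(len(points)) if v != u]
        others.sort(key=lambda v: dist(points[u], points[v]))
        graph[u] = others[:k]  # k outgoing edges to the k nearest neighbors
    return graph

# Example on the real line with d(a, b) = |a - b|
g = knn_graph([0.0, 1.0, 1.5, 10.0], 2, lambda a, b: abs(a - b))
```

Note that the graph is directed: point 10.0 has outgoing edges to the cluster {0.0, 1.0, 1.5}, but no point there points back to it.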

4. Access to the input: Distance oracle model

Distance Oracle Model for Metric Spaces:
• Input: n-point metric space (X, d)
– we will often view the metric space as a complete edge-weighted graph G = (V, E) with V = {1, …, n} and weights satisfying the triangle inequality
• Query: return the distance between any u and v
Running time ~ number of queries
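A minimal sketch of the model (the class and method names are mine, not the talk's): an oracle that answers distance queries and counts them, since running time is measured by the number of queries.

```python
class DistanceOracle:
    """Oracle access to a metric space (X, d): the algorithm sees only
    answers to distance queries, and its cost is the query count."""
    def __init__(self, d):
        self._d = d          # the hidden metric
        self.queries = 0     # number of queries made so far

    def dist(self, u, v):
        self.queries += 1
        return self._d(u, v)
```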

5. k-NN graph cost approximation

Problem: Given oracle access to an n-point metric space, compute a (1+ε)-approximation to the cost of the k-NN graph, denoted cost(X)
cost(X) is the sum of the edge weights of the k-NN graph
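For concreteness, cost(X) can be computed exactly with Θ(n^2) queries as follows; this brute-force reference implementation is mine, not the talk's:

```python
def knn_cost(points, k, dist):
    """cost(X) = sum of the edge weights of the k-NN graph: for each
    point, add up the distances to its k nearest neighbors."""
    total = 0.0
    for u in range(len(points)):
        ds = sorted(dist(points[u], points[v])
                    for v in range(len(points)) if v != u)
        total += sum(ds[:k])   # u's k outgoing edges
    return total
```

For example, on the line metric {0, 1, 3} with k = 1, the three nearest-neighbor distances are 1, 1, and 2, so cost(X) = 4.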

6. (Simple) lower bounds

Observation: Computing a k-NN graph requires Ω(n^2) queries
Proof: Consider a (1,2)-metric with all distances 2 except for a single random pair with distance 1
• we must find this edge to compute the k-NN graph
• finding a random edge requires Ω(n^2) time

7. (Simple) lower bounds

Observation: Finding a (2−ε)-approximate k-NN graph requires Ω(n^2/log n) queries
Proof:
• Take a random (1,2)-metric where each distance is chosen independently at random to be 1 with probability Θ(k log n / n)
• Whp. every vertex has ≥ k neighbors at distance 1
• a (2−ε)-approximate k-NN graph contains Ω(nk) of the distance-1 edges
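The lower-bound instance can be sketched as follows (my construction; the constant 4 in the probability is an arbitrary illustrative choice for the Θ(·)):

```python
import math
import random

def random_12_metric(n, k, seed=0):
    """Random (1,2)-metric: each pairwise distance is 1 independently
    with probability ~ k*log(n)/n, else 2.  Any symmetric function with
    values in {1,2} satisfies the triangle inequality, since 1 + 1 >= 2."""
    rng = random.Random(seed)
    p = min(1.0, 4.0 * k * math.log(n) / n)
    d = {}
    for u in range(n):
        for v in range(u + 1, n):
            d[(u, v)] = 1 if rng.random() < p else 2
    return lambda u, v: 0 if u == v else d[(min(u, v), max(u, v))]
```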

8. (Simple) lower bounds

Observation: Approximating the cost of the 1-NN graph requires Ω(n^2) queries
Proof:
• Via the lower bound for perfect matching [Badoiu, Czumaj, Indyk, Sohler, ICALP'05]
• Pick a random perfect matching and set all matching distances to 0 and all other distances to 1
• Distinguish this from an instance where one matching edge has value 1
• Finding this edge requires Ω(n^2) queries

9. (Simple) lower bounds

[Figure: distances are 0 or 1. All matching edges except possibly a single one have weight 0; one random edge may have weight 1; all non-matching distances are 1]
The 1-NN cost is either 0 or 1, depending on a single edge
• Pick a random perfect matching and set all matching distances to 0 and all other distances to 1
• Distinguish this from an instance where one matching edge has value 1
• Finding this edge requires Ω(n^2) queries


11. Approximating cost(X)

The "simple lower bounds" show that
• finding a low-cost k-NN graph is hopeless
• estimating cost(X) is hopeless – at least for small k
Can we do anything?

12. Approximating cost(X)

The "simple lower bounds" show that
• finding a low-cost k-NN graph is hopeless
• estimating cost(X) is hopeless – at least for small k
A similar situation has been known for some other problems: MST, degree estimation, etc.
Chazelle et al.: the MST cost in graphs can be (1+ε)-approximated in poly(dW/ε) time (d = max degree, W = max edge weight)
Czumaj, Sohler: the MST cost in the metric case can be approximated in Õ(n) time

13. Approximating cost(X): New results

Theorem 1: A (1+ε)-approximation of cost(X) with Õ(n^2/(kε^2)) queries
Theorem 2: Ω(n^2/k) queries are necessary to approximate cost(X) within any constant factor

14. Approximating cost(X): New results

Theorem 1: A (1+ε)-approximation of cost(X) with Õ(n^2/(kε^2)) queries
Theorem 2: Ω(n^2/k) queries are necessary to approximate cost(X) within any constant factor
This is very bad for small k, and very good for large k. What can we do for small k?

15. Approximating cost(X): New results

The lower bound for small k holds when the instances are very "spread-out" and "disjoint"
Can we get a faster algorithm when we allow the approximation guarantee to depend on the MST cost?

16. Approximating cost(X): New results

Theorem 3: With Õ_ε(nk^{3/2}) queries one can approximate cost(X) with error ε·(cost(X) + MST(X))
Corollary 4: With Õ_ε(min{nk^{3/2}, n^2/k}) queries one can approximate cost(X) with error ε·(cost(X) + MST(X))
Theorem 5: Any algorithm that approximates cost(X) with error ε·(cost(X) + MST(X)) requires Ω(min{nk^{3/2}/ε, n^2/k}) queries

17. Approximating cost(X): New results

We have tight bounds for estimating cost(X)
When we want a (1+ε)-approximation:
• Θ̃_ε(n^2/k) queries are sufficient and necessary
When happy with an ε·(cost(X) + MST(X)) additive error:
• Θ̃_ε(min{nk^{3/2}, n^2/k}) queries are sufficient and necessary
• this is sublinear for every k, and always at most Õ_ε(n^{8/5})

18. Approximating cost(X): New results

We have tight bounds for estimating cost(X)
When we want a (1+ε)-approximation:
• Θ̃_ε(n^2/k) queries are sufficient and necessary
When happy with an ε·(cost(X) + MST(X)) additive error:
• Θ̃_ε(min{nk^{3/2}, n^2/k}) queries are sufficient and necessary
Techniques:
• Upper bounds: clever random sampling
• Lower bounds: analysis of certain clustering inputs (the more complex part)

19. Approximating cost(X): New results

We have tight bounds for estimating cost(X)
When we want a (1+ε)-approximation:
• Θ̃_ε(n^2/k) queries are sufficient and necessary
Efficient for large k
Relies on random sampling of close neighbors and far neighbors in the k-NN graph

20. Upper bound for (1+ε)-approximation

Two "hard" instances:
• A cluster of n−1 points and a single point far away
• A cluster of n−(k+1) points, and k+1 points far away from it but close to each other
Our algorithm must be able to distinguish between these two instances

21. Upper bound for (1+ε)-approximation

Each point u approximates the distance m_u to its (k/2)-th nearest neighbor
Õ(n/k) queries per point; Õ(n^2/k) queries in total
• Short edges in the k-NN graph: (u,v) s.t. d(u,v) ≤ 10·m_u
• Long edges in the k-NN graph: all other edges
We separately estimate the total length of the short edges and the total length of the long edges
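One plausible way to realize the per-point step with Õ(n/k) queries (a hedged sketch of mine; the slides do not spell out the procedure, and the function name and constant c are illustrative): sample O((n/k)·log n) random points and take the minimum sampled distance. Whp the sample contains one of u's k/2 nearest neighbors, so the estimate lies between u's 1st and (k/2)-th nearest-neighbor distances.

```python
import math
import random

def approx_half_k_nn_dist(u, n, k, dist, c=4.0, seed=0):
    """Estimate the distance from u to (roughly) its (k/2)-th nearest
    neighbor.  Samples s = O((n/k) log n) points; the minimum sampled
    distance is always >= the 1st-NN distance, and whp <= the (k/2)-th
    NN distance, since whp some sample lands among those neighbors."""
    rng = random.Random(seed)
    candidates = [v for v in range(n) if v != u]
    s = min(len(candidates), int(c * (n / k) * math.log(n)) + 1)
    sample = rng.sample(candidates, s)
    return min(dist(u, v) for v in sample)
```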

22. Upper bound for (1+ε)-approximation

Short edges in the k-NN graph: (u,v) s.t. d(u,v) ≤ 10·m_u
Long edges in the k-NN graph: all other edges
• We separately estimate the total length of the short edges and of the long edges by random sampling methods
Summing the two estimates gives a (1+ε)-approximation of cost(X)

23. Lower bound for (1+ε)-approximation

Consider two problem instances (inner-cluster distances ~0, inter-cluster distances ~1):
• n/(k+1) clusters of size k+1 each – cost(X) ~ 0
• n/(k+1) − 1 clusters of size k+1 each, plus one cluster of size k and one cluster of size 1 – cost(Y) ≫ 0
We prove that Ω(n^2/k) queries are required to distinguish between these two problem instances
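The two instances can be made concrete as follows (an illustrative construction of mine: 0.001 stands in for the "~0" inner-cluster distance, and the isolated point is fixed rather than random):

```python
def clustered_instance(n, k, isolate_point=False):
    """n/(k+1) clusters; inner-cluster distance ~0, inter-cluster
    distance 1.  With isolate_point=True, one point is moved into its
    own singleton cluster, so all of its k-NN edges have length ~1 and
    the k-NN cost jumps from ~0 to >> 0."""
    assert n % (k + 1) == 0
    cluster = [i // (k + 1) for i in range(n)]
    if isolate_point:
        cluster[0] = max(cluster) + 1  # the point the algorithm must find

    def dist(u, v):
        if u == v:
            return 0.0
        return 0.001 if cluster[u] == cluster[v] else 1.0

    return dist
```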

24. Lower bound for (1+ε)-approximation

[Figure: n/(k+1) clusters of size k+1 each; cost(X) ~ 0]

25. Lower bound for (1+ε)-approximation

[Figure: clusters of size k+1 each; cost(X) ~ 0]
In the other instance: remove a random point from its cluster; cost(Y) ≫ 0

26. Lower bound for (1+ε)-approximation

[Figure: clusters of size k+1 each; cost(X) ~ 0, cost(Y) ≫ 0]
In the other instance: remove a random point from its cluster
To find that single point one needs Ω(n^2/k) queries:
• O(n) samples to hit it first
• O(n/k) further samples to detect that it has no close neighbors


28. Approximating with error ε·(cost(X) + MST(X))

Simplifying assumptions:
• All distances are of the form (1+ε)^i
• All distances are between 1 and poly(n)
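The first assumption reflects the standard discretization: rounding every distance up to the nearest power of (1+ε) changes cost(X) by at most a (1+ε) factor. A minimal sketch (mine, assuming distances are already scaled into [1, poly(n)]):

```python
import math

def round_up_distance(d, eps):
    """Round d up to the nearest power of (1+eps); the result is in
    [d, (1+eps)*d], so the total cost grows by at most a (1+eps) factor."""
    if d <= 1.0:
        return 1.0  # assumes distances were scaled to lie in [1, poly(n)]
    i = math.ceil(math.log(d, 1.0 + eps))
    return (1.0 + eps) ** i
```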
