SLIDE 1 Artur Czumaj
DIMAP (Centre for Discrete Mathematics and its Applications), Department of Computer Science
University of Warwick
Joint work with Christian Sohler
Sublinear-time approximation of the cost of a metric k-nearest neighbor graph
SLIDE 2
k-nearest neighbor graph
The k-nearest neighbor (k-NN) graph is a basic data structure with applications in
- spectral clustering (and unsupervised learning)
- non-linear dimensionality reduction
- density estimation
- and many more…
Computing a k-NN graph takes Ω(n²) time in the worst case
- many data structures for approximate nearest neighbor queries are known for specific distance measures
- heuristics
SLIDE 3 k-NN graph
A k-NN graph for a set of objects with a distance measure is a directed graph, such that
- every vertex corresponds to exactly one object
- every vertex has exactly k outgoing edges
- the outgoing edges point to its k closest neighbors
Every point v connects to its k nearest neighbors
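As a concrete reference point, the Θ(n²)-query brute-force construction from this definition can be sketched as follows (illustrative Python, not part of the talk; `dist` stands for the metric's distance function):

```python
def knn_graph(points, k, dist):
    """Brute-force k-NN graph: for every vertex, query all n-1 distances
    and keep the k closest neighbors -- Theta(n^2) distance queries."""
    graph = {}
    for v in points:
        # sort all other points by distance to v and keep the k closest
        others = sorted((w for w in points if w != v), key=lambda w: dist(v, w))
        graph[v] = others[:k]  # k outgoing edges to the k nearest neighbors
    return graph

def knn_cost(graph, dist):
    """cost(X): the sum of the edge weights of the k-NN graph."""
    return sum(dist(v, w) for v, nbrs in graph.items() for w in nbrs)
```

For example, for points 0, 1, 3, 7 on a line with k = 1, every point connects to its single nearest neighbor and the cost is the sum of those four distances.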
SLIDE 4 Access to the input: Distance oracle model
Distance Oracle Model for Metric Spaces:
- Input: n-point metric space (X, d)
- we will often view the metric space as a complete edge-weighted graph G = (V, E) with V = {1, …, n} and weights satisfying the triangle inequality
- Query: return the distance between any v and w
Running time ~ number of queries
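A tiny harness for this model (hypothetical, for experimentation only): wrap the metric so that the running time of an algorithm is exactly its query count.

```python
class DistanceOracle:
    """Oracle access to an n-point metric on V = {0, ..., n-1};
    running time is measured by the number of dist() queries."""

    def __init__(self, n, d):
        self.n = n        # number of points
        self._d = d       # the underlying metric d(v, w)
        self.queries = 0  # queries answered so far

    def dist(self, v, w):
        self.queries += 1
        return self._d(v, w)
```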
SLIDE 5 k-NN graph cost approximation
Problem: Given oracle access to an n-point metric space, compute a (1 + ε)-approximation to the cost of its k-NN graph, denoted cost(X)
cost(X) denotes the sum of the edge weights of the k-NN graph
SLIDE 6 (Simple) lower bounds
Observation: Computing a k-NN graph requires Ω(n²) queries
Proof: Consider a (1,2)-metric with all distances 2 except for a single random pair at distance 1
- we must find this edge to compute the k-NN graph
- finding a random edge requires Ω(n²) time
SLIDE 7 (Simple) lower bounds
Observation: Finding a (2 − ε)-approximate k-NN graph requires Ω(n²/log n) queries
Proof:
- Take a random (1,2)-metric where each distance is chosen independently at random to be 1 with probability Θ(k log n / n)
- Whp. every vertex has ≥ k neighbors at distance 1
- A (2 − ε)-approximate k-NN graph contains Ω(nk) of the 1s
SLIDE 8 (Simple) lower bounds
Observation: Approximating the cost of the 1-NN graph requires Ω(n²) queries
Proof:
- Via the lower bound for perfect matching [Badoiu, Czumaj, Indyk, Sohler, ICALP'05]
- Pick a random perfect matching and set all matching distances to 0 and all other distances to 1
- Distinguish this from an instance where one matching edge has value 1
- Finding this edge requires Ω(n²) queries
SLIDE 9 (Simple) lower bounds
- All matching edges except possibly a single one have weight 0
- One random matching edge may have weight 1
- All non-matching distances are 1
The 1-NN cost is either 0 or positive, depending on a single edge
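The two instances from this argument are easy to generate, which makes the indistinguishability concrete (a sketch; names like `matching_instance` are mine, not the talk's — raising the hidden edge flips the 1-NN cost from 0 to a positive value):

```python
import random

def matching_instance(n, perturb, seed=0):
    """Matching metric from the lower bound: matched pairs at distance 0,
    all other pairs at distance 1.  With perturb=True, one random matching
    edge is raised to 1, which makes the 1-NN cost positive."""
    rng = random.Random(seed)
    pts = list(range(n))
    rng.shuffle(pts)
    pairs = list(zip(pts[::2], pts[1::2]))  # a random perfect matching
    mate = {}
    for a, b in pairs:
        mate[a], mate[b] = b, a
    hidden = set(rng.choice(pairs)) if perturb else set()

    def d(v, w):
        if v == w:
            return 0
        if mate[v] == w and v not in hidden:
            return 0  # an intact matching edge
        return 1
    return d

def one_nn_cost(n, d):
    """Sum over all vertices of the distance to their nearest neighbor."""
    return sum(min(d(v, w) for w in range(n) if w != v) for v in range(n))
```

In the perturbed instance only the two endpoints of the hidden edge lose their 0-distance neighbor, so exactly those two terms of the sum become 1.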
SLIDE 11 Approximating cost(X)
The βsimple lower boundsβ show that
- finding a low-cost k-NN graph is hopeless
- estimating cost(X) is hopeless
  - at least for small k
Can we do anything?
SLIDE 12 Approximating cost(X)
The "simple lower bounds" show that
- finding a low-cost k-NN graph is hopeless
- estimating cost(X) is hopeless
  - at least for small k
A similar situation has been known for some other problems: MST, degree estimation, etc.
- Chazelle, Rubinfeld, Trevisan: the MST cost of a graph can be (1 + ε)-approximated in poly(dW/ε) time (d = max degree, W = max weight)
- Czumaj, Sohler: the MST cost in the metric case can be (1 + ε)-approximated in Õ(n) time
SLIDE 13
Approximating cost(X): New results
Theorem 1: A (1 + ε)-approximation of cost(X) can be computed with Õ(n²/(ε²k)) queries
Theorem 2: Ω(n²/k) queries are necessary to approximate cost(X) within any constant factor
SLIDE 14
Approximating cost(X): New results
Theorem 1: A (1 + ε)-approximation of cost(X) can be computed with Õ(n²/(ε²k)) queries
Theorem 2: Ω(n²/k) queries are necessary to approximate cost(X) within any constant factor
This is very bad for small k and very good for large k. What can we do for small k?
SLIDE 15
Approximating cost(X): New results
The lower bound for small k holds when the instances are very "spread out" and "disjoint"
Can we get a faster algorithm when we allow the approximation guarantee to depend on the MST cost?
SLIDE 16
Approximating cost(X): New results
Theorem 3: With Õ_ε(n·k^{3/2}) queries one can approximate cost(X) with error ε·(cost(X) + MST(X))
Corollary 4: With Õ_ε(min{n·k^{3/2}, n²/k}) queries one can approximate cost(X) with error ε·(cost(X) + MST(X))
Theorem 5: Any algorithm that approximates cost(X) with error ε·(cost(X) + MST(X)) requires Ω(min{n·k^{3/2}/ε, n²/k}) queries
SLIDE 17 Approximating cost(X): New results
We have tight bounds for the estimation of cost(X)
When we want a (1 + ε)-approximation:
- Θ_ε(n²/k) queries are sufficient and necessary
When happy with ε(cost(X) + MST(X)) additive error:
- Θ_ε(min{n·k^{3/2}, n²/k}) queries are sufficient and necessary
- it's sublinear for every k, always at most Õ_ε(n^{8/5})
SLIDE 18 Approximating cost(X): New results
We have tight bounds for the estimation of cost(X)
When we want a (1 + ε)-approximation:
- Θ_ε(n²/k) queries are sufficient and necessary
When happy with ε(cost(X) + MST(X)) additive error:
- Θ_ε(min{n·k^{3/2}, n²/k}) queries are sufficient and necessary
Techniques:
- Upper bounds: clever random sampling
- Lower bounds: analysis of some clustering inputs (the more complex part)
SLIDE 19 Approximating cost(X): New results
We have tight bounds for the estimation of cost(X)
When we want a (1 + ε)-approximation:
- Θ_ε(n²/k) queries are sufficient and necessary
- efficient for large k
- relies on random sampling of close neighbors and far neighbors in the k-NN graph
SLIDE 20 Upper bound for (1 + ε)-approximation
Two "hard" instances:
- a cluster of n − 1 points and a single point far away
- a cluster of n − (k + 1) points, and k + 1 points far away and close to each other
Our algorithm must be able to distinguish between these two instances
SLIDE 21 Upper bound for (1 + ε)-approximation
Each point v approximates the distance ℓ_v to its (k/2)-th nearest neighbor
- O(n/k) queries per point; Õ(n²/k) queries in total
Short edges in the k-NN graph: (v, w) s.t. d(v, w) ≤ 10·ℓ_v
Long edges in the k-NN graph: all other edges
We separately estimate the total length of the short edges and the total length of the long edges
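The per-point sampling step can be sketched as follows (my own sketch of one natural estimator with assumed constants; the talk's exact procedure and guarantee may differ): sampling O((n/k)·log n) random points and taking the closest one gives, w.h.p., a value between v's 1-NN distance and its (k/2)-NN distance.

```python
import math
import random

def approx_radius(n, dist, v, k, rng=random):
    """Estimate v's (k/2)-th nearest-neighbor distance by sampling
    O((n/k) log n) uniform points and returning the closest sampled
    distance.  W.h.p. some sample lands among the k/2 nearest neighbors
    of v, so the estimate is at most the true (k/2)-NN distance (and it
    is always at least the 1-NN distance)."""
    s = max(1, math.ceil((2 * n / k) * math.log(max(n, 2))))
    sampled = [rng.randrange(n) for _ in range(s)]
    dists = [dist(v, w) for w in sampled if w != v]
    return min(dists) if dists else 0.0
```

On n points 0, …, n−1 on a line, the estimate for v = 0 is a small positive integer with overwhelming probability.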
SLIDE 22 Upper bound for (1 + ε)-approximation
Short edges in the k-NN graph: (v, w) s.t. d(v, w) ≤ 10·ℓ_v
Long edges in the k-NN graph: all other edges
- We separately estimate the total length of the short edges and of the long edges by random sampling methods
Summing the estimates for short and long edges gives a (1 + ε)-approximation of cost(X)
SLIDE 23 Lower bound for (1 + ε)-approximation
Consider two problem instances:
- intra-cluster distances ~ 0; inter-cluster distances ~ 1
- X: n/(k+1) clusters of size k + 1 each → cost(X) ~ 0
- Y: n/(k+1) − 1 clusters of size k + 1 each, plus one cluster of size k and one cluster of size 1 → cost(Y) ≫ 0
We prove that Ω(n²/k) queries are required to distinguish between these two problem instances
SLIDE 24
Lower bound for (π + π»)-approximation
Clusters of size k + 1 each; cost(X) ~ 0
SLIDE 25
Lower bound for (π + π»)-approximation
Clusters of size k + 1 each; cost(X) ~ 0
In the other instance: remove a random point from its cluster; cost(Y) ≫ 0
SLIDE 26 Lower bound for (π + π»)-approximation
Clusters of size k + 1 each; cost(X) ~ 0
In the other instance: remove a random point from its cluster; cost(Y) ≫ 0
To find that single point one needs Ω(n²/k) queries:
- Ω(n) samples to hit it first
- Ω(n/k) further queries per sample to detect that it has no close neighbors
SLIDE 28 Approximating with error ε(cost(X) + MST(X))
Simplifying assumptions:
- all distances are of the form (1 + ε)^i
- all distances are between 1 and poly(n)
SLIDE 29 The cost of a π-NN graph & threshold graphs
Threshold graphs [Chazelle, Rubinfeld, Trevisan, SICOMP 2005; Czumaj, Sohler, SICOMP 2009]
Let H be the k-NN graph of an n-point metric space
- H^(i) = (V, E^(i)) contains the edges of H of weight ≤ (1 + ε)^i
- deg^(i)(v) = outdegree of v in H^(i)
Simple formula for the k-NN cost:
cost(X) = nk + ε · Σ_i (1 + ε)^i · Σ_v (k − deg^(i)(v))
SLIDE 30
The cost of a π-NN graph
cost(X) = nk + ε · Σ_i (1 + ε)^i · Σ_v (k − deg^(i)(v))
This implies that it suffices to estimate Σ_v (k − deg^(i)(v)) for every i
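The identity cost(X) = nk + ε·Σ_i (1+ε)^i·Σ_v (k − deg^(i)(v)) can be checked numerically: each weight (1+ε)^j telescopes as 1 + ε·Σ_{i<j}(1+ε)^i, and regrouping by threshold level gives the formula (a small self-contained sanity check, not the talk's code):

```python
import random

def cost_direct(weights_per_vertex):
    """cost(X): sum of all k-NN edge weights."""
    return sum(sum(row) for row in weights_per_vertex)

def cost_via_thresholds(weights_per_vertex, eps, levels):
    """cost(X) = n*k + eps * sum_i (1+eps)^i * sum_v (k - deg_i(v)),
    where deg_i(v) counts v's k-NN edges of weight <= (1+eps)^i."""
    n = len(weights_per_vertex)
    k = len(weights_per_vertex[0])
    total = float(n * k)
    for i in range(levels):
        thr = (1 + eps) ** i
        # k - deg_i(v) = number of v's edges heavier than the threshold
        heavier = sum(sum(1 for w in row if w > thr * (1 + 1e-12))
                      for row in weights_per_vertex)
        total += eps * thr * heavier
    return total

# demo: random edge weights of the form (1+eps)^j, j = 0..levels
rng = random.Random(0)
eps, levels, n, k = 0.5, 6, 8, 3
weights = [[(1 + eps) ** rng.randint(0, levels) for _ in range(k)]
           for _ in range(n)]
```

Both evaluations agree up to floating-point rounding for any such weight matrix.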
SLIDE 31 The cost of a π-NN graph
cost(X) = nk + ε · Σ_i (1 + ε)^i · Σ_v (k − deg^(i)(v))
This implies that it suffices to estimate Σ_v (k − deg^(i)(v)) for every i
- deg^(i)(v) is given only implicitly; we would need to query the distances to all neighbors to compute deg^(i)(v) exactly
SLIDE 32 The cost of a π-NN graph
cost(X) = nk + ε · Σ_i (1 + ε)^i · Σ_v (k − deg^(i)(v))
This implies that it suffices to estimate Σ_v (k − deg^(i)(v))
What about simple random sampling?
- sample t points and, for each, sample s of their neighbors to estimate k − deg^(i)(v)
This won't work well:
- a cluster with n − 1 points and a single outlier
- we need to find the single outlier
- so the sample size must be Ω(n)
SLIDE 33 The cost of a π-NN graph
cost(X) = nk + ε · Σ_i (1 + ε)^i · Σ_v (k − deg^(i)(v))
This implies that it suffices to estimate Σ_v (k − deg^(i)(v))
Idea (inspired by the MST approximation due to Czumaj & Sohler, 2009):
- If v has many (≫ k) neighbors at distance ≤ (1 + ε)^i
  → v doesn't contribute to our cost function
  → random sampling can detect this
- Otherwise, we can afford slower random sampling
SLIDE 34 Approximating number of vertices of given degree
For a given point v, sample vertices uniformly at random
- If the number of sampled vertices within distance (1 + ε)^i of v is sufficiently large, then v is non-useful
- Otherwise, double the sample size and repeat until it becomes t = O(n log n / k)
  → return v as useful
- If v has m points at distance ≤ (1 + ε)^i, then w.h.p., in expected time O(n log n / (k + m)):
  - if m ≥ 4k then we mark v as non-useful
  - if m ≤ k then we mark v as useful
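A sketch of this doubling procedure (the constants and the exact stopping rule are my assumptions; the talk only specifies the asymptotics):

```python
import math
import random

def classify(n, dist, v, k, thr, c=8, rng=random):
    """Decide whether v is 'useful' (few points within distance thr, i.e.
    within the threshold (1+eps)^i) by doubling the uniform sample size.
    Vertices with many (>> k) close points are rejected early, after few
    queries; otherwise we go up to t = O(n log n / k) samples and declare
    v useful."""
    t_max = max(1, math.ceil(c * n * math.log(max(n, 2)) / k))
    t = 1
    while t <= t_max:
        close = 0
        for _ in range(t):
            w = rng.randrange(n)
            if w != v and dist(v, w) <= thr:
                close += 1
        # empirical close-fraction >= 2k/n suggests >= ~2k close points
        if close * n >= 2 * k * t:
            return "non-useful"
        t *= 2
    return "useful"
```

A vertex with no close points is always classified useful, while a vertex with ≥ 4k close points triggers the early exit with overwhelming probability.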
SLIDE 35 Approximating number of vertices of given degree
Using this approach, we can take a sample of size O(n(1 + ε)^i / (ε^{O(1)} · MST(X))) to get the desired error
The expected running time of an evaluation on a randomly chosen vertex is O(k log n · MST(X) / (1 + ε)^i)
Theorem: We can approximate the cost of a k-NN graph in time O(nk²/ε^{O(1)}) with an error of ε(cost(X) + MST(X))
- the k² can be improved to k^{3/2}
SLIDE 36
Lower bound
Clusters of size k + 1 each; cost(X) ~ 0, MST(X) ~ n/k
SLIDE 37
Lower bound
Clusters of size k + 1 each; cost(X) ~ 0, MST(X) ~ n/k
In the other instance: in some clusters, move √k points to some other cluster
SLIDE 38
Lower bound
Clusters of size k + 1 each; cost(X) ~ 0, MST(X) ~ n/k
In the other instance: in some clusters, move √k points to some other cluster; cost(Y) ≫ 0, MST(Y) ~ n/k
SLIDE 39
Lower bound
Clusters of size k + 1 each; cost(X) ~ 0, MST(X) ~ n/k; cost(Y) ≫ 0, MST(Y) ~ n/k
In the other instance: in some clusters, move √k points to some other cluster
To distinguish between the two instances one needs Ω_ε(n·k^{3/2}) queries (for small k); for large k it is still Ω(n²/k) queries
SLIDE 40 Approximating cost(X): Summary of results
We have tight bounds for the estimation of cost(X)
When we want a (1 + ε)-approximation:
- Θ_ε(n²/k) queries are sufficient and necessary
When happy with ε(cost(X) + MST(X)) additive error:
- Θ_ε(min{n·k^{3/2}, n²/k}) queries are sufficient and necessary
SLIDE 41 Open problems
Sublinear algorithms in metric spaces:
- more problems approximated in sublinear time
- if we add some other parameters to the error bound (like MST), can we approximate more problems in sublinear time?
SLIDE 42
THANK YOU!