Sublinear-time approximation of the cost of a metric k-nearest neighbor graph — Artur Czumaj, DIMAP (Centre for Discrete Mathematics and its Applications), Department of Computer Science, University of Warwick. Joint work with Christian Sohler.


SLIDE 1

Sublinear-time approximation of the cost of a metric k-nearest neighbor graph

Artur Czumaj

DIMAP (Centre for Discrete Mathematics and its Applications)
Department of Computer Science
University of Warwick

Joint work with Christian Sohler

SLIDE 2

k-nearest neighbor graph

The k-nearest neighbor (k-NN) graph is a basic data structure with applications in

– spectral clustering (and unsupervised learning)
– non-linear dimensionality reduction
– density estimation
– and many more…

Computing a k-NN graph takes Ω(n²) time in the worst case

– many data structures for approximate nearest neighbor queries are known for specific distance measures
– heuristics

SLIDE 3

k-NN graph

A k-NN graph for a set of objects with a distance measure is a directed graph such that

  • every vertex corresponds to exactly one object
  • every vertex has exactly k outgoing edges
  • the outgoing edges point to its k closest neighbors

Every point u connects to its k nearest neighbors
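For reference, the definition can be turned into a brute-force construction directly. This is the Ξ©(nΒ²)-time baseline the talk is trying to beat; `math.dist` is just a stand-in distance measure:

```python
import math

def knn_graph(points, k, dist=None):
    """Brute-force k-NN graph: for each point, directed edges to its k
    closest other points. Uses O(n^2) distance evaluations, matching the
    worst-case bound mentioned on the slide."""
    if dist is None:
        dist = math.dist  # Euclidean distance by default
    n = len(points)
    graph = {}
    for i in range(n):
        # sort all other points by distance from points[i], keep the k closest
        others = sorted((j for j in range(n) if j != i),
                        key=lambda j: dist(points[i], points[j]))
        graph[i] = others[:k]
    return graph

pts = [(0, 0), (1, 0), (0, 1), (5, 5), (5, 6)]
g = knn_graph(pts, k=2)  # two well-separated clusters
```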

SLIDE 4

Access to the input: Distance oracle model

Distance Oracle Model for Metric Spaces:

  • Input: n-point metric space (X, d)

– we will often view the metric space as a complete edge-weighted graph H = (V, E) with V = {1, …, n} and weights satisfying the triangle inequality

Query: return the distance between any u and v
Running time ~ number of queries
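A minimal sketch of the model (class and attribute names are mine): the algorithm may only ask for pairwise distances, and its cost is the number of such queries:

```python
class DistanceOracle:
    """Oracle access to an n-point metric space: the algorithm can only
    ask for d(u, v); its complexity is measured by the query count."""
    def __init__(self, dist_matrix):
        self._d = dist_matrix   # symmetric matrix satisfying the triangle inequality
        self.queries = 0
    def __len__(self):
        return len(self._d)
    def query(self, u, v):
        self.queries += 1
        return self._d[u][v]

# a toy 3-point metric on a path: d(0,1) = d(1,2) = 1, d(0,2) = 2
D = [[0, 1, 2],
     [1, 0, 1],
     [2, 1, 0]]
oracle = DistanceOracle(D)
_ = oracle.query(0, 2)
```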

SLIDE 5

k-NN graph cost approximation

Problem: Given oracle access to an n-point metric space, compute a (1 + Ρ)-approximation to the cost (denoted by cost(X))

  • of the k-NN graph

cost(X) denotes the sum of the edge weights

SLIDE 6

(Simple) lower bounds

Observation: Computing a k-NN graph requires Ω(n²) queries
Proof: Consider a (1,2)-metric with all distances 2 except for a single random pair with distance 1

  • we must find this edge to compute the k-NN graph
  • finding a random edge requires Ω(n²) time

SLIDE 7

(Simple) lower bounds

Observation: Finding a (2 − Ρ)-approx. k-NN graph requires Ω(n²/log n) queries
Proof:

  • Take a random (1,2)-metric where each distance is chosen independently at random to be 1 with probability Θ(k log n / n)
  • Whp. every vertex has β‰₯ k neighbors at distance 1
  • A (2 βˆ’ Ξ΅)-approximate k-NN graph contains Ξ©(nk) of the 1s
SLIDE 8

(Simple) lower bounds

Observation: Approximating the cost of 1-NN requires Ω(n²) queries
Proof:

  • Via the lower bound for perfect matching [Badoiu, Czumaj, Indyk, Sohler, ICALP’05]
  • Pick a random perfect matching and set all matching distances to 0 and all other distances to 1
  • Distinguish this from an instance where one matching edge has value 1
  • Finding this edge requires Ω(n²) queries
SLIDE 9

(Simple) lower bounds

  • Pick a random perfect matching and set all matching distances to 0 and all other distances to 1
  • Distinguish this from an instance where one matching edge has value 1
  • Finding this edge requires Ω(n²) queries

All matching edges except possibly a single one have weight 0; one random edge may have weight 1; all non-matching distances are 1

1-NN cost is either 0 or 1, depending on a single edge

SLIDE 10

(Simple) lower bounds

Observation: Approximating the cost of 1-NN requires Ω(n²) queries
Proof:

  • Via the lower bound for perfect matching [Badoiu, Czumaj, Indyk, Sohler, ICALP’05]
  • Pick a random perfect matching and set all matching distances to 0 and all other distances to 1
  • Distinguish this from an instance where one matching edge has value 1
  • Finding this edge requires Ω(n²) queries
SLIDE 11

Approximating cost(X)

The β€œsimple lower bounds” show that

  • finding a low-cost k-NN graph is hopeless
  • estimating cost(X) is hopeless

– at least for small k

Can we do anything?

SLIDE 12

Approximating cost(X)

The β€œsimple lower bounds” show that

  • finding a low-cost k-NN graph is hopeless
  • estimating cost(X) is hopeless

– at least for small k

A similar situation has been known for some other problems: MST, degree estimation, etc.
Chazelle et al.: MST cost in graphs can be (1 + Ρ)-approximated in poly(dW/Ρ) time (d = max degree, W = max weight)

  • Czumaj, Sohler: MST cost in the metric case can be approximated in Γ•(n) time

SLIDE 13

Approximating cost(X): New results

Theorem 1: A (1 + Ρ)-approximation of cost(X) with Õ(n²/(kΡ²)) queries
Theorem 2: Ω(n²/k) queries are necessary to approximate cost(X) within any constant factor

SLIDE 14

Approximating cost(X): New results

Theorem 1: A (1 + Ρ)-approximation of cost(X) with Õ(n²/(kΡ²)) queries
Theorem 2: Ω(n²/k) queries are necessary to approximate cost(X) within any constant factor

This is very bad for small k, and very good for large k. What can we do for small k?

SLIDE 15

Approximating cost(X): New results

The lower bound for small k holds when the instances are very β€œspread-out” and β€œdisjoint”
Can we get a faster algorithm when we allow the approximation guarantee to depend on the MST cost?

SLIDE 16

Approximating cost(X): New results

Theorem 3: With Õ_Ρ(nk^{3/2}) queries one can approximate cost(X) with error Ρ · (cost(X) + MST(X))
Corollary 4: With Õ_Ρ(min{nk^{3/2}, n²/k}) queries one can approximate cost(X) with error Ρ · (cost(X) + MST(X))
Theorem 5: Any algorithm that approximates cost(X) with error Ρ · (cost(X) + MST(X)) requires Ω(min{nk^{3/2}/Ρ, n²/k}) queries

SLIDE 17

Approximating cost(X): New results

We have tight bounds for the estimation of cost(X)
When we want a (1 + Ρ)-approximation:

  • Θ̃_Ρ(n²/k) queries are sufficient and necessary

When happy with an Ρ(cost(X) + MST(X)) additive error:

  • Θ̃_Ρ(min{nk^{3/2}, n²/k}) queries are sufficient and necessary
  • it’s sublinear for every k, always at most Õ_Ρ(n^{8/5})

SLIDE 18

Approximating cost(X): New results

We have tight bounds for the estimation of cost(X)
When we want a (1 + Ρ)-approximation:

  • Θ̃_Ρ(n²/k) queries are sufficient and necessary

When happy with an Ρ(cost(X) + MST(X)) additive error:

  • Θ̃_Ρ(min{nk^{3/2}, n²/k}) queries are sufficient and necessary

Techniques:

  • Upper bounds: clever random sampling
  • Lower bounds: analysis of some clustering inputs (the more complex part)
SLIDE 19

Approximating cost(X): New results

We have tight bounds for the estimation of cost(X)
When we want a (1 + Ρ)-approximation:

  • Θ̃_Ρ(n²/k) queries are sufficient and necessary

Efficient for large k
Relies on random sampling of close neighbors and far neighbors in the k-NN graph

SLIDE 20

Upper bound for (1 + Ρ)-approximation

Two β€œhard” instances

  • A cluster of n βˆ’ 1 points and a single point far away
  • A cluster of n βˆ’ (k + 1) points, and k + 1 points far away and close to each other

Our algorithm must be able to distinguish between these two instances

SLIDE 21

Upper bound for (1 + Ρ)-approximation

Each point u approximates the distance m_u to its (k/2)-th nearest neighbor

  • Γ•(n/k) queries per point; Γ•(nΒ²/k) queries in total

Short edges in the k-NN graph: (u, v) s.t. d(u, v) ≀ 10·m_u
Long edges in the k-NN graph: all other edges
We separately estimate the total length of the short edges and the total length of the long edges

SLIDE 22

Upper bound for (1 + Ρ)-approximation

Short edges in the k-NN graph: (u, v) s.t. d(u, v) ≀ 10·m_u
Long edges in the k-NN graph: all other edges

  • We separately estimate the total length of the short edges and the total length of the long edges by random sampling methods

Summing the estimates for the short and the long edges gives a (1 + Ρ)-approximation of cost(X)

SLIDE 23

Lower bound for (1 + Ρ)-approximation

Consider two problem instances:

– intra-cluster distance ~ 0; inter-cluster distance ~ 1

  • n/(k+1) clusters of size k + 1 each

– cost(X) ~ 0

  • n/(k+1) βˆ’ 1 clusters of size k + 1 each; one cluster of size k, one cluster of size 1

– cost(Y) ≫ 0

We prove that one requires Ω(n²/k) queries to distinguish between these two problem instances
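For small n and k the two instances are easy to materialize and compare by brute force; a sketch (the inner distance value 0.001 standing in for β€œ~0” and the helper names are mine):

```python
def knn_cost(D, k):
    """Brute-force cost of the k-NN graph of a metric given as a matrix:
    for each vertex, sum the k smallest distances to other vertices."""
    n = len(D)
    return sum(sum(sorted(D[v][w] for w in range(n) if w != v)[:k])
               for v in range(n))

def clustered_metric(sizes, inner=0.001, outer=1.0):
    """Metric with the given cluster sizes: distance `inner` inside a
    cluster, `outer` across clusters (a valid metric when inner <= outer)."""
    labels = [c for c, s in enumerate(sizes) for _ in range(s)]
    n = len(labels)
    return [[0.0 if v == w else (inner if labels[v] == labels[w] else outer)
             for w in range(n)] for v in range(n)]

k = 2
X = clustered_metric([k + 1] * 4)           # n/(k+1) clusters of size k+1
Y = clustered_metric([k + 1] * 3 + [k, 1])  # one cluster split into k and 1
```

In X every vertex finds k neighbors at distance ~0, so cost(X) ~ 0; in Y the split cluster forces ~k edges of weight ~1 from the singleton and one such edge from each point of the size-k cluster, so cost(Y) ≫ 0.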

SLIDE 24

Lower bound for (1 + Ρ)-approximation

clusters of size k + 1 each
cost(X) ~ 0

SLIDE 25

Lower bound for (1 + Ρ)-approximation

clusters of size k + 1 each; cost(X) ~ 0
In the other instance: remove a random point from its cluster; cost(Y) ≫ 0

SLIDE 26

Lower bound for (1 + Ρ)-approximation

clusters of size k + 1 each; cost(X) ~ 0
In the other instance: remove a random point from its cluster; cost(Y) ≫ 0
To find that single point one needs Ξ©(nΒ²/k) queries

  • O(n) samples to hit it first
  • O(n/k) further samples to detect that it has no close neighbors
SLIDE 27

Lower bound for (1 + Ρ)-approximation

Consider two problem instances:

– intra-cluster distance ~ 0; inter-cluster distance ~ 1

  • n/(k+1) clusters of size k + 1 each

– cost(X) ~ 0

  • n/(k+1) βˆ’ 1 clusters of size k + 1 each; one cluster of size k, one cluster of size 1

– cost(Y) ≫ 0

We prove that one requires Ω(n²/k) queries to distinguish between these two problem instances

SLIDE 28

Approximating with error Ρ(cost(X) + MST(X))

Simplifying assumptions:

  • All distances are of the form (1 + Ξ΅)^i
  • All distances are between 1 and poly(n)
SLIDE 29

The cost of a k-NN graph & threshold graphs

Threshold graphs [Chazelle, Rubinfeld, Trevisan, SICOMP 2005; Czumaj, Sohler, SICOMP 2009]

Let G be the k-NN graph of an n-point metric space

  • G^(i) = (V, E^(i)) contains the edges of G of weight ≀ (1 + Ξ΅)^i
  • deg^(i)(u) = outdegree of u in G^(i)

Simple formula for the k-NN cost:

cost(X) = n·k + Ρ · Σ_i (1 + Ρ)^i · Σ_u (k − deg^(i)(u))
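Under the earlier simplifying assumption that every edge weight is an exact power (1 + Ξ΅)^j, the formula can be checked numerically: an edge of weight (1 + Ξ΅)^j is missing from the threshold graphs G^(0), …, G^(jβˆ’1), contributing Ρ·Σ_{i<j} (1 + Ξ΅)^i = (1 + Ξ΅)^j βˆ’ 1, and the nΒ·k term supplies the remaining 1. A sketch:

```python
import random

def cost_via_thresholds(edge_weights, k, eps):
    """Evaluate n*k + eps * sum_i (1+eps)^i * sum_u (k - deg^(i)(u)),
    where deg^(i)(u) counts u's k-NN edges of weight <= (1+eps)^i.
    edge_weights[u] lists the weights of u's k outgoing k-NN edges,
    each assumed to be an exact power of (1+eps)."""
    n = len(edge_weights)
    total = n * k
    i = 0
    while True:
        thr = (1 + eps) ** i
        # total "degree deficit" at threshold level i
        deficit = sum(k - sum(1 for w in ws if w <= thr * (1 + 1e-12))
                      for ws in edge_weights)
        if deficit == 0:   # every edge already within the threshold
            return total
        total += eps * thr * deficit
        i += 1

# random instance: n vertices with k edges each, weights (1+eps)^j, j in [0, 5]
rng = random.Random(7)
eps, k, n = 0.5, 3, 8
weights = [[(1 + eps) ** rng.randrange(6) for _ in range(k)] for _ in range(n)]
direct = sum(sum(ws) for ws in weights)   # cost(X) = sum of k-NN edge weights
```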

SLIDE 30

The cost of a k-NN graph

cost(X) = n·k + Ρ · Σ_i (1 + Ρ)^i · Σ_u (k − deg^(i)(u))

This implies that it suffices to estimate Σ_u (k − deg^(i)(u))

SLIDE 31

The cost of a k-NN graph

cost(X) = n·k + Ρ · Σ_i (1 + Ρ)^i · Σ_u (k − deg^(i)(u))

This implies that it suffices to estimate Σ_u (k − deg^(i)(u))

  • deg^(i)(u) is given only implicitly; we would need to query the distances to all neighbors to compute deg^(i)(u) exactly

SLIDE 32

The cost of a k-NN graph

cost(X) = n·k + Ρ · Σ_i (1 + Ρ)^i · Σ_u (k − deg^(i)(u))

This implies that it suffices to estimate Σ_u (k − deg^(i)(u))
What about simple random sampling?

sample s points and, for each, sample r of their neighbors to estimate k − deg^(i)(u)

Won’t work well:
– Cluster with n βˆ’ 1 points and a single outlier
– We need to find the single outlier
– The sample size must be Ξ©(n)

SLIDE 33

The cost of a k-NN graph

cost(X) = n·k + Ρ · Σ_i (1 + Ρ)^i · Σ_u (k − deg^(i)(u))

This implies that it suffices to estimate Σ_u (k − deg^(i)(u))
Idea (inspired by the MST approximation due to Czumaj & Sohler, 2009):

  • If u has many (≫ k) neighbors at distance ≀ (1 + Ξ΅)^i, then u doesn’t contribute to our cost function

– random sampling can detect this

  • Otherwise, we can afford slower random sampling
SLIDE 34

Approximating the number of vertices of a given degree

For a given point u, sample vertices uniformly at random

  • If the number of sampled vertices within distance (1 + Ξ΅)^i of u is sufficiently large, then u is non-useful
  • Otherwise, double the sample size and repeat until it reaches s = O(n log n / k)

– then return u as useful

  • If u has t points at distance ≀ (1 + Ξ΅)^i, then w.h.p., in expected time O(n log n / (k + t)):

– if t β‰₯ 4k then we mark u as non-useful
– if t ≀ k then we mark u as useful
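A sketch of this doubling procedure (the constants and the useful/non-useful decision thresholds are illustrative guesses, not the paper's exact values):

```python
import math, random

def classify_vertex(oracle, v, n, k, radius, rng=random):
    """Doubling search: repeatedly double the sample size; if the
    sampled fraction of points within `radius` of v suggests many
    (>> k) close neighbors, report 'non-useful'. If the cap of
    O(n log n / k) samples is reached without such evidence, report
    'useful'. Constants are illustrative only."""
    cap = math.ceil(8 * n * math.log(max(n, 2)) / k)
    s = 16
    while True:
        s = min(s, cap)
        close = 0
        for _ in range(s):
            w = rng.randrange(n)
            if w != v and oracle(v, w) <= radius:
                close += 1
        # with t points within `radius`, E[close] ~ s*t/n; a count well
        # above s*k/n (and above a small constant) is evidence that t >> k
        if close >= 2 * s * k / n and close >= 8:
            return "non-useful"
        if s >= cap:
            return "useful"
        s *= 2
```

A vertex with many close neighbors is detected after few samples (the expected O(n log n / (k + t)) behavior from the slide), while an isolated vertex runs through all the doubling rounds up to the cap.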

SLIDE 35

Approximating the number of vertices of a given degree

Using this approach, we can take a sample of size O(n(1 + Ρ)^i / (Ρ^O(1) · MST(X))) to achieve the desired error
The expected running time of an evaluation on a randomly chosen vertex is O(k log n · MST(X) / (1 + Ρ)^i)

Theorem: We can approximate the cost of a k-NN graph in time O(nk²/Ρ^O(1)) with an error of Ρ(cost(X) + MST(X))

  • the kΒ² can be improved to k^{3/2}
SLIDE 36

Lower bound

clusters of size k + 1 each
cost(X) ~ 0, MST(X) ~ n/k

SLIDE 37

Lower bound

clusters of size k + 1 each; cost(X) ~ 0, MST(X) ~ n/k
In the other instance: in some clusters, move k points to some other cluster

SLIDE 38

Lower bound

clusters of size k + 1 each; cost(X) ~ 0, MST(X) ~ n/k
cost(Y) ≫ 0, MST(Y) ~ n/k
In the other instance: in some clusters, move k points to some other cluster

SLIDE 39

Lower bound

clusters of size k + 1 each; cost(X) ~ 0, MST(X) ~ n/k
cost(Y) ≫ 0, MST(Y) ~ n/k
In the other instance: in some clusters, move k points to some other cluster
To distinguish between the instances one needs Ω_Ρ(nk^{3/2}) queries (for small k); for large k it is still Ω(n²/k) queries

SLIDE 40

Approximating cost(X): Summary of results

We have tight bounds for the estimation of cost(X)
When we want a (1 + Ρ)-approximation:

  • Θ̃_Ρ(n²/k) queries are sufficient and necessary

When happy with an Ρ(cost(X) + MST(X)) additive error:

  • Θ̃_Ρ(min{nk^{3/2}, n²/k}) queries are sufficient and necessary

SLIDE 41

Open problems

Sublinear algorithms in metric spaces:

  • more problems approximated in sublinear time
  • if we add some other parameter to the error bound (like the MST cost), can we approximate more problems in sublinear time?

SLIDE 42

THANK YOU!