The Proximity Project [Gray, Lee, Rotella, Moore 2005] Careful, agnostic empirical comparison; open source. 15 datasets, dimension 2-1M. The most well-known methods from 1972-2004: • Exact NN: 15 methods • All-NN, mono & bichromatic: 3 methods • Approximate NN: 10 methods • Point location: 3 methods • (NN classification: 3 methods) • (Radial range search: 3 methods)
…and the overall winner is? (exact NN, high-D) Ball-trees, basically – though there is high variance and dataset dependence • Auton ball-trees III [Omohundro 91], [Uhlmann 91], [Moore 99] • Cover-trees [Beygelzimer, Kakade, Langford 04] • Crust-trees [Yianilos 95], [Gray, Lee, Rotella, Moore 2005]
A ball-tree: level 1
A ball-tree: level 2
A ball-tree: level 3
A ball-tree: level 4
A ball-tree: level 5
Anchors Hierarchy [Moore 99] • ‘Middle-out’ construction • Uses farthest-point method [Gonzalez 85] to find sqrt(N) clusters – this is the middle • Bottom-up construction to get the top • Top-down division to get the bottom • Smart pruning throughout to make it fast • O(N log N), very fast in practice
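The farthest-point step can be sketched as below. This is a minimal, naive O(N·k) version of the greedy rule only; it does not include the pruning or the bottom-up and top-down phases of the anchors hierarchy. The function name, the choice of the first center, and the test data are assumptions made for illustration.

```python
import math
import random

def farthest_point_centers(points, k):
    """Greedy farthest-point selection of k centers (naive sketch, no pruning)."""
    centers = [0]  # assumption: start from an arbitrary point (index 0)
    # dist[i] = distance from point i to its nearest chosen center so far
    dist = [math.dist(p, points[0]) for p in points]
    while len(centers) < k:
        # next center = the point farthest from all current centers
        nxt = max(range(len(points)), key=dist.__getitem__)
        centers.append(nxt)
        for i, p in enumerate(points):
            dist[i] = min(dist[i], math.dist(p, points[nxt]))
    return centers, max(dist)  # center indices and the largest cluster radius

random.seed(0)
pts = [(random.random(), random.random()) for _ in range(400)]
k = math.isqrt(len(pts))  # sqrt(N) anchors form the "middle" level
centers, radius = farthest_point_centers(pts, k)
print(len(centers), radius)
```

Each of the sqrt(N) centers would then own the points nearest to it, giving the middle level from which the tree is grown upward and downward.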
Outline: 1. Physics problems and methods 2. Generalized N-body problems 3. Proximity data structures 4. Dual-tree algorithms 5. Comparison
Questions • What’s the magic that allows O(N)? Is it really because of the expansions? • Can we obtain a method that’s: 1. O(N) 2. Lightweight: - works with or without expansions - simple, recursive
New algorithm • Use an adaptive tree (kd-tree or ball-tree) • Dual-tree recursion • Finite-difference approximation
Single-tree : Dual-tree (symmetric):
Simple recursive algorithm
SingleTree(q,R) {
  if approximate(q,R), return.
  if leaf(R), SingleTreeBase(q,R).
  else,
    SingleTree(q,R.left).
    SingleTree(q,R.right).
}
(NN or range-search: recurse on the closer node first)
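A minimal sketch of the single-tree recursion for nearest-neighbor search, assuming a simple ball-tree built by median splits on the widest coordinate. The `Node` class, `min_dist`, and `single_tree_nn` are illustrative names, not from the original; here "approximate" is the exact prune "this node cannot contain a closer point".

```python
import math
import random

class Node:
    """Ball-tree node: centroid, covering radius, and two children (sketch)."""
    def __init__(self, pts, leaf_size=8):
        self.pts = pts
        dim = len(pts[0])
        self.center = tuple(sum(p[d] for p in pts) / len(pts) for d in range(dim))
        self.radius = max(math.dist(self.center, p) for p in pts)
        self.left = self.right = None
        if len(pts) > leaf_size:
            # split at the median of the widest coordinate
            d = max(range(dim),
                    key=lambda j: max(p[j] for p in pts) - min(p[j] for p in pts))
            s = sorted(pts, key=lambda p: p[d])
            mid = len(s) // 2
            self.left, self.right = Node(s[:mid], leaf_size), Node(s[mid:], leaf_size)

def min_dist(q, node):
    """Lower bound on the distance from q to any point inside the node's ball."""
    return max(0.0, math.dist(q, node.center) - node.radius)

def single_tree_nn(q, node, best=(math.inf, None)):
    if min_dist(q, node) >= best[0]:      # prune: node cannot improve the answer
        return best
    if node.left is None:                 # leaf: base case, scan the points
        for p in node.pts:
            d = math.dist(q, p)
            if d < best[0]:
                best = (d, p)
        return best
    # recurse on the closer child first so the bound tightens early
    for child in sorted((node.left, node.right), key=lambda c: min_dist(q, c)):
        best = single_tree_nn(q, child, best)
    return best

random.seed(0)
pts = [(random.random(), random.random(), random.random()) for _ in range(500)]
root = Node(pts)
q = (0.5, 0.5, 0.5)
print(single_tree_nn(q, root))
```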
Simple recursive algorithm
DualTree(Q,R) {
  if approximate(Q,R), return.
  if leaf(Q) and leaf(R), DualTreeBase(Q,R).
  else,
    DualTree(Q.left,R.left).
    DualTree(Q.left,R.right).
    DualTree(Q.right,R.left).
    DualTree(Q.right,R.right).
}
(NN or range-search: recurse on the closer node first)
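A minimal sketch of the dual-tree recursion, applied to counting pairs within distance h. To keep it short it assumes 1-D points and an interval (1-D kd-tree) node; the `Node` class and function names are illustrative. Unlike the pseudocode, it also handles the case where only one of the two nodes is a leaf by descending only the other.

```python
import random

class Node:
    """1-D kd-tree node over a sorted list of points (illustrative helper)."""
    def __init__(self, pts, leaf_size=4):
        self.pts, self.lo, self.hi = pts, pts[0], pts[-1]
        self.left = self.right = None
        if len(pts) > leaf_size:
            mid = len(pts) // 2
            self.left, self.right = Node(pts[:mid], leaf_size), Node(pts[mid:], leaf_size)

def min_dist(a, b):
    """Smallest possible distance between a point in a and a point in b."""
    return max(0.0, a.lo - b.hi, b.lo - a.hi)

def max_dist(a, b):
    """Largest possible distance between a point in a and a point in b."""
    return max(a.hi - b.lo, b.hi - a.lo)

def dual_tree_count(Q, R, h):
    """Count pairs (q, r) with |q - r| <= h."""
    if min_dist(Q, R) > h:                 # exclusion prune: no pair qualifies
        return 0
    if max_dist(Q, R) <= h:                # inclusion prune: every pair qualifies
        return len(Q.pts) * len(R.pts)
    if Q.left is None and R.left is None:  # base case: brute force on two leaves
        return sum(abs(q - r) <= h for q in Q.pts for r in R.pts)
    qs = [Q] if Q.left is None else [Q.left, Q.right]
    rs = [R] if R.left is None else [R.left, R.right]
    return sum(dual_tree_count(q, r, h) for q in qs for r in rs)

random.seed(0)
queries = sorted(random.random() for _ in range(300))
refs = sorted(random.random() for _ in range(300))
count = dual_tree_count(Node(queries), Node(refs), 0.05)
print(count)
```

The two prunes play the role of `approximate(Q,R)`: for a counting problem they are exact, so the result matches brute force.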
Dual-tree traversal (depth-first) Reference points Query points
Finite-difference function approximation.
Taylor expansion: $f(x) \approx f(a) + f'(a)(x - a)$
Gregory-Newton finite form: $f(x) \approx f(x_i) + \frac{1}{2}\,\frac{f(x_{i+1}) - f(x_i)}{x_{i+1} - x_i}\,(x - x_i)$
For the kernel: $K(\delta) \approx K(\delta^{\min}) + \frac{1}{2}\,\frac{K(\delta^{\max}) - K(\delta^{\min})}{\delta^{\max} - \delta^{\min}}\,(\delta - \delta^{\min})$
Finite-difference function approximation. Assumes a monotonically decreasing kernel.
$\bar{K} = \frac{1}{2}\left[K(\delta^{\min}_{QR}) + K(\delta^{\max}_{QR})\right]$
$\mathrm{err}_q = \sum_{r}^{N_R} \left|K(\delta_{qr}) - \bar{K}\right| \le \frac{N_R}{2}\left[K(\delta^{\min}_{QR}) - K(\delta^{\max}_{QR})\right]$
Could also use center of mass. Stopping rule?
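The midpoint estimate and its error bound can be checked numerically. The sketch below assumes a Gaussian kernel and two hypothetical 1-D node intervals; it compares the summed per-reference error for one query point against the bound of half the reference count times the kernel gap.

```python
import math
import random

def kernel(d, h=1.0):
    """Gaussian kernel: monotonically decreasing in distance d >= 0 (assumed)."""
    return math.exp(-0.5 * (d / h) ** 2)

random.seed(1)
q_lo, q_hi = 0.0, 0.5      # query node interval (hypothetical)
r_lo, r_hi = 2.0, 3.0      # reference node interval (hypothetical)
refs = [random.uniform(r_lo, r_hi) for _ in range(50)]
q = random.uniform(q_lo, q_hi)

d_min = r_lo - q_hi        # closest possible query-reference pair
d_max = r_hi - q_lo        # farthest possible query-reference pair
k_bar = 0.5 * (kernel(d_min) + kernel(d_max))   # midpoint kernel value

# summed error from replacing every K(delta_qr) by k_bar
err_q = sum(abs(kernel(abs(q - r)) - k_bar) for r in refs)
# bound: (N_R / 2) * [K(d_min) - K(d_max)]
bound = 0.5 * len(refs) * (kernel(d_min) - kernel(d_max))
print(err_q, bound)
```

Because every true kernel value lies between the kernel at the farthest and closest distances, no single term can be off by more than half that gap, which is where the bound comes from.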
Simple approximation method
approximate(Q,R) {
  dl = $N_R K(\delta^{\max})$, du = $N_R K(\delta^{\min})$.
  if $\delta^{\min} \ge \tau \cdot \max(\mathrm{diam}(Q), \mathrm{diam}(R))$,
    incorporate(dl, du).
}
• trivial to change kernel • hard error bounds
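A minimal sketch of this pruning test, assuming 1-D nodes represented as `(lo, hi)` intervals and a Gaussian kernel. The function signature and the convention of returning `None` when the pair is not pruned are assumptions for illustration.

```python
import math

def kernel(d, h=1.0):
    """Gaussian kernel (assumed); any monotonically decreasing kernel works."""
    return math.exp(-0.5 * (d / h) ** 2)

def approximate(Q, R, tau, n_r):
    """Q, R are (lo, hi) intervals; n_r is the number of reference points in R.

    Returns (dl, du), lower and upper bounds on R's contribution to any
    query in Q, if the nodes are well separated; otherwise returns None.
    """
    d_min = max(0.0, Q[0] - R[1], R[0] - Q[1])
    d_max = max(Q[1] - R[0], R[1] - Q[0])
    diam = max(Q[1] - Q[0], R[1] - R[0])
    if d_min >= tau * diam:
        return n_r * kernel(d_max), n_r * kernel(d_min)
    return None

far = approximate((0.0, 1.0), (10.0, 11.0), tau=2.0, n_r=5)
near = approximate((0.0, 1.0), (1.5, 2.5), tau=2.0, n_r=5)
print(far, near)
```

Changing the kernel only means swapping `kernel`; the bounds stay valid as long as it is monotonically decreasing. The separation parameter still has to be tuned by hand.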
Big issue in practice… Tweak parameters Case 1 – algorithm gives no error bounds Case 2 – algorithm gives hard error bounds: must run it many times Case 3 – algorithm automatically achieves your error tolerance
Automatic approximation method
approximate(Q,R) {
  dl = $N_R K(\delta^{\max})$, du = $N_R K(\delta^{\min})$.
  if $K(\delta^{\min}) - K(\delta^{\max}) \le \frac{2\varepsilon}{N}\,\phi_{\min}(Q)$,
    incorporate(dl, du).
    return.
}
• just set error tolerance, no tweak parameters • hard error bounds
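A minimal sketch of a dual-tree kernel sum using this automatic rule, assuming 1-D data, a Gaussian kernel, and interval nodes. All names are illustrative. For simplicity it stores bounds per query point and computes the node lower bound by scanning the node's points, which a real implementation would avoid by keeping bounds on the nodes. The prune fires when the kernel gap is at most 2ε/N times the smallest running lower bound in Q.

```python
import math
import random

class Node:
    """1-D interval node holding indices sorted by coordinate (sketch)."""
    def __init__(self, idx, xs, leaf_size=8):
        self.idx, self.lo, self.hi = idx, xs[idx[0]], xs[idx[-1]]
        self.left = self.right = None
        if len(idx) > leaf_size:
            mid = len(idx) // 2
            self.left = Node(idx[:mid], xs, leaf_size)
            self.right = Node(idx[mid:], xs, leaf_size)

def kernel(d, h=0.1):
    return math.exp(-0.5 * (d / h) ** 2)   # Gaussian, bandwidth h (assumed)

def gap(a, b):
    return max(0.0, a.lo - b.hi, b.lo - a.hi)  # minimum node-node distance

def dual_tree_kde(Q, R, qx, rx, eps, n_ref, lower, est):
    k_hi = kernel(gap(Q, R))
    k_lo = kernel(max(Q.hi - R.lo, R.hi - Q.lo))
    phi_min = min(lower[i] for i in Q.idx)  # lower bound on the sum for any q in Q
    n_r = len(R.idx)
    if k_hi - k_lo <= 2.0 * eps / n_ref * phi_min:
        for i in Q.idx:                     # incorporate(dl, du) with midpoint estimate
            lower[i] += n_r * k_lo
            est[i] += n_r * 0.5 * (k_hi + k_lo)
        return
    if Q.left is None and R.left is None:   # base case: exact sums
        for i in Q.idx:
            s = sum(kernel(abs(qx[i] - rx[j])) for j in R.idx)
            lower[i] += s
            est[i] += s
        return
    qs = [Q] if Q.left is None else [Q.left, Q.right]
    rs = [R] if R.left is None else [R.left, R.right]
    for q in qs:
        for r in sorted(rs, key=lambda r: gap(q, r)):  # closer node first
            dual_tree_kde(q, r, qx, rx, eps, n_ref, lower, est)

random.seed(0)
qx = [random.random() for _ in range(200)]
rx = [random.random() for _ in range(300)]
Q = Node(sorted(range(len(qx)), key=qx.__getitem__), qx)
R = Node(sorted(range(len(rx)), key=rx.__getitem__), rx)
eps = 0.01
lower, est = [0.0] * len(qx), [0.0] * len(qx)
dual_tree_kde(Q, R, qx, rx, eps, len(rx), lower, est)
```

Each prune adds at most a share ε·N_R/N of the current lower bound to a query's error, and the reference nodes pruned for any one query are disjoint, so the relative error of each estimate stays within ε.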
Runtime analysis THEOREM: Dual-tree algorithm is O(N) ASSUMPTION: N points from density f, $0 < c \le f \le C$
Recurrence for self-finding
single-tree (point-node): $T(N) = T(N/2) + O(1)$, $T(1) = O(1)$ ⇒ $N \cdot O(\log N)$
dual-tree (node-node): $T(N) = 2T(N/2) + O(1)$, $T(1) = O(1)$ ⇒ $O(N)$
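The two recurrences can be unrolled numerically to see the growth rates. This sketch takes every O(1) term to be exactly 1 (an assumption made only so the counts are concrete) and uses powers of two for N.

```python
def t_single(n):
    """T(N) = T(N/2) + 1, T(1) = 1: cost of one point descending the tree."""
    return 1 if n <= 1 else t_single(n // 2) + 1

def t_dual(n):
    """T(N) = 2 T(N/2) + 1, T(1) = 1: cost of the node-node recursion."""
    return 1 if n <= 1 else 2 * t_dual(n // 2) + 1

for n in (2 ** 4, 2 ** 10, 2 ** 16):
    single_total = n * t_single(n)   # N separate queries: N * O(log N)
    dual_total = t_dual(n)           # one dual-tree pass: O(N)
    print(n, single_total, dual_total)
```

With these unit costs the dual-tree count is 2N − 1, while the single-tree total is N(log2 N + 1).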
Packing bound LEMMA: Number of nodes that are well-separated from a query node Q is bounded by a constant $1 + g(s, c, C)^D$. Thus the recurrence yields the entire runtime. Done. (cf. [Callahan-Kosaraju 95]) On a manifold, use its dimension D’ (the data’s ‘intrinsic dimension’).