Continuous Nearest Neighbor Search
Yufei Tao, Dimitris Papadias, Qiongmao Shen
Hong Kong University of Science and Technology, Clear Water Bay, Hong Kong

Point Nearest Neighbor (NN) Queries [Roussopoulos et al SIGMOD95, Hjaltason and Samet TODS 99]
• Branch-and-bound algorithms use mindist between the query point q and an R-tree entry E to prune the search space:
– mindist(E, q) = the minimum distance between E and q
[Figure: query point q and the MBRs of entries E_1 and E_2 in the x-y plane, with mindist(q, E_1) and mindist(q, E_2) marked]
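The mindist between a query point and an MBR can be sketched as follows. This is a minimal illustration, assuming 2D points given as (x, y) tuples and MBRs given as (xmin, ymin, xmax, ymax); the function name and the representation are assumptions of this sketch, not taken from the slides.

```python
import math

def mindist_point_rect(q, rect):
    """Minimum Euclidean distance between point q and an axis-aligned rectangle.

    q is (x, y); rect is (xmin, ymin, xmax, ymax). Both layouts are
    assumptions made for this sketch.
    """
    x, y = q
    xmin, ymin, xmax, ymax = rect
    # Per-axis gap between q and the rectangle; 0 if q lies within the extent.
    dx = max(xmin - x, 0.0, x - xmax)
    dy = max(ymin - y, 0.0, y - ymax)
    return math.hypot(dx, dy)
```

A point inside the rectangle gets distance 0, so an entry whose MBR contains q is never pruned.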

Nearest Neighbor Search (NN) with R-Trees
• Depth-first (DF) and best-first (BF) algorithms:
[Figure: data points a to i, the MBRs of R-tree entries E_1 to E_9, the query point and its search region, and the corresponding R-tree (some node contents omitted)]
Best-first trace (each heap entry is listed with the number shown next to it in the figure):
Action        Heap                                                          Result
Visit Root    E_1 (1), E_2 (2), E_3 (8)                                     {empty}
follow E_1    E_2 (2), E_4 (5), E_5 (5), E_3 (8), E_6 (9)                   {empty}
follow E_2    E_8 (2), E_4 (5), E_5 (5), E_3 (8), E_6 (9), E_7 (13), E_9 (17)   {empty}
follow E_8    E_4 (5), E_5 (5), E_3 (8), E_6 (9), E_7 (13), E_9 (17)        {(h, 2)}
Report h and terminate
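The best-first strategy can be sketched with a priority queue keyed on mindist. The sketch below does not use a real R-tree: it assumes a small hand-built tree of dictionaries (the helpers leaf and inner are hypothetical), and it pushes data points into the same heap as the entries, in the style of incremental best-first search, rather than keeping a separate result list as in the trace above.

```python
import heapq
import itertools
import math

def mindist(q, mbr):
    """Minimum distance between point q and MBR (xmin, ymin, xmax, ymax)."""
    xmin, ymin, xmax, ymax = mbr
    dx = max(xmin - q[0], 0.0, q[0] - xmax)
    dy = max(ymin - q[1], 0.0, q[1] - ymax)
    return math.hypot(dx, dy)

def leaf(points):
    """Hypothetical leaf node: a list of (x, y) tuples plus their MBR."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return {"mbr": (min(xs), min(ys), max(xs), max(ys)), "points": list(points)}

def inner(children):
    """Hypothetical intermediate node: child nodes plus the enclosing MBR."""
    boxes = [c["mbr"] for c in children]
    mbr = (min(b[0] for b in boxes), min(b[1] for b in boxes),
           max(b[2] for b in boxes), max(b[3] for b in boxes))
    return {"mbr": mbr, "children": list(children)}

def best_first_nn(root, q):
    """Return (nearest point, distance) by always expanding the closest heap item."""
    tie = itertools.count()  # unique tie-breaker so nodes are never compared
    heap = [(mindist(q, root["mbr"]), next(tie), root)]
    while heap:
        dist, _, item = heapq.heappop(heap)
        if isinstance(item, tuple):      # a data point reached the top: it is the NN
            return item, dist
        if "points" in item:             # leaf node: enqueue its data points
            for p in item["points"]:
                heapq.heappush(heap, (math.dist(q, p), next(tie), p))
        else:                            # intermediate node: enqueue child entries
            for child in item["children"]:
                heapq.heappush(heap, (mindist(q, child["mbr"]), next(tie), child))
    return None, math.inf
```

Because mindist never overestimates the distance to any point below an entry, the first data point popped from the heap cannot be beaten by anything still enqueued.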

Problem: Continuous Nearest Neighbor
• Data: a set of points
• Query: a line segment q = [s, e]
• Result: the nearest neighbor (NN) of every point on q
• Result representation: {s (.NN = a), s_1 (.NN = c), s_2 (.NN = f), s_3 (.NN = h), e}
For the sake of simplicity we present continuous 1-NN; the solution generalizes to k-NN and to trajectories of multiple line segments (see paper).

Previous Approach – Time Parameterized Queries (Tao and Papadias, SIGMOD 02)
Step 1: Find the NN of the start point s, i.e., point a.
Step 2: Use the TP technique to find the first point on the line segment where the NN changes (s_1), together with the point that becomes the next NN there (i.e., point c).

TP NN (cont)
From Step 2 we know that the next NN change occurs at s_1, where point c becomes the NN.
Step 3: Perform another TP NN query to find how far we can travel from s_1 before the current NN (i.e., c) changes.
Repeat this until the entire segment has been covered.
Problem: # of TP queries = # of NN changes (i.e., the cost is output sensitive).
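The repeated "find the next NN change" loop can be illustrated on an in-memory point list. This is only a sketch of the idea, not the R-tree-based TP query of the cited paper. It assumes the query position is q(t) = s + t(e − s) with t in [0, 1], and uses the fact that dist²(q(t), p) equals a term shared by all points plus a function that is linear in t, so the NN changes exactly where two such lines cross. The name tp_style_cnn is hypothetical.

```python
def tp_style_cnn(points, s, e):
    """Return [(t_start, nn), ...]: the NN of q(t) = s + t*(e - s) for t in [0, 1].

    One loop iteration corresponds to one 'next NN change' query.
    """
    dx, dy = e[0] - s[0], e[1] - s[1]

    def line(p):
        # dist^2(q(t), p) = |e - s|^2 * t^2 + slope * t + intercept;
        # the t^2 term is shared by every p, so only the line matters.
        px, py = p[0] - s[0], p[1] - s[1]
        return (-2.0 * (dx * px + dy * py), px * px + py * py)

    lines = [line(p) for p in points]
    # Step 1: NN of s (ties go to the point that wins just after s).
    cur = min(range(len(points)), key=lambda i: (lines[i][1], lines[i][0]))
    t = 0.0
    result = [(t, points[cur])]
    while True:
        slope_c, icpt_c = lines[cur]
        best = None
        for i, (slope, icpt) in enumerate(lines):
            if slope >= slope_c:
                continue  # this point never becomes closer than cur further along q
            cross = max(t, (icpt - icpt_c) / (slope_c - slope))
            if cross >= 1.0:
                continue  # the change would happen at or beyond e
            if best is None or (cross, slope) < best[:2]:
                best = (cross, slope, i)
        if best is None:
            return result
        t, cur = best[0], best[2]
        result.append((t, points[cur]))
```

The number of loop iterations equals the number of NN changes plus one, which mirrors the output-sensitive cost noted above.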

Our Goal
Find all split points s_1, s_2, s_3 (as well as the corresponding NN for each partition) with a single traversal of the dataset.
Term 1: The set of split points (including s and e) constitutes the split list.
Term 2: The circle centered at split point s_i with radius dist(s_i, s_i.NN) is the vicinity circle of s_i.
Term 3: We say a data point u covers a point s if u = s.NN. E.g., points a, c, f, h cover segments [s, s_1], [s_1, s_2], [s_2, s_3], [s_3, e].

Lemma 1
Given a split list SL = {s_0, s_1, …, s_{|SL|−1}} and a new data point p, p covers some point on the query segment q if and only if p covers a split point.
[Figure: two panels showing the split list and vicinity circles, captioned "After processing a" and "After processing c"]
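Lemma 1 reduces the qualification test for a new point to a finite check over the split list. The sketch below assumes that the split list is a list of (split_point, nn) pairs and that a new point p "covers" a split point when it lies strictly inside that split point's vicinity circle; both the representation and the name is_qualifying are assumptions of this sketch.

```python
import math

def is_qualifying(p, split_list):
    """True if p lies strictly inside the vicinity circle of some split point.

    split_list is a list of (split_point, nn) pairs, where nn is the current
    nearest neighbor recorded for that split point (assumed layout).
    """
    return any(math.dist(p, sp) < math.dist(nn, sp) for sp, nn in split_list)
```

By Lemma 1, a point for which this check fails cannot change the result anywhere on q, so it can be discarded without examining the interior of any interval.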

Lemma 2 (Covering Continuity)
The split points covered by a point p are continuous. Namely, if p covers split point s_i but not s_{i−1} (or s_{i+1}), then p cannot cover s_{i−j} (or s_{i+j}) for any value of j > 1.
[Figure: query segment q with split points s_{i−1}, …, s_{i+3}, data points a, b, c, d, f, g and a new point p]
SL = {s_{i−1} (.NN = a), s_i (.NN = b), s_{i+1} (.NN = c), s_{i+2} (.NN = d), s_{i+3} (.NN = f)}

Algorithm with R-trees
Overview: use branch-and-bound techniques to prune the search space.
• When a leaf entry (i.e., a data point) p is encountered, SL is updated if p covers any split point (i.e., p is a qualifying entry), by Lemma 1.
• For an intermediate entry, we visit its subtree only if it may contain a qualifying data point, which is decided with heuristics.
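The split-list update for a single data point can be sketched as follows. This is not the update procedure from the paper; it is a simplified illustration that assumes the split list is stored as (t_start, nn) pairs over the parameter t in [0, 1] of q(t) = s + t(e − s), with each interval ending where the next one starts. Within one interval the sign of dist²(q(t), p) − dist²(q(t), nn) is linear in t, so checking the two interval endpoints is enough. The name insert_point is hypothetical.

```python
def insert_point(split_list, p, s, e):
    """Return the split list after taking data point p into account.

    split_list: [(t_start, nn), ...] sorted by t_start, first t_start == 0.0.
    """
    dx, dy = e[0] - s[0], e[1] - s[1]

    def line(u):
        # Linear part of dist^2(q(t), u); the shared t^2 term is dropped.
        ux, uy = u[0] - s[0], u[1] - s[1]
        return (-2.0 * (dx * ux + dy * uy), ux * ux + uy * uy)

    mp, bp = line(p)
    pieces = []
    for i, (t0, a) in enumerate(split_list):
        t1 = split_list[i + 1][0] if i + 1 < len(split_list) else 1.0
        ma, ba = line(a)
        m, b = mp - ma, bp - ba          # f(t) = m*t + b < 0 where p is closer than a
        f0, f1 = m * t0 + b, m * t1 + b
        if f0 >= 0 and f1 >= 0:
            pieces.append((t0, a))       # p covers neither endpoint: interval unchanged
        elif f0 <= 0 and f1 <= 0:
            pieces.append((t0, p))       # p takes over the whole interval
        else:
            tc = -b / m                  # new split point inside the interval
            first, second = (p, a) if f0 < 0 else (a, p)
            pieces.append((t0, first))
            pieces.append((tc, second))
    merged = []
    for t, u in pieces:                  # fuse neighbouring intervals with the same NN
        if not merged or merged[-1][1] != u:
            merged.append((t, u))
    return merged
```

Starting from [(0.0, first_point)] and inserting every other point yields the full result for a small in-memory dataset, which is handy for checking an index-based implementation.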

Heuristic 1
Given an intermediate entry E and query segment q, the subtree of E may contain qualifying points only if mindist(E, q) < SL_MAXD, where
• mindist(E, q) denotes the minimum distance between the MBR of E and q;
• SL_MAXD is the maximum distance between a split point and its NN.
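Heuristic 1 needs the minimum distance between a rectangle and a line segment. One possible way to compute it, sketched here under the assumption of 2D tuples and MBRs given as (xmin, ymin, xmax, ymax), is to return 0 when the segment intersects the rectangle (tested with Liang-Barsky clipping) and otherwise take the smallest endpoint-to-rectangle or corner-to-segment distance. All function names are assumptions of this sketch.

```python
import math

def point_rect_dist(p, rect):
    xmin, ymin, xmax, ymax = rect
    return math.hypot(max(xmin - p[0], 0.0, p[0] - xmax),
                      max(ymin - p[1], 0.0, p[1] - ymax))

def point_segment_dist(p, s, e):
    dx, dy = e[0] - s[0], e[1] - s[1]
    length_sq = dx * dx + dy * dy
    if length_sq == 0:
        return math.dist(p, s)
    # Project p onto the segment and clamp to its endpoints.
    t = ((p[0] - s[0]) * dx + (p[1] - s[1]) * dy) / length_sq
    t = max(0.0, min(1.0, t))
    return math.hypot(p[0] - (s[0] + t * dx), p[1] - (s[1] + t * dy))

def segment_hits_rect(s, e, rect):
    """Liang-Barsky test: does segment [s, e] intersect the rectangle?"""
    xmin, ymin, xmax, ymax = rect
    dx, dy = e[0] - s[0], e[1] - s[1]
    t0, t1 = 0.0, 1.0
    for p, q in ((-dx, s[0] - xmin), (dx, xmax - s[0]),
                 (-dy, s[1] - ymin), (dy, ymax - s[1])):
        if p == 0:
            if q < 0:
                return False   # parallel to this boundary and outside it
        else:
            r = q / p
            if p < 0:
                t0 = max(t0, r)
            else:
                t1 = min(t1, r)
            if t0 > t1:
                return False
    return True

def mindist_segment_rect(s, e, rect):
    if segment_hits_rect(s, e, rect):
        return 0.0
    xmin, ymin, xmax, ymax = rect
    corners = [(xmin, ymin), (xmin, ymax), (xmax, ymin), (xmax, ymax)]
    candidates = [point_rect_dist(s, rect), point_rect_dist(e, rect)]
    candidates += [point_segment_dist(c, s, e) for c in corners]
    return min(candidates)

def passes_heuristic_1(rect, s, e, sl_maxd):
    """E may contain qualifying points only if mindist(E, q) < SL_MAXD."""
    return mindist_segment_rect(s, e, rect) < sl_maxd
```

For two disjoint convex shapes the closest pair always involves a vertex of one of them, which is why checking endpoints and corners suffices once intersection has been ruled out.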

Heuristic 2
Given an intermediate entry E and query segment q, the subtree of E must be searched if and only if there exists a split point s_i ∈ SL such that dist(s_i, s_i.NN) > mindist(s_i, E).
Heuristic 2 requires a mindist computation between E and every split point. Hence it is applied only if E passes heuristic 1, which requires only one computation.
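Heuristic 2 can be sketched as a loop over the split list. As in the earlier sketches, the split list is assumed to be a list of (split_point, nn) pairs and MBRs are (xmin, ymin, xmax, ymax); the function names are hypothetical.

```python
import math

def mindist_point_rect(q, rect):
    xmin, ymin, xmax, ymax = rect
    return math.hypot(max(xmin - q[0], 0.0, q[0] - xmax),
                      max(ymin - q[1], 0.0, q[1] - ymax))

def passes_heuristic_2(rect, split_list):
    """True if the MBR reaches into the vicinity circle of some split point.

    Costs one mindist computation per split point, so it is meant to run
    only for entries that already passed the cheaper heuristic 1.
    """
    return any(math.dist(sp, nn) > mindist_point_rect(sp, rect)
               for sp, nn in split_list)
```

With split points (0, 0) and (6, 0) whose NNs are (1, 1) and (5, 1), the MBR (2, 2, 4, 4) is pruned, while (0.5, 0.5, 4, 4) must be searched because it intersects the vicinity circle of (0, 0).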

Heuristic 3 (Access Order)
Entries (satisfying heuristics 1 and 2) are accessed in increasing order of their minimum distances to the query segment q.
[Figure: two panels, "Before processing E_1" and "After processing E_1"]

An Example (Depth First Approach)
[Figure: four panels showing data points a to m, entries E_1 to E_6 and the query segment, one panel per state of SL]
SL = {s (.NN = f), e (.NN = f)}
SL = {s (.NN = f), s_1 (.NN = g), e (.NN = g)}
SL = {s (.NN = b), s_1 (.NN = f), s_2 (.NN = g), e (.NN = g)}
SL = {s (.NN = k), s_1 (.NN = f), s_2 (.NN = g), e (.NN = g)}

Cost Model for Uniform Data (real data are handled with histograms)
$$ d_{NN} \approx \sqrt{1 / (\pi N)} $$
[Figure: query segment q = [s, e] with the actual search region and the approximated search region of radius d_NN, together with two MBRs E_1 and E_2]
An optimal algorithm on R-trees must access only those nodes whose MBRs intersect the actual search region (i.e., E_1 but not E_2). To facilitate the analysis we focus on a more regular (approximated) region.

Node Access Probability
$$ P_{ACCESS}(E, q) = area(E_{EXT}) = \pi d_{NN}^2 + E.l_1 \cdot E.l_2 + 2 d_{NN} (E.l_1 + E.l_2 + q.l) + q.l \, (E.l_1 \cdot |\cos\theta| + E.l_2 \cdot |\sin\theta|) $$

Cost Model
$$ NA(CNN) = \sum_{i=0}^{h-1} N_i \cdot P_{ACCESS}(E.l_i, q) = \sum_{i=0}^{h-1} N_i \left( \pi d_{NN}^2 + E.l_i^2 + 2 \cdot d_{NN} \cdot (2 E.l_i + q.l) + q.l \cdot E.l_i \cdot (|\cos\theta| + |\sin\theta|) \right) $$
$$ n_{NN} = N \cdot area(R_{SEARCH}) = N (\pi d_{NN}^2 + 2 d_{NN} \cdot q.l) $$
• Various models have been proposed for E.l_i and N_i in the context of R-tree analysis.
• Our algorithm is I/O-bounded. Hence the above model (which produces the number of node accesses) reflects the performance.
• The performance on non-uniform data can be easily captured with histograms.

Experimental Settings
• Datasets: uniform; real: CA (130K points), ST (2M points).
• Queries (each a segment): location and orientation are randomly generated; length is set as a parameter.
• Performance is measured as the average over 200 queries.
• Machine: 1 GHz CPU, 256 MB memory.
• Page size = 4 KB (R-tree node capacity = 200).
• CNN is compared with TP (the only existing solution).

Exp 1: Cost Model Evaluation
[Figure: node accesses of the depth-first and best-first algorithms against the estimation for the optimal algorithm, on the Uniform and CA datasets. Top panels: node accesses vs query length (1% to 25%, k = 5). Bottom panels: node accesses vs k (1 to 9, query length = 12.5%).]

Exp 2: Performance vs Query Length
[Figure: CNN vs TP on the CA and ST datasets as query length varies from 1% to 25%. Panels per dataset: node accesses, CPU cost (sec), and total cost (sec) annotated with the CPU percentage.]

Exp 3: Performance vs k (number of neighbors to be retrieved for each point)
[Figure: CNN vs TP on the CA and ST datasets as k varies from 1 to 9. Panels per dataset: node accesses, CPU cost (sec), and total cost (sec) annotated with the CPU percentage.]

Conclusion
A fast algorithm for continuous k-NN (C-kNN) queries.
Future work:
• Rectangle data
• Moving data points
• Application to road networks (i.e., travel distance instead of Euclidean distance)
