Continuous Nearest Neighbor Search
Yufei Tao, Dimitris Papadias, Qiongmao Shen
Hong Kong University of Science and Technology, Clear Water Bay, Hong Kong

Point Nearest Neighbor (NN) Queries [Roussopoulos et al., SIGMOD 95; Hjaltason and Samet, TODS 99]
• Branch-and-bound algorithms use mindist between the query point q and an R-tree entry E to prune the search space:
– mindist(E, q) = the minimum distance between E and q
[Figure: query point q and two R-tree entries E1 and E2 in the x-y plane, with mindist(q, E1) and mindist(q, E2) marked.]

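
As an illustration of the mindist definition above, here is a minimal Python sketch for the two-dimensional case. It assumes that the MBR of an entry is stored as a tuple (xmin, ymin, xmax, ymax); the function name and this layout are choices made for the example, not something the slides prescribe.

```python
import math

def mindist_point_rect(q, rect):
    """Minimum Euclidean distance from point q = (x, y) to an axis-aligned
    rectangle rect = (xmin, ymin, xmax, ymax); 0.0 if q lies inside it."""
    x, y = q
    xmin, ymin, xmax, ymax = rect
    # Per-axis gap between q and the rectangle (0 when q is within the extent).
    dx = max(xmin - x, 0.0, x - xmax)
    dy = max(ymin - y, 0.0, y - ymax)
    return math.hypot(dx, dy)
```

Since no point inside E can be closer to q than mindist(E, q), an entry whose mindist exceeds the distance of the best neighbor found so far can be skipped.
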
Nearest Neighbor Search (NN) with R-Trees
• Depth-first (DF) and best-first (BF) algorithms.
[Figure: data points a to i grouped into MBRs, the corresponding R-tree (contents of some nodes omitted), the query point, and the search region.]
[Table: trace of the search with columns Action, Heap, Result. The root is visited first, then three entries are followed in turn while the result is still {empty}; finally h is reported as the NN and the search terminates.]

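
The best-first strategy can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the "R-tree" is a hand-built nested structure (make_node is a hypothetical helper, not an R-tree library call), data points are (x, y) tuples, and MBRs are (xmin, ymin, xmax, ymax) tuples.

```python
import heapq
import math

def mindist(q, rect):
    """Minimum distance from point q to rect = (xmin, ymin, xmax, ymax)."""
    dx = max(rect[0] - q[0], 0.0, q[0] - rect[2])
    dy = max(rect[1] - q[1], 0.0, q[1] - rect[3])
    return math.hypot(dx, dy)

def make_node(children):
    """Hypothetical node: children are data points (x, y) or other nodes."""
    boxes = [c + c if isinstance(c, tuple) else c["mbr"] for c in children]
    mbr = (min(b[0] for b in boxes), min(b[1] for b in boxes),
           max(b[2] for b in boxes), max(b[3] for b in boxes))
    return {"mbr": mbr, "children": children}

def best_first_nn(root, q):
    """Return (nearest point, distance), always expanding the closest heap entry."""
    heap = [(0.0, 0, root)]  # (distance key, tie-breaker, node or point)
    counter = 1
    while heap:
        dist, _, item = heapq.heappop(heap)
        if isinstance(item, tuple):
            # A point at the top of the heap is no farther than anything unseen.
            return item, dist
        for child in item["children"]:
            if isinstance(child, tuple):
                key = math.dist(q, child)
            else:
                key = mindist(q, child["mbr"])
            heapq.heappush(heap, (key, counter, child))
            counter += 1
    return None, math.inf

leaf1 = make_node([(1.0, 1.0), (2.0, 3.0)])
leaf2 = make_node([(8.0, 8.0), (9.0, 6.0)])
root = make_node([leaf1, leaf2])
nearest, distance = best_first_nn(root, (7.0, 7.0))
```

In this toy run the search expands the root and leaf2 only; leaf1 stays in the heap because its mindist is larger than the distance of the reported point.
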
Problem: Continuous Nearest Neighbor
• Data: a set of points.
• Query: a line segment q = [s, e].
• Result: the nearest neighbor (NN) of every point on q.
• Result representation: {s(.NN=a), s1(.NN=c), s2(.NN=f), s3(.NN=h), e}, i.e., a is the NN of every point in [s, s1], c in [s1, s2], f in [s2, s3], and h in [s3, e].
For simplicity we present continuous 1-NN; the solution generalizes to k-NN and to trajectories consisting of multiple line segments (see paper).

Previous Approach – Time Parameterized (TP) Queries (Tao and Papadias, SIGMOD 02)
• Step 1: Find the NN of the start point s, i.e., point a.
• Step 2: Use the TP technique to find the first point on the line segment (s1) where the NN changes, and the point (c) that becomes the next NN.

TP NN (cont.)
• From Step 2 we know that the next NN change occurs at s1, where c becomes the NN.
• Step 3: Perform another TP NN query to find how far we can travel from s1 before the current NN (i.e., c) changes.
• Repeat until the entire segment has been processed.
• Problem: # of TP queries = # of NN changes (i.e., the cost is output sensitive).

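
The repeat-until-done structure of this approach can be illustrated with a brute-force Python sketch. It is not the R-tree-based TP algorithm of the cited paper: each "TP query" is replaced here by a linear scan over all points, and the function name and the parameterization q(t) = s + t(e − s), t in [0, 1], are assumptions made for the example.

```python
import math

def cnn_by_repeated_tp(points, s, e):
    """Return [(t, nn), ...]: nn is the NN of q(t') = s + t'*(e - s) for t'
    from t up to the next entry's t (or 1.0 for the last entry)."""
    dx, dy = e[0] - s[0], e[1] - s[1]

    def line(p):
        # dist(q(t), p)^2 = |e - s|^2 * t^2 + b*t + c. The t^2 term is shared
        # by all points, so comparing b*t + c is enough to compare distances.
        b = 2.0 * (dx * (s[0] - p[0]) + dy * (s[1] - p[1]))
        c = (s[0] - p[0]) ** 2 + (s[1] - p[1]) ** 2
        return b, c

    # Step 1: NN of the start point (ties go to the point that stays closer).
    cur = min(points, key=lambda p: (line(p)[1], line(p)[0]))
    result = [(0.0, cur)]
    while True:
        b0, c0 = line(cur)
        best = None  # (crossing parameter, slope, point) of the next NN change
        for p in points:
            b, c = line(p)
            if b < b0:  # only such points can overtake the current NN later
                t_cross = (c - c0) / (b0 - b)
                if t_cross < 1.0 and (best is None or (t_cross, b) < best[:2]):
                    best = (t_cross, b, p)
        if best is None:
            return result
        cur = best[2]
        result.append((best[0], cur))
```

The while loop runs once per NN change plus one final pass, which mirrors the output-sensitive cost noted on the slide.
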
Our Goal
• Find all split points s1, s2, s3 (as well as the corresponding NN for each partition) with a single traversal of the dataset.
• Term 1: The set of split points (including s and e) constitutes the split list.
• Term 2: The circle centered at split point s_i with radius dist(s_i, s_i.NN) is the vicinity circle of s_i.
• Term 3: A data point u covers a point s if u = s.NN. E.g., points a, c, f, h cover segments [s, s1], [s1, s2], [s2, s3], [s3, e], respectively.

Lemma 1
• Given a split list SL = {s_0, s_1, …, s_(|SL|−1)} and a new data point p: p covers some point on the query segment q if and only if p covers a split point.
[Figure: the split list and vicinity circles after processing a, and after processing c.]

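
Lemma 1 reduces the test "can p become the NN of some point on q?" to a check against the split points only. The Python sketch below assumes that "p covers a split point" means p is closer to that split point than its current NN, i.e., p falls strictly inside its vicinity circle. The split list layout, a list of (split point, current NN) pairs, is hypothetical.

```python
import math

def covers_some_split_point(p, split_list):
    """True if p lies strictly inside the vicinity circle of at least one
    split point. split_list holds (split_point, current_nn) pairs."""
    return any(math.dist(p, sp) < math.dist(sp, nn) for sp, nn in split_list)

# q = [(0, 0), (10, 0)]; data points a and b have been processed so far.
a, b = (0.0, 1.0), (10.0, 1.0)
split_list = [((0.0, 0.0), a), ((5.0, 0.0), b), ((10.0, 0.0), b)]

qualifies_near = covers_some_split_point((5.0, 2.0), split_list)
qualifies_far = covers_some_split_point((5.0, 6.0), split_list)
```

Here (5, 2) is inside the vicinity circle of the middle split point and therefore qualifies, while (5, 6) is outside all three circles and, by the lemma, cannot be the NN of any point on q.
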
Lemma 2 (Covering Continuity)
• The split points covered by a point p are continuous: if p covers split point s_i but not s_(i−1) (or s_(i+1)), then p cannot cover s_(i−j) (or s_(i+j)) for any j > 1.
[Figure: query segment q with split points s_(i−1), …, s_(i+3), data points a, b, c, d, f, g, and a new point p.]
• Example: SL = {s_(i−1)(.NN=a), s_i(.NN=b), s_(i+1)(.NN=c), s_(i+2)(.NN=d), s_(i+3)(.NN=f)}

Algorithm with R-trees: Overview
• Use branch-and-bound techniques to prune the search space.
• When a leaf entry (i.e., a data point) p is encountered, SL is updated if p covers any split point, i.e., p is a qualifying entry (by Lemma 1).
• For an intermediate entry, we visit its subtree only if it may contain a qualifying data point (decided with heuristics).

Heuristic 1
• Given an intermediate entry E and query segment q, the subtree of E may contain qualifying points only if mindist(E, q) < SL_MAXD, where
– mindist(E, q) denotes the minimum distance between the MBR of E and q;
– SL_MAXD is the maximum distance between a split point and its NN.

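
Heuristic 1 needs the minimum distance between a rectangle and a line segment. One possible way to compute it is sketched below in Python; the slides do not say how mindist(E, q) is implemented, so the decomposition (a clipping test for intersection, then endpoint and corner distances) and all function names are assumptions of this example. MBRs are (xmin, ymin, xmax, ymax) tuples.

```python
import math

def mindist_point_rect(p, rect):
    dx = max(rect[0] - p[0], 0.0, p[0] - rect[2])
    dy = max(rect[1] - p[1], 0.0, p[1] - rect[3])
    return math.hypot(dx, dy)

def dist_point_segment(p, s, e):
    dx, dy = e[0] - s[0], e[1] - s[1]
    len2 = dx * dx + dy * dy
    t = 0.0 if len2 == 0 else ((p[0] - s[0]) * dx + (p[1] - s[1]) * dy) / len2
    t = max(0.0, min(1.0, t))  # clamp the projection onto the segment
    return math.hypot(p[0] - (s[0] + t * dx), p[1] - (s[1] + t * dy))

def segment_intersects_rect(s, e, rect):
    """Liang-Barsky style clipping test for segment [s, e] against rect."""
    t0, t1 = 0.0, 1.0
    dx, dy = e[0] - s[0], e[1] - s[1]
    for p_i, q_i in ((-dx, s[0] - rect[0]), (dx, rect[2] - s[0]),
                     (-dy, s[1] - rect[1]), (dy, rect[3] - s[1])):
        if p_i == 0:
            if q_i < 0:
                return False  # parallel to this boundary and outside it
        else:
            r = q_i / p_i
            if p_i < 0:
                t0 = max(t0, r)
            else:
                t1 = min(t1, r)
            if t0 > t1:
                return False
    return True

def mindist_rect_segment(rect, s, e):
    if segment_intersects_rect(s, e, rect):
        return 0.0
    # For disjoint convex shapes the minimum is attained at a segment
    # endpoint or at a rectangle corner.
    corners = [(rect[0], rect[1]), (rect[0], rect[3]),
               (rect[2], rect[1]), (rect[2], rect[3])]
    return min([mindist_point_rect(s, rect), mindist_point_rect(e, rect)]
               + [dist_point_segment(c, s, e) for c in corners])

def passes_heuristic_1(rect, s, e, sl_maxd):
    return mindist_rect_segment(rect, s, e) < sl_maxd
```

For example, with q = [(0, 0), (10, 0)] and an MBR (2, 3, 4, 5), mindist is 3, so the entry is pruned when SL_MAXD is 2 and kept when SL_MAXD is 4.
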
Heuristic 2
• Given an intermediate entry E and query segment q, the subtree of E must be searched if and only if there exists a split point s_i ∈ SL such that dist(s_i, s_i.NN) > mindist(s_i, E).
• Heuristic 2 requires a mindist computation between E and every split point. Hence it is applied only if E passes heuristic 1, which requires a single computation.

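
A minimal Python sketch of the Heuristic 2 test follows. The split list layout, a list of (split point, current NN) pairs, and the MBR layout (xmin, ymin, xmax, ymax) are assumptions of this example.

```python
import math

def mindist_point_rect(p, rect):
    dx = max(rect[0] - p[0], 0.0, p[0] - rect[2])
    dy = max(rect[1] - p[1], 0.0, p[1] - rect[3])
    return math.hypot(dx, dy)

def passes_heuristic_2(rect, split_list):
    """True if the MBR reaches into the vicinity circle of some split point."""
    return any(math.dist(sp, nn) > mindist_point_rect(sp, rect)
               for sp, nn in split_list)

# q = [(0, 0), (10, 0)]; points a and b have been processed, split point at x = 6.2.
a, b = (0.0, 1.0), (10.0, 5.0)
split_list = [((0.0, 0.0), a), ((6.2, 0.0), b), ((10.0, 0.0), b)]

e_far = (-3.0, 3.0, -1.0, 4.0)   # hypothetical MBR to the upper left of s
e_near = (5.0, 2.0, 7.0, 3.0)    # hypothetical MBR above the middle split point
far_ok = passes_heuristic_2(e_far, split_list)
near_ok = passes_heuristic_2(e_near, split_list)
```

In this example SL_MAXD is about 6.28 (the vicinity radius at the middle split point) and e_far is about 3.16 away from q, so heuristic 1 alone would not prune it. Heuristic 2 does, because e_far is outside every individual vicinity circle.
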
Heuristic 3 (Access Order)
• Entries (satisfying heuristics 1 and 2) are accessed in increasing order of their minimum distances to the query segment q.
[Figure: the split list before and after processing E1.]

An Example (Depth-First Approach)
[Figure: four snapshots of the dataset (points a to m, MBRs E1 to E6) and the query segment, one for each state of the split list:]
• SL = {s(.NN=f), e(.NN=f)}
• SL = {s(.NN=f), s1(.NN=g), e(.NN=g)}
• SL = {s(.NN=b), s1(.NN=f), s2(.NN=g), e(.NN=g)}
• SL = {s(.NN=k), s1(.NN=f), s2(.NN=g), e(.NN=g)}

Cost Model for Uniform Data (real data are handled with histograms)
[Figure: the actual search region and the approximated search region around the query segment, with radius d_NN, and two node MBRs E1 and E2.]
• $d_{NN} \approx 1/\sqrt{\pi N}$
• An optimal algorithm on R-trees must access only those nodes whose MBRs intersect the actual search region (i.e., E1 but not E2).
• To facilitate the analysis we focus on a more regular (approximated) region.

Node Access Probability
$$P_{ACCESS}(E, q) = area(E_{EXT}) = \pi d_{NN}^2 + E.l_1 \cdot E.l_2 + 2 d_{NN} (E.l_1 + E.l_2 + q.l) + q.l \, (E.l_1 \cdot |\cos\theta| + E.l_2 \cdot |\sin\theta|)$$

Cost Model
$$NA(CNN) = \sum_{i=0}^{h-1} N_i \cdot P_{ACCESS}(E.l_i, q) = \sum_{i=0}^{h-1} N_i \cdot \left( \pi d_{NN}^2 + E.l_i^2 + 2 d_{NN} \cdot (2 \cdot E.l_i + q.l) + q.l \cdot E.l_i \cdot (|\cos\theta| + |\sin\theta|) \right)$$
$$n_{NN} = N \cdot area(R_{SEARCH}) = N \left( \pi d_{NN}^2 + 2 d_{NN} \cdot q.l \right)$$
• Various models have been proposed for E.l_i and N_i in the context of R-tree analysis.
• Our algorithm is I/O-bounded, so the above model (which produces the number of node accesses) reflects its performance.
• The performance on non-uniform data can be captured with histograms.

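
To get a feel for the magnitudes involved, the quantity N · area(R_SEARCH) can be evaluated numerically. The Python sketch below assumes N points uniformly distributed in a unit square and the estimate d_NN ≈ 1/sqrt(πN), i.e., a circle of radius d_NN is expected to contain one point. The function names are hypothetical, and boundary effects are ignored.

```python
import math

def expected_nn_distance(n_points):
    """Assumed estimate d_NN = 1 / sqrt(pi * N) for uniform data in a unit square."""
    return 1.0 / math.sqrt(math.pi * n_points)

def search_region_area(d_nn, q_len):
    """Area of the approximated search region: all locations within d_nn of a
    segment of length q_len (a rectangle plus two half discs)."""
    return math.pi * d_nn ** 2 + 2.0 * d_nn * q_len

def expected_points_in_search_region(n_points, q_len):
    d_nn = expected_nn_distance(n_points)
    return n_points * search_region_area(d_nn, q_len)

example = expected_points_in_search_region(10_000, 0.1)
```

Under these assumptions N · π · d_NN² equals 1, so the result simplifies to 1 + 2 · q.l · sqrt(N/π); it grows linearly with the query length and with the square root of the cardinality.
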
Experimental Settings
• Datasets: Uniform; real: CA (130K points), ST (2M points).
• Queries (each a segment): location and orientation are randomly generated; length is a parameter.
• Performance is measured as the average over 200 queries.
• Machine: 1 GHz CPU, 256 MB memory.
• Page size = 4 KB (R-tree node capacity = 200).
• We compare CNN with TP (the only existing solution).

Exp 1: Cost Model Evaluation
[Figure: node accesses of depth-first, best-first, and the estimation for the optimal algorithm, on Uniform and CA. One pair of plots shows node accesses vs. query length (1% to 25%, k=5); the other shows node accesses vs. k (1 to 9, query length=12.5%).]

Exp 2: Performance vs Query Length
[Figure: node accesses, CPU cost (sec), and total cost (sec, annotated with the CPU percentage) of CNN and TP as a function of query length (1% to 25%), for CA and ST.]

Exp 3: Performance vs k (number of neighbors to be retrieved for each point)
[Figure: node accesses, CPU cost (sec), and total cost (sec, annotated with the CPU percentage) of CNN and TP as a function of k (1 to 9), for CA and ST.]

Conclusion
• A fast algorithm for C-kNN queries.
• Future work:
– Rectangle data
– Moving data points
– Application to road networks (i.e., travel distance instead of Euclidean distance)