9/23/2009



  1. 9/23/2009

     CONTINUOUS NEAREST NEIGHBOR SEARCH
     Yufei Tao, Dimitris Papadias, Qiongmao Shen
     Hong Kong University of Science and Technology
     Presented: Penny Bei Pan

     CONFERENCES
     Short Name   Full Name
     SIGMOD       Special Interest Group on Management of Data
     VLDB         Very Large Data Bases
     ICDE         International Conference on Data Engineering

     OVERVIEW
     • Introduction
     • Preliminary & Related Work
     • Continuous k-Nearest Neighbor Query (CkNN)
       - Definition
       - Problem Characteristics
       - R-tree algorithm
       - Query analysis
       - Complex CNN extension
     • Experiments
     • Discussion and Conclusion

     INTRODUCTION
     [Figure: objects and a query point]
     • Continuous Nearest Neighbor
     • Why called "continuous"?
       - Nearest neighbor of every point in the trajectory

     PRELIMINARY - POINT NN QUERIES
     • Branch and bound algorithms use mindist between the query point q and an R-tree entry E to prune the search space:
       - mindist(E, q) = the minimum distance between E and q
     [Figure: R-tree with entries E1 to E6 over data points f, k, l, m]

     PRELIMINARY - POINT NN QUERIES
     • Depth-first (DF) and Best-first (BF) algorithms
       - E: R-tree entry
       - q: query point
       - DF: choose the entry with minimum mindist
       - BF: choose the minimum among all entries visited so far (heap)
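The mindist pruning and the best-first (BF) traversal on the two preliminary slides can be sketched as follows. This is only an illustration, not the presenters' code: the dict-based tree layout (`"mbr"`, `"children"`, `"points"`) and the function names are assumptions made for this example, and a real R-tree would supply its own node structure.

```python
import heapq
import math


def mindist(rect, q):
    """Minimum distance from point q to rectangle (xmin, ymin, xmax, ymax)."""
    xmin, ymin, xmax, ymax = rect
    dx = max(xmin - q[0], 0.0, q[0] - xmax)
    dy = max(ymin - q[1], 0.0, q[1] - ymax)
    return math.hypot(dx, dy)


def best_first_nn(root, q):
    """Best-first NN search over a hypothetical dict-based tree.

    Internal node: {"mbr": rect, "children": [...]}
    Leaf node:     {"mbr": rect, "points": [(x, y), ...]}
    """
    best, best_d = None, math.inf
    tie = 0  # tie-breaker so the heap never has to compare dicts
    heap = [(mindist(root["mbr"], q), tie, root)]
    while heap:
        d, _, node = heapq.heappop(heap)
        if d >= best_d:
            break  # every remaining entry is at least this far from q
        if "points" in node:
            for p in node["points"]:
                dp = math.dist(p, q)
                if dp < best_d:
                    best, best_d = p, dp
        else:
            for child in node["children"]:
                tie += 1
                heapq.heappush(heap, (mindist(child["mbr"], q), tie, child))
    return best, best_d


# Tiny hand-built tree (assumed data, for illustration only).
leaf1 = {"mbr": (1, 1, 2, 2), "points": [(1, 1), (2, 2)]}
leaf2 = {"mbr": (5, 5, 6, 6), "points": [(5, 5), (6, 6)]}
root = {"mbr": (1, 1, 6, 6), "children": [leaf1, leaf2]}
nn, d = best_first_nn(root, (0, 0))
```

In this toy run the second leaf is never opened, because its mindist already exceeds the best distance found in the first leaf. A depth-first variant would instead recurse into the child with the smallest mindist and backtrack.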

  2. PRELIMINARY - CONTINUOUS NEAREST NEIGHBOR
     [Figure: data points around the query segment q = [s, e]]
     • Data: a set of points (P = {a, b, c, d, f, g, h})
     • Query: a line segment q = [s, e]
     • Result: the nearest neighbor (NN) of every point on q
     • Result representation: {<a, [s, s_1]>, <c, [s_1, s_2]>, <f, [s_2, s_3]>, <h, [s_3, e]>}

     RELATED WORK - SAMPLING
     • Try to convert the continuous-NN to point-NN
       - Every point on the line -> unlimited points
       - Sampling
     • Drawback:
       - Sample rate low -> incorrect
       - Sample rate high -> overhead (still cannot guarantee accuracy)
     • Time Parameterized queries
       - Output (R, T, C): result, time period, changing point
       - Tao, Y., Papadias, D. Time Parameterized Queries in Spatio-Temporal Databases. ACM SIGMOD, 2002.

     RELATED WORK - TIME PARAMETERIZED NN
     • Step 1: Find the NN of the start point s, i.e., point a.
     • Step 2: Use the TP technique to find the first point on the line segment (s_1) where the NN changes; the point causing the change (i.e., point c) becomes the next NN.

     RELATED WORK - TP NN (CONT.)
     • Step 3: Perform another TP NN to find:
       - Starting from s_1, how far we need to travel for the current NN (i.e., c) to change to f.
     • Repeat this until we finish the entire segment.

     RELATED WORK - TP NN (CONT.)
     [Figure: points d, f, g, h and the split points on the query segment]
     • Intuitively: perpendicular bisector intersected with segment [s, e]
     • Not only NN, but also supports k-NN
     • Still overhead: n times
     Yufei Tao, Dimitris Papadias

     CKNN - DEFINITION
     • Goal: Find all split points (as well as the corresponding NN for each partition) with a single traversal.
     • Split list: the set of split points (including s and e).
     • Vicinity circle: the circle centered at split point s_i with radius dist(s_i, s_i.NN).
     • We say a data point u covers a point s if u = s.NN. E.g., points a and c cover segments [s, s_1] and [s_1, s_2].
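The repeated TP-style procedure from the related-work slides (find the NN at s, find where it changes, repeat) can be illustrated with a brute-force sketch that uses no index at all. It relies on one fact: along q, the squared distance to every data point has the same quadratic term, so the difference between two points is linear in the position t, and the crossing is exactly where the perpendicular bisector meets [s, e]. The function name and the parameterization of q by t in [0, 1] are assumptions of this sketch.

```python
def tp_style_cnn(points, s, e):
    """Return [(nn, t_start, t_end), ...] for positions s + t*(e - s), t in [0, 1]."""
    ex, ey = e[0] - s[0], e[1] - s[1]
    # Squared distance from s + t*(e - s) to p is a + b*t + |e - s|^2 * t^2.
    # The t^2 term is shared by all points, so comparing a + b*t is enough.
    lines = []
    for p in points:
        dx, dy = s[0] - p[0], s[1] - p[1]
        lines.append((dx * dx + dy * dy, 2 * (dx * ex + dy * ey), p))
    # Step 1: NN of the start point s (ties go to the smaller slope).
    cur = min(lines, key=lambda ln: (ln[0], ln[1]))
    t, result = 0.0, []
    while True:
        a, b, p = cur
        nxt = None  # (crossing position, slope, line) of the earliest change
        for a2, b2, p2 in lines:
            if b2 < b:  # only a point with a smaller slope can overtake
                tc = max(t, (a2 - a) / (b - b2))
                if tc < 1.0 and (nxt is None or (tc, b2) < nxt[:2]):
                    nxt = (tc, b2, (a2, b2, p2))
        if nxt is None:
            result.append((p, t, 1.0))  # current NN holds until e
            return result
        # Steps 2 and 3: record the split point, then continue from it.
        result.append((p, t, nxt[0]))
        t, cur = nxt[0], nxt[2]


res = tp_style_cnn([(0, 1), (10, 1)], (0, 0), (10, 0))
# res == [((0, 1), 0.0, 0.5), ((10, 1), 0.5, 1.0)]
```

Each loop iteration scans all points once, which mirrors the overhead noted on the slide: one NN-change search per neighbor in the result. The CkNN goal on the last slide is to get the same split list with a single traversal.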

  3. CKNN - PROBLEM CHARACTERISTICS
     • Lemma 1: Given a split list SL = {s_0, s_1, …, s_{|SL|-1}} and a new data point p, p covers some point on query segment q if and only if p covers a split point.

     CKNN - PROBLEM CHARACTERISTICS
     • Lemma 2 (Covering Continuity):
       - The split points covered by a point p are continuous.
       - Namely, if p covers split point s_i but not s_{i-1} (or s_{i+1}), then p cannot cover s_{i-j} (or s_{i+j}) for any value of j > 1.

     CKNN - PROBLEM CHARACTERISTICS
     • How about the k-NN?
       - Lemma 1: still applies
       - Lemma 2: does not apply
     • E.g., k = 3

     CKNN - R-TREE ALGORITHM
     • General key notes:
       - Use branch-and-bound techniques to prune the search space.
     • R-tree traversal principle:
       - When a leaf entry (i.e., a data point) p is encountered, SL is updated if p covers any split point (i.e., p is a qualifying entry), by Lemma 1.
       - For an intermediate entry, we visit its subtree only if it may contain a qualifying data point; this is decided by heuristics.
       - Avoid accessing non-qualifying nodes.

     R-TREE ALGORITHM - HEURISTIC 1
     • Given an intermediate entry E and query segment q, the subtree of E may contain qualifying points only if mindist(E, q) < SL_MAXD, where SL_MAXD is the maximum distance between a split point and its NN.
     [Figure: computing mindist(E, q)]

     R-TREE ALGORITHM - HEURISTIC 2 (AFTER 1)
     • Given an intermediate entry E and query segment q, the subtree of E must be searched if and only if there exists a split point s_i ∈ SL such that dist(s_i, s_i.NN) > mindist(s_i, E).
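Heuristics 1 and 2 above can be sketched as two predicates over an MBR. Everything here is an assumption for illustration: the rectangle layout `(xmin, ymin, xmax, ymax)`, the split list as a list of `(split_point, current_nn)` pairs, and the way mindist(E, q) is computed for a segment (zero if the segment touches the rectangle, otherwise the smallest corner-to-segment or endpoint-to-rectangle distance). The slides do not say how the presenters computed it.

```python
import math


def mindist_point_rect(rect, p):
    xmin, ymin, xmax, ymax = rect
    dx = max(xmin - p[0], 0.0, p[0] - xmax)
    dy = max(ymin - p[1], 0.0, p[1] - ymax)
    return math.hypot(dx, dy)


def dist_point_segment(p, s, e):
    ex, ey = e[0] - s[0], e[1] - s[1]
    len2 = ex * ex + ey * ey
    t = 0.0
    if len2 > 0:
        # Project p onto the segment and clamp to its endpoints.
        t = max(0.0, min(1.0, ((p[0] - s[0]) * ex + (p[1] - s[1]) * ey) / len2))
    return math.dist(p, (s[0] + t * ex, s[1] + t * ey))


def segment_hits_rect(rect, s, e):
    """Slab test: does segment [s, e] touch the axis-aligned rectangle?"""
    t0, t1 = 0.0, 1.0
    for axis in (0, 1):
        lo, hi = rect[axis], rect[axis + 2]
        delta = e[axis] - s[axis]
        if delta == 0:
            if not lo <= s[axis] <= hi:
                return False
        else:
            ta, tb = (lo - s[axis]) / delta, (hi - s[axis]) / delta
            t0, t1 = max(t0, min(ta, tb)), min(t1, max(ta, tb))
            if t0 > t1:
                return False
    return True


def mindist_segment_rect(rect, s, e):
    if segment_hits_rect(rect, s, e):
        return 0.0
    xmin, ymin, xmax, ymax = rect
    corners = [(xmin, ymin), (xmin, ymax), (xmax, ymin), (xmax, ymax)]
    candidates = [dist_point_segment(c, s, e) for c in corners]
    candidates += [mindist_point_rect(rect, s), mindist_point_rect(rect, e)]
    return min(candidates)


def heuristic1(rect, s, e, split_list):
    """Cheap filter: mindist(E, q) < SL_MAXD."""
    sl_maxd = max(math.dist(si, nn) for si, nn in split_list)
    return mindist_segment_rect(rect, s, e) < sl_maxd


def heuristic2(rect, split_list):
    """Exact test: some vicinity circle reaches into E."""
    return any(math.dist(si, nn) > mindist_point_rect(rect, si)
               for si, nn in split_list)


s, e = (0, 0), (10, 0)
split_list = [((0, 0), (0, 1)), ((10, 0), (0, 1))]  # point (0, 1) covers s and e
```

With this split list, the rectangle `(-3, 4, -1, 6)` passes heuristic 1 but fails heuristic 2, which is the case the "(after 1)" in the slide title is about: the second check costs one distance per split point, so it is applied only to entries that survive the first.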

  4. R-TREE ALGORITHM - HEURISTIC 3 (ORDER)
     • Entries (satisfying heuristics 1 and 2) are accessed in increasing order of their minimum distances to the query segment q.

     R-TREE ALGORITHM - LEAF ENTRY
     • Input: new entry p, SL = {s_1, …, s_10}
     • 1) Retrieve the split points covered by p
     • 2) Update SL
     • Binary search: start at s_5, then s_2, …
     • Use the bisector to judge the direction

     CKNN - R-TREE ALGORITHM (EXAMPLE)
     • Depth First
     [Figure: example traversal]

     ANALYSIS - COST MODEL FOR UNIFORM DATA
     [Figure: actual search region and approximate search region]
     • An optimal algorithm on R-trees must access only those nodes whose MBRs intersect the actual search region (i.e., E1 but not E2).
     • To facilitate the analysis we focus on a more regular (approximated) region.

     ANALYSIS - NODE ACCESS PROBABILITY
     • P_ACCESS is the probability that the MBR E of a node intersects the search region.

     ANALYSIS - COST MODEL (NODE ACCESS)
     • Dataset cardinality N
     • R-tree structure (height: h)
     • The query length: q.l
     • The orientation angle
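The leaf-entry step (retrieve what p covers, then update SL) can be sketched as below. Note the simplification: this version scans every interval linearly instead of using the binary search described on the slide. The interval representation `(t_start, t_end, nn)` with t in [0, 1] along q and the function names are assumptions of this sketch. It checks p only at interval endpoints, i.e. at split points, which is what Lemma 1 permits.

```python
def coeffs(p, s, e):
    """Squared distance from s + t*(e - s) to p is a + b*t + |e - s|^2 * t^2."""
    dx, dy = s[0] - p[0], s[1] - p[1]
    ex, ey = e[0] - s[0], e[1] - s[1]
    return dx * dx + dy * dy, 2 * (dx * ex + dy * ey)


def update_split_list(sl, p, s, e):
    """Return a new interval list after considering data point p."""
    ap, bp = coeffs(p, s, e)
    out = []

    def emit(t0, t1, nn):
        # Merge with the previous interval when the NN is the same.
        if out and out[-1][2] == nn and out[-1][1] == t0:
            out[-1] = (out[-1][0], t1, nn)
        else:
            out.append((t0, t1, nn))

    for t0, t1, u in sl:
        au, bu = coeffs(u, s, e)
        g0 = (ap + bp * t0) - (au + bu * t0)  # negative: p is closer at t0
        g1 = (ap + bp * t1) - (au + bu * t1)  # negative: p is closer at t1
        if g0 >= 0 and g1 >= 0:
            emit(t0, t1, u)  # p covers neither endpoint: interval unchanged
        elif g0 <= 0 and g1 <= 0:
            emit(t0, t1, p)  # p covers the whole interval
        else:
            # The gap is linear in t, so the new split point is where it is zero.
            tc = t0 + (t1 - t0) * g0 / (g0 - g1)
            if g0 < 0:
                emit(t0, tc, p)
                emit(tc, t1, u)
            else:
                emit(t0, tc, u)
                emit(tc, t1, p)
    return out


s, e = (0, 0), (10, 0)
sl = [(0.0, 1.0, (0, 1))]                 # (0, 1) is the NN of the whole segment
sl = update_split_list(sl, (10, 1), s, e)
# sl == [(0.0, 0.5, (0, 1)), (0.5, 1.0, (10, 1))]
```

By Lemma 2 the intervals that p takes over are contiguous, which is why a binary search over the split points, steered by the bisector, can replace the linear scan used here.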

  5. ANALYSIS - COST MODEL (CONT.)
     • The number of distinct neighbors in the final result.
     • CPU overhead comparison
       - TP: increases with n_NN
       - This paper: increases with dataset size N, query length l…

     OTHER CNN QUERY
     • kCNN query (k = 2)
       - Updating the vicinity circle
     • Trajectory NN query (TNN)
       - q1 = [s, u]
       - q2 = [u, v]
       - q3 = [v, e]
       - Each segment has a SL
       - Treated one by one

     EXPERIMENTS
     • Datasets:
       - Uniform
       - Real street segments: CA (130K points), ST (2M points)
     • Queries (each a segment):
       - Location and orientation randomly generated
       - Length is set as a parameter
     • Performance is measured as the average of running 200 queries.
     • Machine:
       - 1 GHz CPU, 256 MB memory
       - Page size = 4 KB (R-tree node capacity = 200)
     • Compare CNN and TP (the only existing solution)

     EXP 1: COST MODEL EVALUATION
     [Figure]

     EXP 2: PERFORMANCE VS QUERY LENGTH
     [Figure]

     EXP 3: PERFORMANCE VS K
     [Figure]
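The query workload described on the experiments slide (random location and orientation, fixed length, average over 200 queries) could be generated roughly as follows. This is a guess at a reasonable setup, not the presenters' harness: the unit-square data space, the seed, and the `cost_fn` callback (standing in for whatever is measured per query, such as node accesses or CPU time) are all assumptions.

```python
import math
import random


def random_segment_query(length, rng, extent=1.0):
    """Segment with a random start in [0, extent]^2 and a random orientation."""
    x, y = rng.uniform(0.0, extent), rng.uniform(0.0, extent)
    theta = rng.uniform(0.0, 2.0 * math.pi)
    # Note: the end point may fall outside the data space; no clipping is done.
    return (x, y), (x + length * math.cos(theta), y + length * math.sin(theta))


def average_cost(cost_fn, length, n_queries=200, seed=0):
    """Average a hypothetical per-query cost over n_queries random segments."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_queries):
        s, e = random_segment_query(length, rng)
        total += cost_fn(s, e)
    return total / n_queries


# Sanity check with a stand-in cost: the segment's own length.
avg_len = average_cost(lambda s, e: math.dist(s, e), 0.1)
```

Fixing the seed keeps the 200 queries identical across the algorithms being compared, so differences in the averages come from the algorithms rather than from the workload.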

  6. EXPERIMENTS - KEY NOTES
     • In general, CNN outperforms TP significantly
       - Single traversal
     • For the cost model:
       - BF is better than DF (consistent with previous work)
       - The cost model is accurate
     • Performance & query length
       - As length increases, the number of split points increases
       - CPU for TP: keeps retrieving the same objects repeatedly
     • Performance & k
       - For CNN: k has little influence on NA (node accesses), but k influences CPU: higher number of split points

     DISCUSSION AND CONCLUSION
     • A fast algorithm for the C-kNN query.
     • Future work:
       - Rectangle data
       - Moving data points
       - Application to road networks (i.e., travel distance instead of Euclidean distance)
     • Thank you!
