Shooting Stars in the Sky: An Online Algorithm for Skyline Queries

Donald Kossmann (kossmann@in.tum.de)
Frank Ramsak (frank.ramsak@forwiss.de)
Steffen Rost (rost@in.tum.de)

Technische Universität München, Institut für Informatik
Boltzmannstr. 3, 85748 Garching b. München, Germany
Outline
• Motivation
  – Skyline & known algorithms
  – Challenges in online scenarios
• The NN algorithm for Skyline queries
  – Algorithm for 2D
  – Relationship between NN and Skyline
  – Algorithm for higher dimensionality
• Evaluation
• Supporting user control
• Summary

Shooting Stars in the Sky, VLDB 2002, Hong Kong
What is the Skyline?
• Literature: minimum/maximum vector problem
  – Two vectors are not necessarily comparable
  – Dominance: a point dominates another point if it is as good or better in all dimensions and better in at least one dimension
  – Skyline: all points of a data set that are not dominated by any other point
[Figure: x-axis: price [€]; y-axis: distance to the beach [km]]
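The dominance definition above can be turned into a brute-force check. The following is a minimal sketch, assuming that smaller values are better in every dimension (cheaper, closer to the beach); the names `dominates`, `skyline`, and the sample `hotels` data are illustrative and not taken from the slides.

```python
def dominates(a, b):
    """True if a is as good or better than b in all dimensions
    and strictly better in at least one (smaller is better)."""
    return (all(x <= y for x, y in zip(a, b))
            and any(x < y for x, y in zip(a, b)))


def skyline(points):
    """Brute force: keep every point that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical data: (price in EUR, distance to the beach in km)
hotels = [(50, 3.0), (80, 1.0), (120, 0.2), (90, 2.0), (60, 3.5)]
print(skyline(hotels))  # [(50, 3.0), (80, 1.0), (120, 0.2)]
```

Here (90, 2.0) is dominated by (80, 1.0) and (60, 3.5) by (50, 3.0), so only three points remain.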
Traditional Skyline algorithms
• Blocking algorithms: must read the complete data set before returning results
  – Compare each point with all other points
  – [Börzsönyi et al., ICDE 01]
    • Block-Nested-Loops (BNL): keep a window of candidate Skyline points
    • Divide-and-Conquer (D&C): divide the data set, compute partial Skylines, and merge them
• Progressive algorithms [Tan et al., VLDB 01]
  – Bitmap: operations on range-encoded bitmaps
  – Index: transformation of d-dimensional space to one dimension + B-Tree
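The window idea behind BNL can be sketched as follows. This is a simplified illustration, assuming the candidate window always fits in memory and that smaller values are better; it omits the overflow handling of the published algorithm, and the names `bnl_skyline` and `data` are hypothetical.

```python
def dominates(a, b):
    """a is as good or better everywhere and strictly better somewhere."""
    return (all(x <= y for x, y in zip(a, b))
            and any(x < y for x, y in zip(a, b)))


def bnl_skyline(points):
    """Single pass with an in-memory window of candidate Skyline points."""
    window = []
    for p in points:
        # Discard p if a current candidate dominates it.
        if any(dominates(w, p) for w in window):
            continue
        # Otherwise p evicts every candidate it dominates and joins the window.
        window = [w for w in window if not dominates(p, w)]
        window.append(p)
    return window


data = [(60, 3.5), (50, 3.0), (90, 2.0), (80, 1.0), (120, 0.2)]
print(bnl_skyline(data))  # [(50, 3.0), (80, 1.0), (120, 0.2)]
```

The window only holds the final Skyline after the last input point has been read, which is why the approach is blocking.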
The challenge in online scenarios
• Compute the first few Skyline points almost instantaneously
• Compute more and more results incrementally
• “Big picture”: compute Skyline points from the whole range, do not favor points that are good in one dimension
Result ≠ Result: Quality of Results
[Figure: three panels]
  – Complete Skyline
  – Progressive: first 10 points by Index; order of Bitmap depends on insertion order
  – Online: first 10 points by our NN algorithm
The challenge in online scenarios
• Compute the first few Skyline points almost instantaneously
• Compute more and more results incrementally
• “Big picture”: compute Skyline points from the whole range, do not favor points that are good in one dimension
• Do not compute good approximations; return only real Skyline points
• User should be able to state preferences while the algorithm is running → control which Skyline points are produced next
• Universality w.r.t. data sets and type of Skyline queries
The NN algorithm: 2D example
[Figure: x-axis: price [€]; y-axis: distance to the beach [km]; two regions, each labeled RangeNNSearch]
The NN algorithm for 2 dimensions
• Input: data set D, monotonic distance function f
• Additional structures: to-do list T, keeps information about the regions still to be processed
• Algorithm:

    T = {(O, ∞)}
    while (T ≠ ∅) do
        (m_x, m_y) = takeElement(T)
        if (∃ RangeNNSearch(O, D, (m_x, m_y), f)) then
            (n_x, n_y) = RangeNNSearch(O, D, (m_x, m_y), f)
            output n
            T = T ∪ {(n_x, m_y), (m_x, n_y)}
        endif
    endwhile
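The loop above can be sketched in Python. Several details are assumptions made for illustration only: `range_nn_search` is a linear scan rather than an index-based nearest-neighbor search, a region (m_x, m_y) is read as the open box x < m_x, y < m_y with the origin at (0, 0), the to-do list starts with the unbounded region (inf, inf), coordinates are non-negative, and f defaults to the Euclidean distance.

```python
import math


def range_nn_search(data, bound, f):
    """Nearest neighbor of the origin w.r.t. f inside the open box
    x < bound[0], y < bound[1]; None if the box is empty.
    Linear scan stand-in for an index-based search (assumption)."""
    mx, my = bound
    candidates = [p for p in data if p[0] < mx and p[1] < my]
    return min(candidates, key=f) if candidates else None


def nn_skyline_2d(data, f=lambda p: math.hypot(p[0], p[1])):
    """Yield Skyline points one by one, as soon as each is found."""
    todo = [(math.inf, math.inf)]        # to-do list T of regions
    while todo:
        mx, my = todo.pop()              # takeElement(T)
        n = range_nn_search(data, (mx, my), f)
        if n is not None:
            yield n                      # n is a Skyline point
            todo.append((n[0], my))      # region left of n
            todo.append((mx, n[1]))      # region below n


hotels = [(50, 3.0), (80, 1.0), (120, 0.2), (90, 2.0), (60, 3.5)]
print(list(nn_skyline_2d(hotels)))
```

Because the function is a generator, a caller can stop after the first few results, which mirrors the online behavior the slides aim for.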
Correctness of the NN algorithm
• Relationship between Nearest Neighbor (NN) and Skyline
  Given a data set D with origin O and an arbitrary monotonic distance function f, we can state:
  – Observation 1: The Nearest Neighbor NN of O in D w.r.t. f is in the Skyline
  – Observation 2: Given a region R = (O, X), i.e., the box spanned by O and a corner point X, the Nearest Neighbor NN of O in R w.r.t. f is in the Skyline
Extending the NN algorithm to higher dimensionality
• Observations also hold in d-dimensional space
• Modification: the processed region is partitioned into d subregions w.r.t. the NN
[Figure: region partitioning around the nearest neighbor n; axes x, y, z]
• Problem: duplicate Skyline points may occur
  Solutions: post-filtering, merging, propagation, ...
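A d-dimensional variant might look like the sketch below. It handles duplicates with a simple form of post-filtering (a set of already reported points); the slides do not spell out the merging and propagation strategies, so they are not shown. The linear-scan search, the open-box reading of regions, non-negative coordinates, and the squared Euclidean distance are assumptions for illustration.

```python
import math


def nn_skyline(data, f=lambda p: sum(x * x for x in p)):
    """Yield the Skyline of d-dimensional tuples (smaller is better)."""
    if not data:
        return
    d = len(data[0])
    todo = [(math.inf,) * d]   # each entry is the upper corner of a region
    seen = set()               # post-filter for duplicate Skyline points
    while todo:
        bound = todo.pop()
        # Linear-scan range NN search (stand-in for an index, assumption).
        candidates = [p for p in data
                      if all(x < b for x, b in zip(p, bound))]
        if not candidates:
            continue
        n = min(candidates, key=f)
        if n not in seen:
            seen.add(n)
            yield n
        # Partition the region into d subregions w.r.t. n:
        # subregion i caps dimension i at n[i].
        for i in range(d):
            todo.append(bound[:i] + (n[i],) + bound[i + 1:])


points = [(1, 5, 5), (5, 1, 5), (5, 5, 1), (3, 3, 3), (6, 6, 6), (4, 4, 4)]
print(sorted(nn_skyline(points)))
```

In two dimensions the two subregions cannot share a data point, but from three dimensions on the subregions overlap, which is why the `seen` filter is needed here.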
Comparison with other approaches
• Algorithms:
  – Online algorithm: our NN algorithm
  – Blocking algorithms: BNL, D&C
  – Progressive algorithms: Bitmap, Index
• Data sets:
  – Sizes: 100 K points, 1 M points
  – Distributions: correlated, anti-correlated, independent
  – Dimensionality: 2 - 10
Performance in 2D

                  100K points               1M points
                  anti     corr     ind     anti     corr     ind
Skyline size      49       1        12      54       1        12
NN                0.57     0.02     0.2     0.69     0.02     0.5
BNL               1.77     1.65     1.68    17.16    16.24    16.07
D&C               2.63     2.56     2.63    28.65    28.53    28.50
Bitmap            6.09     0.84     1.40    57.12    12.23    17.90
Index (B-tree)    13.86    0.01     0.26    >200     0.12     0.92

(anti = anti-correlated, corr = correlated, ind = independent)

• NN depends on data distribution (→ Skyline size), not on data set size
• BNL, D&C depend on data set size only
• Bitmap depends on data size and data distribution
• Index depends on data distribution and on data size
Performance in higher dimensional spaces
• d < 4: NN is typically the winner in all respects
• d ≥ 4: performance depends on the goal
  – Complete Skyline: BNL and D&C are usually the best choice to compute the complete Skyline
  – Big picture: NN produces the big picture the fastest
  – Data rate: Index produces Skyline points at the highest rate, but always returns “extreme” points first → no big picture
Providing control
• Goal: user should be able to determine the order of Skyline points at run-time
• Order of region processing
  – Influences the “direction”
• Adaptation of the distance function
  – Does not change the Skyline → Observations 1 & 2
  – Influences the order of points
• Bitmap and Index lack control
  – Order of Skyline points is determined by the one-dimensional mapping and the insertion order, respectively
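The effect of adapting the distance function can be illustrated with a weighted sum. This is a sketch under assumptions: the weights, the helper `weighted`, the sample data, and the linear-scan 2D loop are all hypothetical; the slides only state that a different monotonic distance function changes the order of the results, not the Skyline itself.

```python
import math


def nn_skyline_2d(data, f):
    """2D NN loop with a linear-scan range search (assumption)."""
    todo = [(math.inf, math.inf)]
    while todo:
        mx, my = todo.pop()
        candidates = [p for p in data if p[0] < mx and p[1] < my]
        if candidates:
            n = min(candidates, key=f)
            yield n
            todo.append((n[0], my))
            todo.append((mx, n[1]))


def weighted(wx, wy):
    """Monotonic distance function: weighted sum of both coordinates."""
    return lambda p: wx * p[0] + wy * p[1]


hotels = [(50, 3.0), (80, 1.0), (120, 0.2), (90, 2.0), (60, 3.5)]

by_price = list(nn_skyline_2d(hotels, weighted(1, 1)))     # price weighs most
by_beach = list(nn_skyline_2d(hotels, weighted(1, 100)))   # distance weighs most

print(by_price[0])   # (50, 3.0): cheapest Skyline point comes first
print(by_beach[0])   # (120, 0.2): closest to the beach comes first
print(sorted(by_price) == sorted(by_beach))  # True: same Skyline
```

A user who shifts interest from price to beach distance while results are streaming in could thus be served by swapping the weights for the regions still on the to-do list.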
Example: User control
[Figure: x-axis: price [€]; y-axis: distance to the beach [km]; RangeNNSearch shown for two distance functions, DF1 and DF2]
Summary and future work
• Online algorithm for Skyline based on NN-search
• NN algorithm
  – returns first Skyline points instantaneously
  – builds complete Skyline incrementally
  – generates a “big picture” of the Skyline
  – generates only Skyline points → no approximation
  – supports user interaction
  – is universal
• Future work:
  – Main memory, multidimensional indexing for region list
  – Continuous Skyline queries