Ranking– Ordering according to the degree of some fuzzy notions: � Similarity (or dissimilarity) Ranked Query Processing: � Relevance a) Order-based Paradigm � Preference Q Kevin Chen-Chuan Chang ranking 2 Query models for order-based paradigm– When multiple dimensions are available-- On the better-than graph Assume the database stores the information of a set of flights � For each flight � � Better-than graph � Its price t2 � Its route (travel-time or distance traveled) t1 A user would retrieve all the “interesting” flights � � A flight is interesting if and only if there is no other cheaper and t4 t3 shorter (route) at the same time � Best-Matches-Only (BMO) query model � Retrieve maximal elements price y 1 0 � Thus also called maximal vector b e 9 a c � These maximal elements form the “skyline”! 8 7 d 6 g f 5 l h n � On better-than graph, how to process BMO? 4 3 2 i k m 1 x 3 4 o distance 1 2 3 4 5 6 7 8 9 1 0 1
The overall preference combines the Skyline Operation dimensions � P1 LOWEST(price) � Dominance: � a � b � i � c, h � g � d, m � f � n � k, e � l � A point dominates another point if it is no worse in all � P2 LOWEST(distance) dimensions, and better in at least one dimension � k � m, i � h, n � l � f � g � d � c � a � b, e � P :=({price,distance},<P1 ⊗ P2) Distance Price a 1 9 � Skyline: b 2 10 c 4 8 � BMO: Maximal elements of P? � A set of all points in the dataset that are not d 6 7 � Is a maximal? e 9 10 dominated by any other point in the dataset f 7 5 � Is b maximal? g 5 6 h 4 3 � Is c maximal? i 3 2 k 9 1 l 10 4 m 6 2 5 6 n 8 3 Why is it called “skyline”? What is skyline: An example (Also called: Pareto curve, Maximum Vector) � What do you see in the Chicago skyline? � Query: SELECT * FROM flights SKYLINE OF price MIN, distance MIN � What dominates what? � What points constitute the skyline? price y 1 0 b e 9 a c 8 7 d 6 g f 5 l h n 4 3 2 i k m 1 x 7 8 o distance 1 2 3 4 5 6 7 8 9 1 0 2
Skyline Algorithms: We will look at a few examples Block Nested Loop [Börzsönyi et al., 2001] � Block nested loop (BNL) � Conceptually: Nested loop joins— � Divide and Conquer � Joining the table with itself � Compare every pair of points to check dominance � Bitmap � NN Price Distance Price Distance a 1 9 a 1 9 b 2 10 b 2 10 c 4 8 c 4 8 d 6 7 d 6 7 e 9 10 e 9 10 f 7 5 f 7 5 g 5 6 g 5 6 h 4 3 h 4 3 i 3 2 i 3 2 k 9 1 k 9 1 l 10 4 l 10 4 m 6 2 m 6 2 9 n 8 3 n 8 3 10 Block Nested Loop -- Implementation Block Nested Look– Improvements How if the window overflow? � Multi-pass algorithm � One-pass scan: � Scan the table; maintain a window of current skyline points � Scan the table, write any overflow to temp file � Return the window at the end � Scan the temp file; repeat till done Scan Price Distance Skyline Discarded Pass 1 Pass 2 a 1 9 a Scan b 2 10 a b Price Distance Skyline Discarded TempFile c 4 8 a,c a 1 9 a d 6 7 a,c,d b 2 10 a b e 9 10 a ,c,d e c 4 8 a,c f 7 5 a,c,d,f d 6 7 a,c,d Scan g 5 6 a,c,f, g d e 9 10 a ,c,d e h 4 3 a, h c,f,g f 7 5 a,c,d f TempFile i 3 2 a, i h g 5 6 a,c, g d k 9 1 a,i,k h 4 3 a, h c,g � Any problems? l 10 4 a, i ,k l i 3 2 a, i h k 9 1 a,i,k l 10 4 a, i ,k l 11 12 3
Block Nested Look– Improvements However, BNL-based approaches are not How if the window overflow? incremental– Want progressive processing! [Börzsönyi et al., 2001] � Divide and conquer Desired: � Divide all the points into several groups such that each group � Compute the first few Skyline points almost instantaneously fits in memory � Compute more and more results incrementally � Process the groups separately � Merge their results y 10 b e 9 a c � Smart merging possible 8 s1 d 7 s2 � If s3 not empty then disregard s2 6 g f 5 � Use s3 to purge s1, s4 l h n 4 3 s3 s4 2 i k m 1 x o 1 2 3 4 5 6 7 8 9 10 13 14 Bitmap Algorithm: Representation [Tan et. al. Is b = (3, 2, 1) in the skyline? 2001] � For each dimension: � Any point with no-worse values in all dimensions? � 0110 & 0101 & 1111 = 0100 � n distinct values � n bits � Any point with a better value in some dimension? � A value as a bitmap of all no-higher bits = 1 � 0010 | 0001 | 1001 = 1011 d1: price d2: dist d3: rating � Any point satisfying both? 4 3 2 1 3 2 1 2 1 � 0100 & 1011 = 0000 a (1,1,2) 0 0 0 1 0 0 1 1 1 b (3,2,1) 0 1 1 1 0 1 1 0 1 � So, is b = (3,2,1) in the skyline? c (4,1,1) 1 1 1 1 0 0 1 0 1 d (2,3,2) 0 0 1 1 1 1 1 1 1 d1: price d2: dist d3: rating 4 3 2 1 3 2 1 2 1 a (1,1,2) 0 0 0 1 0 0 1 1 1 b (3,2,1) 0 1 1 1 0 1 1 0 1 c (4,1,1) 1 1 1 1 0 0 1 0 1 d (2,3,2) 0 0 1 1 1 1 1 1 1 15 16 4
The Bitmap Algorithm Bitmap Algorithm: Problems � for each point x in DB: � Bitmaps are not dynamic structures � check if x is in skyline � Hard to update � Bitmaps can have prohibitive space overhead � output x if so � How if there are many distinct values? � E.g., How about continuous values? � Incremental indeed; bitmap computation efficient � No focus of directions at all in skyline search � Depend on what points you check first � However, any problem? 17 18 NN – Finding the First Skyline Point [Kossmann et. NN– Are there other skyline points? al. 2002] Start by finding the nearest neighbor of the origin � = + � I.e., the point p = ( x , y ) with the smallest 2 2 dist ( o , p ) x y � Pruning-- What cannot be in the skyline? � How to find NN: Use NN algorithm based on R-tree. � Those dominated by point I � This NN point must be in the skyline � Iteration– What may be in the skyline? � Otherwise? � Non-dominated region 2 and 3 y 10 b e y 10 9 a c b e 9 8 a c 8 7 d 7 d 6 4 3 g f 6 g 5 f l 5 h n l 4 n h 4 3 3 2 i k 2 m i m k 1 2 1 1 x x o o 1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7 8 9 10 19 20 5
NN– Iteratively Process All the “ ToDo” Order-based rank query evaluation-- Still Regions until All Done ongoing research. y 10 � How optimal are these algorithms? Further b e 9 a c i improvement? 8 d 7 6 g 4 3 f 5 l h n y 4 10 3 4 b e 3 � Scale to high dimensionality? 9 a c 2 i m k 8 1 1 2 x 7 a o d 1 2 3 4 5 6 7 8 9 10 6 1 2 g f 5 h n l � Generalize to non-BMO type of aggregations? 4 y 3 10 b e 2 i m k 9 a c 1 x 8 o 1 2 3 4 5 6 7 8 9 10 7 d 6 k g f 5 l n h 4 3 m 2 i 3 k 4 21 22 1 2 1 x o 1 2 3 4 5 6 7 8 9 10 Ranking Query Processing: Thank You! b) Score-based Paradigms Kevin Chen-Chuan Chang 23 6
Ranking– Ordering according to the degree of Relational DBMS scenarios– A brief overview some fuzzy notions: � Similarity (or dissimilarity) Relational DBMS– � Relevance � Value mapping: [Chaudhuri and Gravano, 1999] � Preference � Mapping top-k scores to Boolean selection ranges Q � May have to restart � Cardinality mapping: [Carey and Kossmann, 1997, 1998] � Pushing “limit k” down query tree � May have to restart ranking 25 26 Our Focus: Middleware scenarios Top-k algorithms rely on accesses to evaluate query scores u j To each predicate p i : select h.id , h.address from Hotel h Random access: ra i ( u j ) p 1 : rating ( h.rate ), p 2 : cheap ( h.price ), p 3 : safe ( h.zip )) � order by F=min ( p 1 stop after k=10 Return score of u j for p i � p 1 [ u j ] Sorted access : sa � i Return some next best object and � k=10 its score for p i RDBMS p 1 u 3: .70 Top -k top results p 2 hotels.com u 2: .65 Algorithm u 1: .60 p 3 F=min ( p 1 , p 2 , p 3 ) p 1 apbs.com 27 28 7
Goal: Minimize the “ access” cost An algorithm performs a sequence of accesses: A simple algorithm � Sorted access on P1 then random accesses to P2, P2 RDBMS rating s 1 =3ms, r 1 =20ms Top -k hotels.com cheap Algorithm c:.80 b:.45 a:.30 k=1 s 2 =44ms, r 2 =466ms p 1 RDBMS apbs.com safe a:.8 b:.90 c:.90 c:.80 s 3 = ∞ , r 3 =700ms F, k top result p 2 Sort hotels.com a:.9 b:.70 c:.95 Access costs dominate in “middleware” scenarios p 3 F=min ( p 1 , p 2 , p 3 ) apbs.com � Cost model: aggregate of all access costs 29 30 Assumption: Monotonic scoring functions The Naïve Algorithm � Get all p i [ u ] score for every object u � Monotonic: � f ( x 1 , …, x n ) ≤ f ( x 1 ′ , …, x n ′ ) if x i ≤ x i ′ for all i � e.g., by complete sorted accesses � Compute F [ u ] = F ( p 1 [ u ] ,…, p m [ u ] ) for every u � Why good for query evaluation? � Sort � Gives bounds for pruning data � Return top k � Gives a simple function “surface” to maximize f � Reasonable? � Obviously expensive. Can we do better? � Analogy: Negation rarely used in Boolean queries � Note k is typically small � But, new “function-inference” front -ends also found this to be violated in many cases 31 32 8
Recommend
More recommend