Data Mining
Top-K and Skyline
October 17, 2017
Outline
• Ranking and skyline
• Top-k algorithms
• Skyline algorithms
• Reconciling top-k and skyline
Ranking queries
Who is the best NBA player?
• According to points: Tracy McGrady, score 2003
• According to rebounds: Shaquille O'Neal, score 760
• According to points + rebounds: Tracy McGrady, score 2487
• ……

| Name | Points | Rebounds | Assists | Steals | …… |
|---|---|---|---|---|---|
| Tracy McGrady | 2003 | 484 | 448 | 135 | …… |
| Kobe Bryant | 1819 | 392 | 398 | 86 | …… |
| Shaquille O'Neal | 1669 | 760 | 200 | 36 | …… |
| Yao Ming | 1465 | 669 | 61 | 34 | …… |
| Dwyane Wade | 1854 | 397 | 520 | 121 | …… |
| Steve Nash | 1165 | 249 | 861 | 74 | …… |
| …… | …… | …… | …… | …… | …… |
Ranking queries
Top-k query: Given a dataset D of n objects, a scoring function F (by which the objects in D are ranked), and an integer k, a top-k query returns the k objects with the best scores (ranks) in D.
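The definition above can be sketched in a few lines. This is only an illustration: the helper name `top_k` is hypothetical, the rows are the points and rebounds columns of the NBA table in these slides, and "best" is assumed to mean "highest score".

```python
import heapq

# (name, points, rebounds) rows taken from the NBA table in these slides.
players = [
    ("Tracy McGrady", 2003, 484),
    ("Kobe Bryant", 1819, 392),
    ("Shaquille O'Neal", 1669, 760),
    ("Yao Ming", 1465, 669),
    ("Dwyane Wade", 1854, 397),
    ("Steve Nash", 1165, 249),
]

def top_k(objects, score, k):
    """Return the k objects with the highest score (hypothetical helper)."""
    return heapq.nlargest(k, objects, key=score)

# Scoring function F = points + rebounds, as in the ranking example.
best = top_k(players, lambda p: p[1] + p[2], k=3)
names = [p[0] for p in best]
print(names)
```

Changing the lambda changes F without touching the data, which is exactly the part the user has to supply.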
Similarity queries
k-NN query: Given a dataset D of n objects, a query point q, a distance function F, and an integer k, a k-NN query returns the k objects with the smallest distance to q.
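A minimal k-NN sketch, under stated assumptions: the four points and the query q(41, 125) are borrowed from the patient table that appears later in these slides, the distance function F is assumed to be the unscaled Euclidean distance, and `knn` is a hypothetical helper name.

```python
import heapq
import math

# (age, trestbps) values from the patient example later in these slides.
patients = {"p1": (40, 140), "p2": (39, 120), "p3": (45, 130), "p4": (37, 140)}
q = (41, 125)

def knn(points, query, k, dist=math.dist):
    """Return the ids of the k points closest to query (hypothetical helper)."""
    return heapq.nsmallest(k, points, key=lambda pid: dist(points[pid], query))

nearest = knn(patients, q, k=2)
print(nearest)
```

Because age and blood pressure are on different scales, a different choice of F (for example, a normalized distance) could change the answer, which is the difficulty the next slide raises.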
Problems of top-k and k-NN
In top-k and k-NN queries, the user must provide both the ranking/distance function F and the number of answers k. In many cases it is difficult to define a meaningful ranking/distance function, especially when the attributes have different semantics (e.g., find the cheapest hotel closest to the beach).
Skyline: Hotel Example

| hotel | distance | price |
|---|---|---|
| p1 | 4 | 400 |
| p2 | 24 | 380 |
| p3 | 14 | 340 |
| p4 | 36 | 300 |
| p5 | 26 | 280 |
| p6 | 8 | 260 |
| p7 | 40 | 200 |
| p8 | 20 | 180 |
| p9 | 34 | 140 |
| p10 | 28 | 120 |
| p11 | 16 | 60 |

[Figure: hotels p1 to p11 plotted by distance to the destination (x-axis, 10 to 40) and price (y-axis, 100 to 400)]
Skyline Computation: Challenges and Opportunities
Skyline: Hotel Example

| hotel | distance | price | 0.75*distance + 0.25*price/10 |
|---|---|---|---|
| p1 | 4 | 400 | 13 |
| p2 | 24 | 380 | 27.5 |
| p3 | 14 | 340 | 19 |
| p4 | 36 | 300 | 34.5 |
| p5 | 26 | 280 | 26.5 |
| p6 | 8 | 260 | 12.5 |
| p7 | 40 | 200 | 35 |
| p8 | 20 | 180 | 19.5 |
| p9 | 34 | 140 | 29 |
| p10 | 28 | 120 | 24 |
| p11 | 16 | 60 | 13.5 |

[Figure: hotels p1 to p11 plotted by distance to the destination (x-axis, 10 to 40) and price (y-axis, 100 to 400)]
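The score column can be reproduced with a short sketch. The weights 0.75 and 0.25 and the division of price by 10 come from the slide; the assumption here is that a lower score is better, since both distance and price are costs.

```python
# (distance, price) per hotel, from the hotel table in these slides.
hotels = {
    "p1": (4, 400), "p2": (24, 380), "p3": (14, 340), "p4": (36, 300),
    "p5": (26, 280), "p6": (8, 260), "p7": (40, 200), "p8": (20, 180),
    "p9": (34, 140), "p10": (28, 120), "p11": (16, 60),
}

def score(distance, price):
    """Weighted sum used on the slide: 0.75*distance + 0.25*price/10."""
    return 0.75 * distance + 0.25 * price / 10

# Rank hotels by ascending score (lower is assumed to be better).
ranked = sorted(hotels, key=lambda h: score(*hotels[h]))
top3 = ranked[:3]
print(top3, [score(*hotels[h]) for h in top3])
```

Different weights would produce a different ranking, so the result depends entirely on a function the user has to choose.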
Skyline: Hotel Example
Definition (Skyline). Given a dataset $P$ of $n$ points in $d$-dimensional space, let $p$ and $p'$ be two different points in $P$. We say $p$ dominates $p'$ if $p[i] \le p'[i]$ for all $i$ and $p[i] < p'[i]$ for at least one $i$. The skyline points are those points that are not dominated by any other point in $P$.
In the hotel example, where a shorter distance and a lower price are both preferred, the skyline is {p1, p6, p11}.
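The dominance test and the skyline definition translate directly into code. The following is a naive quadratic sketch, not one of the skyline algorithms from the outline; the function names are hypothetical, and smaller values are assumed to be better in every dimension.

```python
# (distance, price) per hotel, from the hotel table in these slides.
hotels = {
    "p1": (4, 400), "p2": (24, 380), "p3": (14, 340), "p4": (36, 300),
    "p5": (26, 280), "p6": (8, 260), "p7": (40, 200), "p8": (20, 180),
    "p9": (34, 140), "p10": (28, 120), "p11": (16, 60),
}

def dominates(p, q):
    """p dominates q: no worse in every dimension, strictly better in one."""
    return (all(a <= b for a, b in zip(p, q))
            and any(a < b for a, b in zip(p, q)))

def skyline(points):
    """Return the ids of points not dominated by any other point (O(n^2))."""
    return [name for name, p in points.items()
            if not any(dominates(other, p) for other in points.values())]

result = skyline(hotels)
print(result)
```

No scoring function or k is needed here; the only input beyond the data is the preference direction of each attribute.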
Skyline Queries: Patient Similarity Search Example
Table: Sample of heart disease dataset. (a) Original data.

| ID | age | trestbps |
|---|---|---|
| p1 | 40 | 140 |
| p2 | 39 | 120 |
| p3 | 45 | 130 |
| p4 | 37 | 140 |

Query point: q(41, 125)
[Figure: p1 to p4 and q plotted by age (x-axis, 35 to 45) and trestbps (y-axis, 110 to 140)]
Motivating Example: Skyline Queries
Table: Sample of heart disease dataset.

(a) Original data.

| ID | age | trestbps |
|---|---|---|
| p1 | 40 | 140 |
| p2 | 39 | 120 |
| p3 | 45 | 130 |
| p4 | 37 | 140 |

(b) Mapped data.

| ID | age | trestbps |
|---|---|---|
| t1 | 42 | 140 |
| t2 | 43 | 130 |
| t3 | 45 | 130 |
| t4 | 45 | 140 |

Query point: q(41, 125)
[Figure: p1 to p4, t1 to t4, and q plotted by age (x-axis, 35 to 45) and trestbps (y-axis, 110 to 140)]
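The slide does not state the mapping rule, but tables (a) and (b) are consistent with reflecting each point into the upper-right quadrant of q, that is, t[i] = q[i] + |p[i] − q[i]|. The sketch below assumes that rule and then takes the skyline of the mapped points, treating values closer to q as better; all function names are hypothetical.

```python
patients = {"p1": (40, 140), "p2": (39, 120), "p3": (45, 130), "p4": (37, 140)}
q = (41, 125)

def map_to_query(p, query):
    """Assumed mapping: t[i] = q[i] + |p[i] - q[i]| (inferred from the tables)."""
    return tuple(qi + abs(pi - qi) for pi, qi in zip(p, query))

def dominates(p, r):
    """p dominates r: no worse in every dimension, strictly better in one."""
    return (all(a <= b for a, b in zip(p, r))
            and any(a < b for a, b in zip(p, r)))

mapped = {pid: map_to_query(p, q) for pid, p in patients.items()}

# Patients whose mapped point is not dominated by another mapped point.
similar = [pid for pid, t in mapped.items()
           if not any(dominates(other, t) for other in mapped.values())]
print(mapped, similar)
```

Under this assumption, p3 is dominated by p2 and p4 by p1, so p1 and p2 are the patients that no other patient beats on closeness to q in both attributes.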
Skyline Applications
• Recommendation: recommend phones that are as cheap as possible, with as much memory as possible, and as light as possible
• Aggregation/integration: rank results from multiple search engines by relevance score
• Preprocessing for top-k: the skyline contains all candidates for top-1
Skyline for Top-1
[Figure: hotel example, price vs. distance to the destination, same data as the hotel table above]
What about Top-K?
[Figure: hotel example, price vs. distance to the destination, same data as the hotel table above]
Skyline for Top-K
[Figure: hotel example, price vs. distance to the destination, same data as the hotel table above, with a "Lowest Price" annotation]
• Skyline: Pareto top-1 points
• Group skyline: Pareto top-k groups
Group skyline definition: Dominance
Definition (G-Skyline). We say group $G$ dominates group $G'$, denoted by $G \prec_g G'$, if we can find two permutations of the $k$ points for $G$ and $G'$, $G = \{p_{u_1}, p_{u_2}, \ldots, p_{u_k}\}$ and $G' = \{p'_{v_1}, p'_{v_2}, \ldots, p'_{v_k}\}$, such that $p_{u_i} \preceq p'_{v_i}$ for all $i$ ($1 \le i \le k$) and $p_{u_i} \prec p'_{v_i}$ for at least one $i$. The $k$-point G-Skyline consists of those groups with $k$ points that are not g-dominated by any other group of the same size.
[Figure: hotel example, price vs. distance to the destination, same data as the hotel table above]
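Group dominance can be checked by brute force for small k. The sketch below fixes the order of G and tries every permutation of the other group, which covers all pairings in the definition; it is exponential in k and meant only to make the definition concrete. It assumes that p ⪯ p′ means "p dominates or equals p′", and the function names are hypothetical.

```python
from itertools import permutations

def weakly_dominates(p, q):
    """p is no worse than q in every dimension (p dominates or equals q)."""
    return all(a <= b for a, b in zip(p, q))

def dominates(p, q):
    """p dominates q: no worse everywhere, strictly better somewhere."""
    return weakly_dominates(p, q) and any(a < b for a, b in zip(p, q))

def g_dominates(G, H):
    """True if group G g-dominates group H (brute force over pairings)."""
    if len(G) != len(H):
        return False
    for perm in permutations(H):
        pairs = list(zip(G, perm))
        if (all(weakly_dominates(p, q) for p, q in pairs)
                and any(dominates(p, q) for p, q in pairs)):
            return True
    return False

# (distance, price) values from the hotel table in these slides.
p1, p3, p6, p8, p11 = (4, 400), (14, 340), (8, 260), (20, 180), (16, 60)
print(g_dominates([p6, p11], [p3, p8]))   # p6 beats p3, p11 beats p8
print(g_dominates([p1, p6], [p6, p11]))   # no valid pairing exists
```

In the first call the pairing (p6, p3), (p11, p8) satisfies the definition; in the second, p1 cannot be matched with either p6 or p11, so neither group dominates the other.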
Hotel Example
[Figure: hotel example, price vs. distance to the destination, same data as the hotel table above, with a "Lowest Price" annotation]
Outline
• Ranking and skyline
• Top-k algorithms
• Skyline algorithms
• Reconciling top-k and skyline
Introduction: naïve methods
Top-k processing: apply the ranking function F to all objects.
• Unsorted: linearly scan all objects (online)
• Sorted list: sort all objects (offline)
• Priority queue: build the queue (offline), remove the top k (online)
Offline computation needs to know the scoring function!
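The priority-queue variant can be sketched with Python's `heapq`. The split into an "offline" build step and an "online" pop step mirrors the slide; the data is the points column of the NBA table, F is assumed to be "points", and `pop_top_k` is a hypothetical helper.

```python
import heapq

# Points column from the NBA table in these slides; F = points is assumed.
points = {
    "Tracy McGrady": 2003, "Kobe Bryant": 1819, "Shaquille O'Neal": 1669,
    "Yao Ming": 1465, "Dwyane Wade": 1854, "Steve Nash": 1165,
}

# Offline: build a max-heap by negating scores (heapq is a min-heap).
heap = [(-score, name) for name, score in points.items()]
heapq.heapify(heap)

def pop_top_k(heap, k):
    """Online: remove and return the names of the k best objects."""
    return [heapq.heappop(heap)[1] for _ in range(min(k, len(heap)))]

top3 = pop_top_k(heap, 3)
print(top3)
```

The heap is only valid for the F it was built with; ranking by rebounds instead would require rebuilding it, which is the limitation the slide points out.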
Top-k Computation: FA algorithm
Fagin's Algorithm (FA)
R. Fagin, Amnon Lotem, Moni Naor. "Optimal Aggregation Algorithms for Middleware". J. Comput. Syst. Sci. 66(4), pp. 614-656, 2003.
The algorithm is based on two types of accesses:
• Sorted access on attribute $a_i$: retrieves the next object in the sorted list of $a_i$.
• Random access on attribute $a_i$: returns the value of the $i$-th attribute for a specific object identifier.
Top-k Computation
The database can be considered as an n × m score matrix, storing the score values of every object in every attribute.

| a1 | a2 | a3 | a4 | a5 |
|---|---|---|---|---|
| O3, 99 | O1, 91 | O1, 92 | O3, 74 | O3, 67 |
| O1, 66 | O3, 90 | O3, 75 | O1, 56 | O4, 67 |
| O0, 63 | O0, 61 | O4, 70 | O0, 56 | O1, 58 |
| O2, 48 | O4, 07 | O2, 16 | O2, 28 | O2, 54 |
| O4, 44 | O2, 01 | O0, 01 | O4, 19 | O0, 35 |

Note that, for each attribute, scores are sorted in descending order.
Top-k Computation: FA algorithm
Outline of FA
Step 1:
• Read objects from every sorted list using sorted access.
• Stop when k objects have been seen in common from all lists.
Step 2:
• Use random access to find missing scores.
Step 3:
• Compute the scores of the seen objects.
• Return the k highest scored objects.
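The three steps can be sketched as follows, using the score matrix from the earlier slide. Two things are assumptions rather than part of the slides: the aggregate function is taken to be the sum of the attribute scores, and random access is simulated with one dictionary per attribute. The name `fagin_top_k` is hypothetical.

```python
import heapq

# Sorted lists a1..a5 from the score matrix slide (descending per attribute).
lists = [
    [("O3", 99), ("O1", 66), ("O0", 63), ("O2", 48), ("O4", 44)],
    [("O1", 91), ("O3", 90), ("O0", 61), ("O4", 7), ("O2", 1)],
    [("O1", 92), ("O3", 75), ("O4", 70), ("O2", 16), ("O0", 1)],
    [("O3", 74), ("O1", 56), ("O0", 56), ("O2", 28), ("O4", 19)],
    [("O3", 67), ("O4", 67), ("O1", 58), ("O2", 54), ("O0", 35)],
]

def fagin_top_k(lists, k, agg=sum):
    m, n = len(lists), len(lists[0])
    lookup = [dict(lst) for lst in lists]   # simulated random access
    seen = {}                               # object -> lists it was seen in
    complete, depth = 0, 0
    # Step 1: sorted access in parallel until k objects appear in all lists.
    while complete < k and depth < n:
        for i, lst in enumerate(lists):
            obj, _ = lst[depth]
            where = seen.setdefault(obj, set())
            where.add(i)
            if len(where) == m:
                complete += 1
        depth += 1
    # Steps 2 and 3: random access fills in missing scores, then aggregate.
    scores = {obj: agg(lookup[i][obj] for i in range(m)) for obj in seen}
    top = heapq.nlargest(k, scores.items(), key=lambda kv: kv[1])
    return top, depth

top, depth = fagin_top_k(lists, k=2)
print(top, depth)
```

For k = 2 the sorted-access phase stops after three rows, once O3 and O1 have been seen in all five lists. O2 is never touched, while O0 and O4 are seen and must be scored through random access before being discarded.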