top k query processing

Top-K Query Processing, D. Gunopulos



  1. Top-K Query Processing (D. Gunopulos)

     Multimedia Top-K Queries
     The IBM QBIC project (1990s): how to store and index multimedia objects?
     • Multimedia objects can have many attributes with numerical, “fuzzy” values, e.g. Figure 1: 0.7 red, Figure 2: 0.4 red, and likewise for blue, etc.
     • How to find similar objects?

  2. Retrieving Multimedia Objects
     Must address the similarity question.
     • The user specifies the query as a function of the attributes: find the objects most similar to (Red = 50), (Green = 20), (Blue = 30), with Red twice as important as Green or Blue.
     • How to rank the objects?
     • Must find the best (most similar) objects without accessing everything.
     The right model: Top-K queries!

     Data Management & Query Processing Today
     We are living in a world where data is generated all the time and everywhere.
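The weighted similarity query on the "Retrieving Multimedia Objects" slide can be illustrated with a naive baseline that scores every object. This is only a sketch: the image collection, the weighted L1 distance, and the names `IMAGES`, `weighted_distance`, and `top_k_similar` are assumptions made for illustration, not part of the slides. It shows exactly the full scan that Top-K algorithms try to avoid.

```python
import heapq

# Hypothetical image collection: name -> (red, green, blue) attribute values.
IMAGES = {
    "img_a": (52, 18, 30),
    "img_b": (50, 40, 10),
    "img_c": (20, 20, 60),
    "img_d": (45, 25, 30),
}

QUERY = (50, 20, 30)   # (Red = 50), (Green = 20), (Blue = 30) from the slide
WEIGHTS = (2, 1, 1)    # Red counts twice as much as Green or Blue


def weighted_distance(obj, query=QUERY, weights=WEIGHTS):
    """Weighted L1 distance (an assumed metric); smaller means more similar."""
    return sum(w * abs(a - q) for w, a, q in zip(weights, obj, query))


def top_k_similar(images, k):
    """Naive baseline: score every object, then keep the k closest ones."""
    return heapq.nsmallest(k, images, key=lambda name: weighted_distance(images[name]))


print(top_k_similar(IMAGES, 2))
```

Because every object is scored, the cost grows with the size of the collection regardless of K.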

  3. Characteristics of the Applications
     • “Data is generated in a distributed fashion” (e.g. sensor data, file-sharing data, geographically distributed clusters)
     • “Distributed data is often outdated before it is ever used” (e.g. CCTV video traces, Internet ping data, sensor readings, weblogs, RFID tags, …)
     • “Transferring the data to a centralized repository is usually more expensive than storing it locally”

     Motivating Problems
     • “In-situ Data Storage & Retrieval”
       - Data remains in situ (at the generating site).
       - When users want to search for or retrieve some information, they perform on-demand queries.
     • Challenges:
       - Combine different attributes and data sources.
       - Minimize the utilization of the communication medium.
       - Exploit the network and the inherent parallelism of a distributed environment. Focus on ubiquitous hierarchical networks (e.g. P2P and sensor networks).
       - The number of answers might be very large, hence the focus on Top-K queries.

  4. Top-K Query Example
     • Assume that we have a cluster of n = 5 web servers.
     • Each server locally maintains the same m = 5 web pages.
     • When a web page is accessed by a client, the server increases a local hit counter by one.
     • TOP-1 query: “Which web page has the highest number of hits across all servers (i.e. the highest Score($o_i$))?”
     • Score($o_i$) can only be calculated if we combine the hit counts from all 5 servers.
     [Figure: n lists of (URL, local score) entries, m entries per list, summed into a TOTAL SCORE.]
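The following sketch makes the web-server example concrete by summing the local hit counters of all servers. The hit counts are borrowed from the example tables later in the deck; the names `SERVERS` and `naive_top_k` are assumptions for illustration. It is the brute-force baseline, in which every server ships all of its counters.

```python
from collections import Counter

# Local hit counters of the n = 5 servers for the m = 5 pages o0..o4
# (values taken from the example tables later in the deck).
SERVERS = [
    {"o3": 99, "o1": 66, "o0": 63, "o2": 48, "o4": 44},
    {"o1": 91, "o3": 90, "o0": 61, "o4": 7, "o2": 1},
    {"o1": 92, "o3": 75, "o4": 70, "o2": 16, "o0": 1},
    {"o3": 74, "o1": 56, "o2": 56, "o0": 28, "o4": 19},
    {"o3": 67, "o4": 67, "o1": 58, "o2": 54, "o0": 35},
]


def naive_top_k(servers, k):
    """Sum every local counter, then return the k pages with the most hits."""
    total = Counter()
    for local_hits in servers:
        total.update(local_hits)  # Counter.update adds counts per key
    return total.most_common(k)


print(naive_top_k(SERVERS, 1))
```

With these counters the TOP-1 page is o3 with a total score of 405, which matches the TOP-K column of the later examples.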

  5. Top-K Query Processing: Other Applications
     • Collaborative spam detection networks
     • Content distribution networks
     • Information retrieval
     • Sensor networks

     Top-K Query Processing: Setting
     Setting for this talk:
     • Vertical partitioning: independent access for each attribute (or set of attributes)
     • Assume index or sorted access per attribute
     • Centralized or distributed setting

  6. Top-K Query Processing: Setting
     What kinds of queries?
     • Monotone functions of the attributes: $f(x_{11}, x_{21}, x_{31}) \ge f(x_{12}, x_{22}, x_{32})$ if $x_{11} \ge x_{12}$ and $x_{21} \ge x_{22}$ and $x_{31} \ge x_{32}$
     • Typically assume linear functions
     [Figure: tuples plotted in the unit square with axes X1 and X2, corners (0,0), (1,0), (0,1), (1,1), and labelled points O, P, Q, R, T.]

     Presentation Outline
     • Introduction to Top-K Query Processing
     • Centralized techniques
       - Fagin’s Algorithm
       - Optimal algorithms: TA (Threshold Algorithm)
       - Restricted access models: TA-Sorted
       - Probabilistic TA-Sorted
       - Using previous query instantiations: LPTA
     • Distributed techniques
     • Online algorithms for monitoring Top-K results
     • Future work
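The monotonicity condition from the query-setting slide can be checked mechanically on a small grid. This is a minimal sketch: the weights `(3, 2, 1)` and the helper names `linear_score`, `dominates`, and `is_monotone_on` are hypothetical. It illustrates that a linear function with non-negative weights passes the check, while a function that decreases in one attribute fails it.

```python
from itertools import product


def linear_score(x, weights=(3, 2, 1)):
    """Linear aggregation function; the non-negative weights are hypothetical."""
    return sum(w * v for w, v in zip(weights, x))


def dominates(a, b):
    """True if a is at least as large as b in every attribute."""
    return all(ai >= bi for ai, bi in zip(a, b))


def is_monotone_on(f, points):
    """Check f(a) >= f(b) for every pair where a dominates b."""
    return all(f(a) >= f(b) for a, b in product(points, repeat=2) if dominates(a, b))


# Sample points on a coarse grid of the unit cube.
GRID = list(product((0, 0.5, 1), repeat=3))

print(is_monotone_on(linear_score, GRID))
print(is_monotone_on(lambda x: -x[0], GRID))
```

The check only samples a finite grid, so it can refute monotonicity but cannot prove it in general.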

  7. Fagin’s Algorithm [Fagin, PODS’98], [Fagin, Lotem, Naor, PODS’01]
     The first efficient algorithm. Assumes an index per attribute.

     FA Algorithm
     1) Access the n lists in parallel (sorted access).
     2) Stop the sorted access once K objects have been seen in all n lists.
     3) For every object $o_i$ that has been seen but not resolved, perform random accesses to the other lists to find the complete score of $o_i$.
     4) Return the K objects with the highest scores.

     Fagin’s Algorithm (Example)
     Row  v1      v2      v3      v4      v5
     1    o3, 99  o1, 91  o1, 92  o3, 74  o3, 67
     2    o1, 66  o3, 90  o3, 75  o1, 56  o4, 67
     3    o0, 63  o0, 61  o4, 70  o2, 56  o1, 58
     4    o2, 48  o4, 07  o2, 16  o0, 28  o2, 54
     5    o4, 44  o2, 01  o0, 01  o4, 19  o0, 35
     TOP-K: o3, 405 (4.05/5 = .81); o1, 363; o4, 207

     Iteration 1: partial scores for o1, o3.
     Iteration 2: o3 is resolved; partial scores for o1, o4.
     For Top-1, we resolve o1 and o4 with random accesses. For Top-2, we continue with iteration 3.
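The FA steps can be sketched in a few lines of Python over the five example lists. This is a simplified illustration, not the authors' implementation: it assumes every object appears in every list, uses the sum as the aggregation function, and simulates random access with dictionary lookups. The names `LISTS` and `fagin_top_k` are made up for the sketch.

```python
# The five sorted lists v1..v5 from the example, as (object, score) pairs.
LISTS = [
    [("o3", 99), ("o1", 66), ("o0", 63), ("o2", 48), ("o4", 44)],
    [("o1", 91), ("o3", 90), ("o0", 61), ("o4", 7), ("o2", 1)],
    [("o1", 92), ("o3", 75), ("o4", 70), ("o2", 16), ("o0", 1)],
    [("o3", 74), ("o1", 56), ("o2", 56), ("o0", 28), ("o4", 19)],
    [("o3", 67), ("o4", 67), ("o1", 58), ("o2", 54), ("o0", 35)],
]


def fagin_top_k(lists, k):
    """Return (top-k list, number of rows read under sorted access)."""
    n = len(lists)
    seen = {}   # object -> set of list indexes where it was seen
    depth = 0
    # Phase 1: sorted access, one row at a time, until k objects
    # have been seen in all n lists.
    for depth, row in enumerate(zip(*lists), start=1):
        for i, (obj, _) in enumerate(row):
            seen.setdefault(obj, set()).add(i)
        resolved = sum(1 for where in seen.values() if len(where) == n)
        if resolved >= k:
            break
    # Phase 2: random access (simulated by dict lookups) to complete
    # the score of every object seen so far.
    lookup = [dict(lst) for lst in lists]
    totals = {obj: sum(index[obj] for index in lookup) for obj in seen}
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    return ranked[:k], depth


print(fagin_top_k(LISTS, 1))
print(fagin_top_k(LISTS, 2))
```

For K = 1 the sorted access stops after two rows, and for K = 2 after three rows, in line with the iterations described in the example.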

  8. Fagin’s* Threshold Algorithm [Fagin, Lotem, Naor, PODS’01], [Güntzer, Balke, Kießling, VLDB’00], [Nepal, Ramakrishna, ICDE’99]
     Long studied and well understood.
     * Concurrently developed by 3 groups.

     TA Algorithm
     1) Access the n lists in parallel.
     2) When some object $o_i$ is seen, perform random accesses to the other lists to find the complete score of $o_i$.
     3) Do the same for all objects in the current row.
     4) Compute the threshold τ as the sum of the scores in the current row.
     5) The algorithm stops after K objects have been found with a score of at least τ.

     The Threshold Algorithm (Example)
     Row  v1      v2      v3      v4      v5
     1    o3, 99  o1, 91  o1, 92  o3, 74  o3, 67
     2    o1, 66  o3, 90  o3, 75  o1, 56  o4, 67
     3    o0, 63  o0, 61  o4, 70  o2, 56  o1, 58
     4    o2, 48  o4, 07  o2, 16  o0, 28  o2, 54
     5    o4, 44  o2, 01  o0, 01  o4, 19  o0, 35
     TOP-K: o3, 405 (4.05/5 = .81); o1, 363; o4, 207

     Iteration 1: threshold τ = 99 + 91 + 92 + 74 + 67 = 423. Have we found K = 1 objects with a score of at least τ? NO.
     Iteration 2: threshold τ (2nd row) = 66 + 90 + 75 + 56 + 67 = 354. Have we found K = 1 objects with a score of at least τ? YES!
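The TA steps above can be sketched as follows on the same five lists. The sketch assumes the sum as aggregation function and that every object appears in every list; random access is simulated with dictionary lookups, and for brevity it keeps all computed scores in a dictionary rather than a buffer of size K. The names `LISTS` and `threshold_algorithm` are assumptions for illustration.

```python
import heapq

# The five sorted lists v1..v5 from the example, as (object, score) pairs.
LISTS = [
    [("o3", 99), ("o1", 66), ("o0", 63), ("o2", 48), ("o4", 44)],
    [("o1", 91), ("o3", 90), ("o0", 61), ("o4", 7), ("o2", 1)],
    [("o1", 92), ("o3", 75), ("o4", 70), ("o2", 16), ("o0", 1)],
    [("o3", 74), ("o1", 56), ("o2", 56), ("o0", 28), ("o4", 19)],
    [("o3", 67), ("o4", 67), ("o1", 58), ("o2", 54), ("o0", 35)],
]


def threshold_algorithm(lists, k):
    """Return (top-k list, rows read, last threshold tau)."""
    lookup = [dict(lst) for lst in lists]   # stands in for random access
    scores = {}                             # object -> complete score
    best, depth, tau = [], 0, None
    for depth, row in enumerate(zip(*lists), start=1):
        # Resolve every object of the current row with random accesses.
        for obj, _ in row:
            if obj not in scores:
                scores[obj] = sum(index[obj] for index in lookup)
        # Threshold: sum of the scores in the current row.
        tau = sum(score for _, score in row)
        best = heapq.nlargest(k, scores.items(), key=lambda item: item[1])
        # Stop once k objects score at least tau.
        if len(best) == k and best[-1][1] >= tau:
            break
    return best, depth, tau


print(threshold_algorithm(LISTS, 1))
```

For K = 1 the run stops in the second row with τ = 354, reproducing the two iterations of the example.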

  9. Comparison of Fagin’s and Threshold Algorithm
     • TA sees fewer objects than FA
     • TA may perform more random accesses than FA
     • TA requires only bounded buffer space (k), at the expense of more random seeks
     • FA makes use of unbounded buffers

     Optimal algorithms
     Algorithm B is instance optimal over a set of algorithms $\mathbf{A}$ and a set of inputs $\mathbf{D}$ if:
     $B \in \mathbf{A}$ and $\mathrm{Cost}(B, D) = O(\mathrm{Cost}(A, D))$ for all $A \in \mathbf{A}$, $D \in \mathbf{D}$
     Which means that:
     $\mathrm{Cost}(B, D) \le c \cdot \mathrm{Cost}(A, D) + c'$ for all $A \in \mathbf{A}$, $D \in \mathbf{D}$
     Theorem [Fagin et al. 2003]: TA is instance optimal for every monotone aggregation function, over every database (excluding wild guesses).

  10. TA-Sorted
     TA makes random accesses and assumes they are possible and inexpensive. In many situations, random accesses are much more expensive than sequential accesses, or may be difficult to implement.
     TA-Sorted uses sequential access only. It assumes sorted access for each attribute.

     TA-Sorted Algorithm
     1) Access the n lists in parallel.
     2) When some object $o_i$ is seen, update its upper and lower bounds.
     3) The algorithm stops after K objects have been resolved and every other object has an upper bound below their scores.

     The TA-Sorted Algorithm (Example)
     Row  v1      v2      v3      v4      v5
     1    o3, 99  o1, 91  o1, 92  o3, 74  o3, 67
     2    o1, 66  o3, 90  o3, 75  o1, 56  o4, 67
     3    o0, 63  o0, 61  o4, 70  o2, 56  o1, 58
     4    o2, 48  o4, 07  o2, 16  o0, 28  o2, 54
     5    o4, 44  o2, 01  o0, 01  o4, 19  o0, 35
     TOP-K: o3, 405 (4.05/5 = .81); o1, 363

     Iteration 1: o1 = [183, 423], o3 = [240, 423]
     Iteration 2: o1 = [305, 372], o3 = 405, o4 = [67, 354]
     Iteration 3: o1 = 363, o3 = 405, o4 = [137, 317]
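The bound bookkeeping of TA-Sorted can be sketched as follows on the example lists. This is an illustration under assumptions, not the slides' exact procedure: it uses the sum as aggregation function and stops as soon as the K-th best lower bound is at least the upper bound of every other object, including objects not seen yet. That is a slightly weaker condition than "K objects resolved", but it stops at the same rows on this data. The names `LISTS` and `ta_sorted` are made up for the sketch.

```python
# The five sorted lists v1..v5 from the example, as (object, score) pairs.
LISTS = [
    [("o3", 99), ("o1", 66), ("o0", 63), ("o2", 48), ("o4", 44)],
    [("o1", 91), ("o3", 90), ("o0", 61), ("o4", 7), ("o2", 1)],
    [("o1", 92), ("o3", 75), ("o4", 70), ("o2", 16), ("o0", 1)],
    [("o3", 74), ("o1", 56), ("o2", 56), ("o0", 28), ("o4", 19)],
    [("o3", 67), ("o4", 67), ("o1", 58), ("o2", 54), ("o0", 35)],
]


def ta_sorted(lists, k):
    """Return ([(object, lower, upper), ...], rows read), sorted access only."""
    n = len(lists)
    partial = {}   # object -> {list index: score seen in that list}
    top, bounds, depth = [], {}, 0
    for depth, row in enumerate(zip(*lists), start=1):
        last = [score for _, score in row]   # last value read per list
        for i, (obj, score) in enumerate(row):
            partial.setdefault(obj, {})[i] = score
        bounds = {}
        for obj, seen in partial.items():
            lower = sum(seen.values())
            # Unseen lists can contribute at most their last value read.
            upper = lower + sum(last[i] for i in range(n) if i not in seen)
            bounds[obj] = (lower, upper)
        ranked = sorted(bounds, key=lambda o: bounds[o][0], reverse=True)
        top, rest = ranked[:k], ranked[k:]
        # sum(last) bounds the score of any object not seen so far.
        ceiling = max([sum(last)] + [bounds[o][1] for o in rest])
        if len(top) == k and bounds[top[-1]][0] >= ceiling:
            break
    return [(o, *bounds[o]) for o in top], depth


print(ta_sorted(LISTS, 2))
```

For K = 2 the run stops after the third row with o3 = 405 and o1 = 363 both resolved, while o4 is bounded by [137, 317], as in iteration 3 of the example.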

  11. TA-Sorted and Relational Systems [Ilyas, Aref, Elmagarmid, VLDB’03], [Tsaparas, Palpanas, Kotidis, Koudas, Srivastava, ICDE’03], [Bruno, Chaudhuri, Gravano, ICDE’01]
     Sorted access makes it easier and conceptually simpler to integrate Top-K processing into relational database management systems:
     • Ilyas et al. present the NRA-RJ rank-join query operator.
     • Bruno et al. show that Top-K queries can be reduced to multidimensional range queries (using histograms to model the data distribution).
     [Figure: tuples plotted in the unit square with axes X1 and X2, corners (0,0), (1,0), (0,1), (1,1), and labelled points O, P, Q, R, T.]

     Probabilistic TA-Sorted [Theobald, Weikum, Schenkel, VLDB’04]
     TA-Sorted can keep large intermediate results.
     A smart idea: use information about the distribution of the values to eliminate objects that are unlikely to be in the Top-K result.
     The key: compute probabilistic guarantees.

     Probabilistic TA-Sorted
     1) Access the n lists in parallel.
     2) When some object $o_i$ is seen, update its upper and lower bounds.
     3) The algorithm stops after K objects have been resolved and every other object has an upper bound below their scores.
