dw performance optimization i
play

DW Performance Optimization (I) Slides adapted from Torben Bach - PowerPoint PPT Presentation

DW Performance Optimization (I) Slides adapted from Torben Bach Pedersen Overview Choosing Aggregates Maintaining Views Bitmap Indices Aalborg University 2008 - DWML course 2 Aggregates Observations DW queries follow


  1. DW Performance Optimization (I) Slides adapted from Torben Bach Pedersen

  2. Overview • Choosing Aggregates • Maintaining Views • Bitmap Indices Aalborg University 2008 - DWML course 2

  3. Aggregates • Observations � DW queries follow the same “schema” � Aggregate (GROUP-BY) queries • Idea � Compute and store query results in advance (pre-aggregation) ◆ Example: store “total sales per month and product” � Yields large performance improvements (e.g., factor 1000) � No need to store everything – reuse whenever possible ◆ Example: quarterly total can be quickly computed from monthly total • Prerequisites � Tree-structured dimensions with fixed height � Many-to-one relationships from fact to dimensions � Facts mapped to bottom level in all dimensions Aalborg University 2008 - DWML course 3

  4. Aggregate Use Example Sales • Imagine 1 billion sales rows, 1000 products, 100 locations tid pid locid sales 1 1 1 10 • CREATE VIEW TotalSales (pid,locid,total) AS 2 1 1 20 SELECT s.pid,s.locid,SUM(s.sales) 3 2 3 40 FROM Sales s … … … … GROUP BY s.pid,s.locid 1 billion rows • Pre-compute the above view and store the result into a table (e.g., TotalSales with 100,000 rows) • Rewrite the query to use the view TotalSales SELECT p.category,SUM(s.sales) FROM Products p, � Sales s WHERE p.pid=s.pid GROUP BY p.category pid locid sales 1 1 30 ◆ can be rewritten to 2 3 40 SELECT p.category,SUM(t.total) FROM Products p, � TotalSales t WHERE p.pid=t.pid GROUP BY p.category … … … ◆ Query becomes 10,000 times faster! 100,000 rows Aalborg University 2008 - DWML course 4

  5. Pre-Aggregation Choices • Full pre-aggregation: (all combinations of levels) � Fast query response � Takes a lot of space/update time (200-500 times raw data) • No pre-aggregation � Slow query response (for terabytes…) • Practical pre-aggregation: chosen combinations � A good compromise between response time and space use � The approach supported by ROLAP tools ◆ IBM DB2 UDB ◆ Oracle 10g ◆ MS Analysis Services Aalborg University 2008 - DWML course 5

  6. Using Aggregates • The term “aggregate” refers to materialized view • Application: aggregates used via aggregate navigator � A set of aggregates stored in the system (i.e., DBMS) � Given a query, the best aggregate is found , and the query is rewritten to use it � Done by the system, not by the user • Four design goals for aggregate usage � Aggregates stored separately from detail data � “ Shrunk ” dimensions mapped to aggregate facts � Connection between aggregates and detail data known by system � All queries (SQL) refers only detail data • SUM, MIN, MAX, COUNT, AVG functions can be handled � The AVG function requires a trick – Why and how? Aalborg University 2008 - DWML course 6

  7. Choosing Aggregates • Practical pre-aggregation, decide what aggregates to store • Non-trivial (NP-complete) optimization problem � Space use � Update speed � Response time demands � …… • Choose an aggregate if it is considerably smaller than available, usable aggregates (factor 3-5-10) • “Implementing Data Cubes Efficiently” SIGMOD’96 paper � Greedy approach: simple but effective � Commerical DBMS now use similar, but more advanced techniques Aalborg University 2008 - DWML course 7

  8. Data Cube The data cube stores multidimensional GROUP BY relations of tables in data warehouses mc Group By (with total) Aggregate c By color m Red none Sum White Database capable Blue By make and year By make 1991 1992 1993 F 1994 o r Sum d Sum C h e v y Data warehouse Cross Tab capable Chevy Ford By color Red White Blue By make and color By make By color and year Sum Aalborg University 2008 - DWML course 8

  9. Lattice Framework • Each lattice node represents a view / query Dependence relationship: Q1 ≤ Q2 • Q1 is the descendant of Q2 � Q1 can be answered using only the results of Q2 � • In other words, Q1 is dependent on Q2 The ≤ operator imposes a partial ordering on the queries � • Top view = base data (i.e., most detail) • Essentially, the lattice captures dependency relationship among queries / views and can be represented by a lattice graph Aalborg University 2008 - DWML course 9

  10. A Data Cube Example 1. part, supplier, customer (6M rows) Scenario: 2. part, customer (6M) A query asks for the sales of a part 3. part, supplier (0.8M) 4. supplier, customer (6M) a) If view pc is available, need to 5. part (0.2M) 19 M rows total process about 6M rows; OR 6. supplier (0.01M) b) If view p is available, need to 7. customer (0.1M) process about 0.2M rows 8. none (1) (I) – 8 possible views for 3 dimensions. Each view (III) – Picking the right views (IV) – View pc and sc not to materialize � improve gives the total sales as needed � Effective rows psc 6M per that grouping. performance reduced from 19M to 7M pc 6M ps 0.8M sc 6M p 0.2M s 0.01M c 0.1M Problem: Given that we have space S, what views to materialize for minimizing query costs? none 1 (II) – 8 views organized into a Lattice Aalborg University 2008 - DWML course 10

  11. Cost Model Model simple and realistic, Cost of answering a query based on Enable the design and analysis a view and its major assumptions of powerful algorithms Time to answer a query is equal to 1 An illustration: the space occupied by the query from which the query is answered • To answer query Q, choose an ancestor of Q, say, Qa, that has been materialized • We thus need to process the table of Qa All queries are identical to some 2 queries in the given lattice • Cost of answering Q: number of rows in the table of Qa Size S Time T Ratio m Source Experimental validation of the cost From cell itself 2.07 - 1 model: almost linear relationship From view s 2.38 .000031 10,000 between size and running time … From view ps 20.77 .000023 0.8M From view psc 226.23 .000037 6M Aalborg University 2008 - DWML course 11

  12. Greedy Approach • We plan to solve the problem by the greedy approach � Being as greedy as possible in each step! � Now give you a taste of this approach …… • Example: Pay 5 kroner with the smallest number of coins � Suppose we have many coins of 2 kroner, 1 krone, and 50 ores (i.e., 50 cents). • How to be greedy: � Select the largest coin: 2 kroner � Select the largest coin again: 2 kroner � Remaining amount = 5 – 2 – 2 = 1 krone � We cannot select the largest coin again � We choose the next largest coin 1 krone instead • Only 3 coins are needed Aalborg University 2008 - DWML course 12

  13. Greedy Algorithm Greedy algorithm: each time choose the view with the maximum “benefit” Explanation: The Greedy algorithm • Given a data cube lattice with space costs S = {top view} associated with each view for i=1 to k do begin • Always include the top view because it cannot be generated from other views Select a view v not in S such that B(v,S) is maximized; • Given a number k, we are allowed to select k views (in addition to the top view) S = S union {v}; • B(v,S), the benefit of view v (relative to S), end; denotes the amount that v can improve the costs of evaluating views w, including itself return S; • B(v,S) = Σ view w max {Cost(w,S) – Cost(w,v), 0} only positive benefit taken into account Aalborg University 2008 - DWML course 13

  14. Greedy Algorithm – Example 100 a 50 b C 75 30 e Benefits of possible choices in 1 st round 20 d f 40 Round 1 g h 10 1 50 x 5 = 250 b 25 x 5 = 125 c Lattice with space costs 80 x 2 = 160 d e 70 x 3 =210 Initial: top view a is automatically chosen 60 x 2 =120 f 99 x 1 = 99 g 90 x 1 = 90 h • Round 1 : each query answered by a, with cost 100 End of round 1: view b selected • Which queries can be answered by view c? • What is the benefit of view c? Ans: 5*(100-75)=125 • Can we find a better view than view c? How about b? Aalborg University 2008 - DWML course 14

  15. Greedy Algorithm – Example 100 a Round 1: view b selected 50 b C 75 30 e 20 d f 40 Benefits of possible choices at each round Round 1 Round 2 g h 10 50 x 5 = 250 b 1 25 x 5 = 125 c 25 x 2 = 50 Lattice with space costs 80 x 2 = 160 d 30 x 2 = 60 20 x 3 = 60 e 70 x 3 =210 60 x 2 =120 60 + 10 = 70 f • Round 2 : views b, d, e, g, h start with cost 50 99 x 1 = 99 g 49 x 1 = 49 and the other views start with cost 100 90 x 1 = 90 h 40 x 1 = 40 • Which queries can be improved by view f? End of round 2: view f selected • Benefit of view f? Ans: (100-40)+(50-40)=70 • How about the benefit of view c? Aalborg University 2008 - DWML course 15

More recommend