Overview DW Performance Optimization • Choosing aggregates • Maintaining views • Bitmapped indices • Other optimization issues Original slides were written by Torben Bach Pedersen Aalborg University 2007 - DWML course 2 Aggregates Aggregate Use Example • Observations • Imagine 1 billion sales rows, 1000 products, 100 locations � DW queries are simple, follow same “schema” • CREATE VIEW TotalSales (pid,locid,total) AS � Aggregate (GROUP-BY) queries SELECT s.pid,s.locid,SUM(s.sales) • Idea FROM Sales s GROUP BY s.pid,s.locid � Compute and store query results in advance (pre-aggregation) ◆ Example: store “total sales per month and product” • The materialized view has 100,000 rows � Yields large performance improvements (e.g., factor 1000) • Query rewritten to use view � No need to store everything – reuse whenever possible � SELECT p.category,SUM(s.sales) FROM Products p, Sales s ◆ Example: quarterly total can be quickly computed from monthly total WHERE p.pid=s.pid GROUP BY p.category • Prerequisites � Rewritten to � Tree-structured dimensions with fixed height � SELECT p.category,SUM(t.total) FROM Products p, TotalSales t � Many-to-one relationships from fact to dimensions WHERE p.pid=t.pid GROUP BY p.category � Facts mapped to bottom level in all dimensions � Query becomes 10,000 times faster ! � Otherwise, re-use is not possible Aalborg University 2007 - DWML course 3 Aalborg University 2007 - DWML course 4
Pre-Aggregation Choices Using Aggregates • Application: aggregates used via aggregate navigator � Given a query, the best aggregate is found , and the query is • Full pre-aggregation: (all combinations of levels) rewritten to use it � Fast query response � Done by the system, not by the user � Takes a lot of space/update time (200-500 times raw data) ◆ Traditionally done in middleware, e.g., ODBC • No pre-aggregation ◆ Performed directly by DBMS � Slow query response (for terabytes…) • Four design goals for aggregate usage • Practical pre-aggregation: chosen combinations � Aggregates stored separately from detail data � A good compromise between response time and space use � “ Shrunk ” dimensions mapped to aggregate facts • Most (R)OLAP tools now support practical pre-aggregation � Connection between aggregates and detail data known by system � IBM DB2 UDB � All queries (SQL) refers only detail data � Oracle 10g • SUM, MIN, MAX, COUNT, AVG can all be handled � MS Analysis Services � The last one requires little trick – Why? Aalborg University 2007 - DWML course 5 Aalborg University 2007 - DWML course 6 Choosing Aggregates Data Cube The data cube stores multidimensional GROUP BY • Practical pre-aggregation, decide what aggregates to store relations of tables in data warehouses • Non-trivial (NP-complete) optimization problem � Space use Group By (with total) � Update speed Aggregate By color � Response time demands Red Sum � Actual queries White By make and year Database capable Blue � Index and/or aggregates 1 9 2 9 3 Ford 9 4 1 9 9 1 9 9 m Chevy 9 1 1 u By make S • Choose an aggregate if it is considerably smaller than Sum available, usable aggregates (factor 3-5-10) Data warehouse Cross Tab capable • Supported (semi)-automatically by DBMS Chevy Ford By color Red � Oracle, DB2, MS SQL Server White • “Implementing Data Cubes Efficiently” SIGMOD’96 paper By make and color Blue By color and year By make � Greedy approach: simple but effective Sum � Above DBMS now use similar, but more advanced techniques Aalborg University 2007 - DWML course 7 Aalborg University 2007 - DWML course 8
Lattice Framework A Data Cube example (I) – 8 possible 1. part, supplier, customer (6M rows) Scenario: (III) – Picking views for 3 the right views 2. part, customer (6M) dimensions. A query asks for the sales of a part to materialize Each view � improve 3. part, supplier (0.8M) gives the total performance a) If view pc is available, need to sales as per 4. supplier, customer (6M) process about 6M rows; OR • Each lattice node represents a view / query that grouping. 5. part (0.2M) 19 M rows total Dependence relationship: Q1 ≤ Q2 b) If view p is available, need to • 6. supplier (0.01M) process about 0.2M rows Q1 is the descendant of Q2 � 7. customer (0.1M) Q1 can be answered using only the results of Q2 8. none (1) � • In other words, Q1 is dependent on Q2 (IV) – View pc (II) – 8 views Questions: psc 6M and sc not The ≤ operator imposes a partial ordering on the queries organized into � needed � a Lattice a) How many views to Effective rows • Top view = base data (i.e., most detail) materialize to get reasonable reduced from pc 6M ps 0.8M sc 6M performance? - and 19M to 7M • Essentially, the lattice captures dependency relationship among b) Given that we have space S, queries / views and can be represented by a lattice graph p 0.2M s 0.01M c 0.1M what views to materialize to minimize average query none 1 costs? Aalborg University 2007 - DWML course 9 Aalborg University 2007 - DWML course 10 Greedy Algorithm Cost Model Greedy algorithm: each time Model simple and realistic, Cost of answering a query based choose the view with the Enable the design and analysis on a view and it has major maximum “benefit” of powerful algorithms assumptions … Explanation: The Greedy algorithm • Given a data cube lattice with space costs Time to answer a query is equal to An illustration: 1 associated with each view the space occupied by the query from which the query is answered S = {top view} • To answer query Q, choose an ancestor • Always include the top view because it cannot be of Q, say, Qa, that has been materialized generated from other views for i=1 to k do begin All queries are identical to some 2 • We thus need to process the table of Qa queries in the given lattice • Suppose we may only select k number of views in Select that view v not addition to the top view • Cost of answering Q: number of rows in in S such that B(v,S) the table of Qa The clustering of the materialized • The benefit of view v (relative to S), is based on is maximized; 3 query and indexes have not been how v can improve the costs of evaluating views, considered including itself S = S union {v}; • The total benefit of v is the sum over all views w end; Size S Time T Ratio m Source of the benefit of using v to evaluate w, providing that benefit is positive Experimental validation of the cost From cell itself 2.07 - return S; 1 model: almost linear relationship From view s 2.38 .000031 10,000 between size and running time … From view ps 20.77 .000023 0.8M From view psc 226.23 .000037 6M Aalborg University 2007 - DWML course 11 Aalborg University 2007 - DWML course 12
Recommend
More recommend