data warehousing olap motivation business intelligence
play

Data Warehousing & OLAP Motivation: Business Intelligence - PowerPoint PPT Presentation

Data Warehousing & OLAP Motivation: Business Intelligence Customer information Product information (customer-id, gender, age, home- (Product-id, category, address, occupation, income, manufacturer, made-in, stock- family-size, )


  1. Data Warehousing & OLAP

  2. Motivation: Business Intelligence Customer information Product information (customer-id, gender, age, home- (Product-id, category, address, occupation, income, manufacturer, made-in, stock- family-size, … ) price, … ) Sales information (customer-id, product-id, #units, unit-price, sales- representative, … ) Business queries: • Which categories of products are most popular for customers in Vancouver? • Find pairs (customer groups, most popular products) Jian Pei: Big Data Analytics -- Multidimensional Analysis 2

  3. In what aspect is he most similar to cases of coronary artery disease and, at the same time, dissimilar to adiposity? Symptoms: overweight, high blood pressure, back pain, short of breadth, chest pain, cold sweat … J. Pei: Finding Outstanding Aspects and Contrast Subspaces 3

  4. Don’t You Ever Google Yourself? • Big data makes one know oneself better • 57% American adults search themselves on Internet – Good news: those people are better paid than those who haven’t done so! (Investors.com) • Egocentric analysis becomes more and more important with big data J. Pei: Finding Outstanding Aspects and Contrast Subspaces 4

  5. Egocentric Analysis • How am I different from (more often than not, better than) others? • In what aspects am I good? http://img03.deviantart.net/a670/i/2010/219/a/e/glee___egocentric_by_gleeondoodles.jpg J. Pei: Finding Outstanding Aspects and Contrast Subspaces 5

  6. Dimensions • “An aspect or feature of a situation, problem, or thing, a measurable extent of some kind” – Dictionary • Dimensions/attributes are used to model complex objects in a divide-and-conquer manner – Objects are compared in selected dimensions/ attributes • More often than not, objects have too many dimensions/attributes than one is interested in and can handle Jian Pei: Big Data Analytics -- Multidimensional Analysis 6

  7. Multi-dimensional Analysis • Find interesting patterns in multi-dimensional subspaces – “Michael Jordan is outstanding in subspaces (total points, total rebounds, total assists) and (number of games played, total points, total assists)” • Different patterns may be manifested in different subspaces – Feature selection (machine learning and statistics): select a subset of relevant features for use in model construction – a set of features for all objects – Different subspaces may manifest different patterns Jian Pei: Big Data Analytics -- Multidimensional Analysis 7

  8. OLAP • Conceptually, we may explore all possible subspaces for interesting patterns • What patterns are interesting? • How can we explore all possible subspaces systematically and efficiently? • Fundamental problems in analytics and data mining Jian Pei: Big Data Analytics -- Multidimensional Analysis 8

  9. OLAP • Aggregates and group-bys are frequently used in data analysis and summarization SELECT time, altitude, AVG(temp) FROM weather GOUP BY time, altitude; – In TPC, 6 standard benchmarks have 83 queries, aggregates are used 59 times, group-bys are used 20 times • Online analytical processing (OLAP): the techniques that answer multi-dimensional analytical (MDA) queries efficiently Jian Pei: Big Data Analytics -- Multidimensional Analysis 9

  10. OLAP Operations • Roll up (drill-up): summarize data by climbing up hierarchy or by dimension reduction – (Day, Store, Product type, SUM(sales) à (Month, City, *, SUM(sales)) • Drill down (roll down): reverse of roll-up, from higher level summary to lower level summary or detailed data, or introducing new dimensions Jian Pei: Big Data Analytics -- Multidimensional Analysis 10

  11. Roll Up http://www.tutorialspoint.com/dwh/images/rollup.jpg Jian Pei: Big Data Analytics -- Multidimensional Analysis 11

  12. Drill Down http://www.tutorialspoint.com/dwh/images/drill_down.jpg Jian Pei: Big Data Analytics -- Multidimensional Analysis 12

  13. Other Operations • Dice: pick specific values or ranges on some dimensions • Pivot: “rotate” a cube – changing the order of dimensions in visual analysis http://en.wikipedia.org/wiki/File:OLAP_pivoting.png Jian Pei: Big Data Analytics -- Multidimensional Analysis 13

  14. Dice http://www.tutorialspoint.com/dwh/images/dice.jpg Jian Pei: Big Data Analytics -- Multidimensional Analysis 14

  15. Relational Representation • If there are n dimensions, there are 2 n possible aggregation columns Roll up by model by year by color in a table Jian Pei: Big Data Analytics -- Multidimensional Analysis 15

  16. Difficulties • Many group bys are needed – 6 dimensions à 2 6 =64 group bys • In most SQL systems, the resulting query needs 64 scans of the data, 64 sorts or hashes, and a long wait! Jian Pei: Big Data Analytics -- Multidimensional Analysis 16

  17. Dummy Value “ ALL ” Jian Pei: Big Data Analytics -- Multidimensional Analysis 17

  18. DATA CUBE Model Year Color Sales Chevy 1990 blue 62 Chevy 1990 red 5 CUBE Chevy 1990 white 95 Chevy 1990 ALL 154 Chevy 1991 blue 49 Chevy 1991 red 54 Chevy 1991 white 95 Chevy 1991 ALL 198 Chevy 1992 blue 71 Chevy 1992 red 31 Chevy 1992 white 54 Chevy 1992 ALL 156 Chevy ALL blue 182 SALES Chevy ALL red 90 Chevy ALL white 236 Model Year Color Sales Chevy ALL ALL 508 CUBE Chevy 1990 red 5 Ford 1990 blue 63 Chevy 1990 white 87 Ford 1990 red 64 Chevy 1990 blue 62 Ford 1990 white 62 Chevy 1991 red 54 Ford 1990 ALL 189 Ford 1991 blue 55 Chevy 1991 white 95 Ford 1991 red 52 Chevy 1991 blue 49 Ford 1991 white 9 Chevy 1992 red 31 Ford 1991 ALL 116 Chevy 1992 white 54 Ford 1992 blue 39 Chevy 1992 blue 71 Ford 1992 red 27 Ford 1990 red 64 Ford 1992 white 62 Ford 1992 ALL 128 Ford 1990 white 62 Ford ALL blue 157 Ford 1990 blue 63 Ford ALL red 143 Ford 1991 red 52 Ford ALL white 133 Ford 1991 white 9 Ford ALL ALL 433 Ford 1991 blue 55 ALL 1990 blue 125 ALL 1990 red 69 Ford 1992 red 27 ALL 1990 white 149 Ford 1992 white 62 ALL 1990 ALL 343 Ford 1992 blue 39 ALL 1991 blue 106 SELECT Model, Year, Color, SUM(sales) AS Sales ALL 1991 red 104 FROM Sales ALL 1991 white 110 ALL 1991 ALL 314 ALL 1992 blue 110 WHERE Model in {'Ford', 'Chevy'} ALL 1992 red 58 AND Year BETWEEN 1990 AND 1992 ALL 1992 white 116 ALL 1992 ALL 284 GROUP BY CUBE(Model, Year, Color); ALL ALL blue 339 ALL ALL red 233 ALL ALL white 369 ALL ALL ALL 941 Jian Pei: Big Data Analytics -- Multidimensional Analysis 18

  19. Semantics of ALL • ALL is a set – Model.ALL = ALL(Model) = {Chevy, Ford } – Year.ALL = ALL(Year) = {1990,1991,1992} – Color.ALL = ALL(Color) = {red,white,blue} Jian Pei: Big Data Analytics -- Multidimensional Analysis 19

  20. OLTP Versus OLAP OLTP OLAP users clerk, IT professional knowledge worker function day to day operations decision support DB design application-oriented subject-oriented data current, up-to-date, detailed, flat historical, summarized, multidimensional relational Isolated integrated, consolidated usage repetitive ad-hoc access read/write, index/hash on prim. lots of scans key unit of work short, simple transaction complex query # records tens millions accessed #users thousands hundreds DB size 100MB-GB 100GB-TB metric transaction throughput query throughput, response Jian Pei: Big Data Analytics -- Multidimensional Analysis 20

  21. What Is a Data Warehouse? • “ A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management ’ s decision-making process. ” – W. H. Inmon • Data warehousing: the process of constructing and using data warehouses Jian Pei: Big Data Analytics -- Multidimensional Analysis 21

  22. Subject-Oriented • Organized around major subjects, such as customer, product, sales • Focusing on the modeling and analysis of data for decision makers, not on daily operations or transaction processing • Providing a simple and concise view around particular subject issues by excluding data that are not useful in the decision support process Jian Pei: Big Data Analytics -- Multidimensional Analysis 22

  23. Integrated • Integrating multiple, heterogeneous data sources – Relational databases, flat files, on-line transaction records • Data cleaning and data integration – Ensuring consistency in naming conventions, encoding structures, attribute measures, etc. among different data sources • E.g., Hotel price: currency, tax, breakfast covered, etc. – When data is moved to the warehouse, it is converted Jian Pei: Big Data Analytics -- Multidimensional Analysis 23

Recommend


More recommend