Data Warehousing and OLAP
Decision Support Systems • Decision-support systems are used to make business decisions, often based on data collected by online transaction-processing systems • Examples of business decisions: – What items to stock? – What insurance premium to change? – To whom to send advertisements? • Examples of data used for making decisions – Retail sales transaction details – Customer profiles (income, age, gender, etc.) CMPT 354: Database I -- Data Warehousing and OLAP 2
Data and Statistical Analysis • Data analysis tasks are simplified by specialized tools and SQL extensions – Example tasks • For each product category and each region, what were the total sales in the last quarter and how do they compare with the same quarter last year • As above, for each product category and each customer category • Statistical analysis packages (e.g., SAS) can be interfaced with databases CMPT 354: Database I -- Data Warehousing and OLAP 3
Data Analysis and OLAP • Online Analytical Processing (OLAP) – Interactive analysis of data, allowing data to be summarized and viewed in different ways in an online fashion (with negligible delay) • Multidimensional data: data modeled as dimension attributes and measure attributes – Dimension attributes: define the dimensions on which measure attributes (or aggregates thereof) are viewed, e.g. the attributes item_name, color, and size of the sales relation – Measure attributes: can be aggregated upon, e.g., the attribute number of the sales relation CMPT 354: Database I -- Data Warehousing and OLAP 4
Pivot Table • Values for one of the dimension attributes form the row headers • Values for another dimension attribute form the column headers • Other dimension attributes are listed on top • Values in individual cells are (aggregates of) the values of the dimension attributes that specify the cell CMPT 354: Database I -- Data Warehousing and OLAP 5
Relational Representation • Cross-tabs can be represented as relations – The value all is used to represent aggregates – All represents a set – The SQL:1999 standard uses null values in place of all despite confusion with regular null values CMPT 354: Database I -- Data Warehousing and OLAP 6
Data Cubes • A data cube is a multidimensional generalization of a cross-tab • Can have n dimensions • Cross-tabs can be used as views on a data cube CMPT 354: Database I -- Data Warehousing and OLAP 7
Online Analytical Processing • Pivoting: changing the dimensions used in a cross-tab is called • Slicing: creating a cross-tab for fixed values only – Sometimes called dicing, particularly when values for multiple dimensions are fixed • Rollup: moving from finer-granularity data to a coarser granularity • Drill down: The opposite operation - that of moving from coarser-granularity data to finer- granularity data CMPT 354: Database I -- Data Warehousing and OLAP 8
Hierarchies on Dimensions • Enable dimensions be viewed at different levels of detail – Dimension DateTime can be used to aggregate by hour of day, date, day of week, month, quarter or year CMPT 354: Database I -- Data Warehousing and OLAP 9
Cross Tabulation With Hierarchy CMPT 354: Database I -- Data Warehousing and OLAP 10
OLAP Implementation • Multidimensional OLAP (MOLAP) systems – Multidimensional arrays in memory to store data cubes • Relational OLAP (ROLAP) systems – Relational tables to store data cubes • Hybrid OLAP (HOLAP) systems – Store some summaries in memory and store the base data and other summaries in a relational database CMPT 354: Database I -- Data Warehousing and OLAP 11
Extended Aggregation in SQL:1999 • The cube operation computes union of group by’s on every subset of the specified attributes select item-name, color, size, sum(number) from sales group by cube(item-name, color, size) • Compute the union of eight different groupings of the sales relation: { (item-name, color, size), (item- name, color), (item-name, size), (color, size), (item-name), (color), (size), ( ) } • For each grouping, the result contains the null value for attributes not present in the grouping CMPT 354: Database I -- Data Warehousing and OLAP 12
OLTP Versus OLAP OLTP OLAP users clerk, IT professional knowledge worker function day to day operations decision support DB design application-oriented subject-oriented data current, up-to-date, detailed, flat historical, summarized, multidimensional relational Isolated integrated, consolidated usage repetitive ad-hoc access read/write, index/hash on prim. lots of scans key unit of work short, simple transaction complex query # records tens millions accessed #users thousands hundreds DB size 100MB-GB 100GB-TB metric transaction throughput query throughput, response CMPT 354: Database I -- Data Warehousing and OLAP 13
What Is a Data Warehouse? • “A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management’s decision-making process.” – W. H. Inmon • Data warehousing: the process of constructing and using data warehouses CMPT 354: Database I -- Data Warehousing and OLAP 14
Subject-Oriented • Organized around major subjects, such as customer, product, sales • Focusing on the modeling and analysis of data for decision makers, not on daily operations or transaction processing • Providing a simple and concise view around particular subject issues by excluding data that are not useful in the decision support process CMPT 354: Database I -- Data Warehousing and OLAP 15
Integrated • Integrating multiple, heterogeneous data sources – Relational databases, flat files, on-line transaction records • Data cleaning and data integration – Ensure consistency in naming conventions, encoding structures, attribute measures, etc. among different data sources • E.g., Hotel price: currency, tax, breakfast covered, etc. – When data is moved to the warehouse, it is converted CMPT 354: Database I -- Data Warehousing and OLAP 16
Time Variant • The time horizon for the data warehouse is significantly longer than that of operational systems – Operational database: current value data – Data warehouse data: provide information from a historical perspective (e.g., past 5-10 years) • Every key structure in the data warehouse – Contains an element of time, explicitly or implicitly – But the key of operational data may or may not contain “time element” CMPT 354: Database I -- Data Warehousing and OLAP 17
Nonvolatile • A physically separate store of data transformed from the operational environment • Operational update of data does not occur in the data warehouse environment – Does not require transaction processing, recovery, and concurrency control mechanisms – Requires only two operations in data accessing: • initial loading of data and access of data CMPT 354: Database I -- Data Warehousing and OLAP 18
Data Warehousing CMPT 354: Database I -- Data Warehousing and OLAP 19
Collecting Data • Source driven architecture: data sources transmit new information to a warehouse, either continuously or periodically (e.g. at night) • Destination driven architecture: a warehouse periodically requests new information from data sources • Keeping warehouse exactly synchronized with data sources (e.g. using two-phase commit) is too expensive – Usually OK to have slightly out-of-date data at warehouse – Data/updates are periodically downloaded form online transaction processing (OLTP) systems CMPT 354: Database I -- Data Warehousing and OLAP 20
Design Issues • Data cleansing – Correct mistakes in addresses (misspellings, zip code errors), and merge address lists from different sources and purge duplicates • Update propagating – Warehouse schema may be a (materialized) view of schema from data sources • Summarizing data – Raw data may be too large to store on-line – Aggregate values (totals/subtotals) often suffice – Queries on raw data can often be transformed by query optimizer to use aggregate values CMPT 354: Database I -- Data Warehousing and OLAP 21
Warehouse Schemas • Dimension values are usually encoded using small integers and mapped to full values via dimension tables • Resultant schema is called a star schema – More complicated schema structures • Snowflake schema: multiple levels of dimension tables • Constellation: multiple fact tables CMPT 354: Database I -- Data Warehousing and OLAP 22
Data Warehouse Schema CMPT 354: Database I -- Data Warehousing and OLAP 23
Picture from publib.boulder.ibm.com Snowflake Schema A star schema is a snowflake schema where each dimension has only one single dimension table CMPT 354: Database I -- Data Warehousing and OLAP 24
Why Data Mining? • Evolution of database technology – To collect a large amount of data � primitive file processing – To store and query data efficiently � DBMS • New challenges: huge amount of data, how to analyze and understand? – Data mining CMPT 354: Database I -- Data Warehousing and OLAP 25
Recommend
More recommend