Data Warehouse and OLAP Data Warehouse and OLAP Week 5 1
Midterm I Midterm I • Friday, March 4 • Scope – Homework assignments 1 – 4 – Open book
Team Homework Assignment #7 Team Homework Assignment #7 • Read pp. 121 – 139, 146 – 150 of the text book. R d 121 139 146 150 f h b k • Do Examples 3.8, 3.10 and Exercise 3.4 (b) and (c). Prepare for the results of the homework assignment. the results of the homework assignment. • Due date – beginning of the lecture on Friday March11 th .
Topics Topics • Definition of data warehouse • Multidimensional data model • Data warehouse architecture • From data warehousing to data mining
What is Data Warehouse? (1) What is Data Warehouse? (1) • A data warehouse is a repository of information collected from multiple sources, stored under a unified schema, and that usually resides at a single site • A data warehouse is a semantically consistent data store that serves as a physical implementation of a decision support data model and stores the information on which an enterprise need to make strategic decisions
What is Data Warehouse? (2) What is Data Warehouse? (2) • Data warehouses provide on ‐ line analytical Data warehouses provide on line analytical processing (OLAP) tools for the interactive analysis of multidimensional data of varied granularities, which facilitate effective data generalization and data mining • Many other data mining functions, such as association, classification, prediction, and clustering, can be integrated with OLAP operations to enhance b i t t d ith OLAP ti t h interactive mining of knowledge at multiple levels of abstraction abstraction
What is Data Warehouse? (3) What is Data Warehouse? (3) • A decision support database that is maintained separately from the organization’s operational database • “ A data warehouse is a subject ‐ oriented, integrated, time ‐ variant, and nonvolatile collection of data in support of management’s decision ‐ making process [Inm96].”—W. H. Inmon
Data Ware ho use F Data Ware ho use F rame wo rk rame wo rk data mining Figure 1.7 Typical framework of a data warehouse for AllElectronics 8
Data Warehouse is S bj Subject-Oriented O i d • Organized around major subjects, such as customer, product, sales, etc. • Focusing on the modeling and analysis of data for decision makers, not on daily operations or transaction processing • Provide a simple and concise view around particular subject issues by excluding data that are not useful in the decision support process
Data Warehouse is I t Integrated t d • Constructed by integrating multiple, heterogeneous data sources sources – relational databases, flat files, on ‐ line transaction records • Data cleaning and data integration techniques are applied Data cleaning and data integration techniques are applied. – Ensure consistency in naming conventions, encoding structures, attribute measures, etc. among different data sources • E.g., Hotel price: currency, tax, breakfast covered, etc.
Data Warehouse is Time Variant D t W h i Ti V i t • The time horizon for the data warehouse is significantly longer than that of operational systems g p y – Operational database: current value data – Data warehouse data: provide information from a historical perspective (e.g., past 5 ‐ 10 years) • Every key structure in the data warehouse – Contains an element of time, explicitly or implicitly
D t W Data Warehouse is Nonvolatile h i N l til • A physically separate store of data transformed from the operational environment • Operational update of data does not occur in the data warehouse environment – Does not require transaction processing, recovery, and concurrency control mechanisms – Requires only two operations in data accessing: R i l t ti i d t i • initial loading of data and access of data
OL OL T T P vs OL P vs. OL AP AP a ble 3.1 Co mpariso n be twe e n OL T P and OL AP T 13
Why Separate is Data Warehouse Needed? (1) (1) • Why not perform on ‐ line analytical processing directly on operational databases instead of spending additional time operational databases instead of spending additional time and resources to construct a separate data warehouse?
Why Separate is Data Warehouse N Needed? (2) d d? ( ) • High performance for both systems g p y – DBMS— tuned for OLTP: searching for particular records, indexing, hashing, concurrency control, recovery – Warehouse—tuned for OLAP: complex OLAP queries, multidimensional view, consolidation (summarization and aggregation) ti )
Topics Topics • Definition of data warehouse • Multidimensional data model • Data warehouse architecture • From data warehousing to data mining
From Tables and Spreadsheets to D Data Cubes C b • A data warehouse is based on a multidimensional A data warehouse is based on a multidimensional data model • This model views data in the form of a data cube This model views data in the form of a data cube • A data cube allows data to be modeled and viewed in multiple dimensions multiple dimensions
From Tables and Spreadsheets to Data C b Cubes (1) (1) • A data cube is defined by facts and dimensions A data cube is defined by facts and dimensions – Facts are data which data warehouse focus on • Fact tables contain numeric measures (such as Fact tables contain numeric measures (such as dollars_sold) and keys to each of the related dimension tables – Dimensions are perspectives with respect to fact • Dimension tables describe the dimension with attributes. For example, item (item_name, brand, type), or time(day week month quarter year) or time(day, week, month, quarter, year)
19 ure 1.6. F ra agme nts o o f re latio ns F ig fro m m a re latio o nal datab base fo r AllE le c tro nic s
F ro m T able s and Spre adshe e ts t D t C b to Data Cube s (2) (2) dimensions Facts (numerical measures) a ble 3.2 A 2-D vie w o f sale s data fo r AllE le c tro nic s ac c o rding to the T di dime nsio ns time and ite m , whe re the sale s are fro m branc he s lo c ate d in i ti d it h th l f b h l t d i the c ity o f Vanc o uve r. T he me asure displaye d is do llar_so ld (in tho usands). 20
F ro m T able s and Spre adshe e ts t D t C b to Data Cube s (3) (3) a ble 3.3 A 3-D vie w o f sale s data fo r AllE le c tro nic s ac c o rding to the T dime nsio ns time , ite m , and lo c atio n . T he me asure displaye d is do llar_so ld (in tho usands). 21
F ro m T able s and Spre adshe e ts t D t C b to Data Cube s (4) (4) ig ure 3.1 A 3-D data c ube re pre se ntatio n o f the data in T able 3.3, F ac c o rding to the dime nsio ns time , ite m , and lo c atio n . T he me asure displaye d is do llar_so ld (in tho usands). 22
F ro m T able s and Spre adshe e ts t D t C b to Data Cube s (5) (5) ig ure 3.2 A 4-D data c ube re pre se ntatio n, ac c o rding to the dime nsio ns F time , ite m , lo c atio n, and supplie r . T time , ite m , lo c atio n, and supplie r . T he me asure displaye d is do llar so ld (in he me asure displaye d is do llar_so ld (in tho usands). 23
Cuboid Cuboid • A data cube is a lattice of cuboids • The total number of cuboids • The apex cuboid • The base cuboid 24
ig ure 3.14 L attic e o f c ubo ids, making up a 3-D data c ube . E g p ac h F g c ubo id re pre se nts a diffe re nt gro up-by. T he base c ubo id c o ntains the thre e dime nsio ns c ity, ite m, and ye ar. 25
The Curse of Dimensionality The Curse of Dimensionality • How many cuboids are there in a n ‐ dimensional data cube? • How many cuboids are there in a n dimensional data cube • How many cuboids are there in a n ‐ dimensional data cube and each dimension (i) has the number of level, (L i )? 26
Conceptual Modeling of Data W Warehouses h • Modeling data warehouses: dimensions & measures M d li d t h di i & – Star schema : A fact table in the middle connected to a set of dimension tables of dimension tables – Snowflake schema : A refinement of star schema where some dimensional hierarchy is normalized into a set of smaller dimension tables, forming a shape similar to snowflake – Fact constellations : Multiple fact tables share dimension F t t ll ti M lti l f t t bl h di i tables, viewed as a collection of stars, therefore called galaxy schema or fact constellation g y
Star Sc he ma Star Sc he ma time time key _ y item day item_key Sales Fact Table day_of_the_week item_name month brand q quarter time_key time key type type year supplier_type item_key branch_key location branch location_key location_key branch_key dollars_sold street branch_name cit city branch_type unit_sold province_or_street country ig ure 3.4 Star sc he ma o f a data ware ho use fo r sale s. F 28
Sno wflake Sc he ma Sno wflake Sc he ma time supplier time_key item supplier_key supplier key day supplier_type item_key day_of_the_week Sales Fact Table item_name month brand quarter time_key type type year item_key supplier_key branch_key location key location_key branch dollars_sold location branch_key units_sold branch_name location_key city b branch_type h t street city_key city city province_or_street country country ig ure 3.4 Sno wflake sc he ma o f a data ware ho use fo r sale s. F 29
Recommend
More recommend