11/16/2017 CSI 4352, Introduction to Data Mining Chapter 4 , Data W arehouse and OLAP Operations Young-Rae Cho Associate Professor Department of Computer Science Baylor University CSI 4352, Introduction to Data Mining Chapter 4 , Data W arehouse & OLAP Operations Basic Concept of Data W arehouse Data W arehouse Modeling Data W arehouse Architecture Data W arehouse I m plem entation From Data W arehousing to Data Mining 1
11/16/2017 What is Data Warehouse? Data W arehouse ( defined in many different ways ) A decision support database that is maintained separately from the organization’s operational database The support of information processing by providing a solid platform of consolidated, historical data for analysis “A data warehouse is a (1) subject-oriented , (2) integrated , (3) time-variant , and (4) nonvolatile collection of data in support of management’s decision-making process.” — W. H. I nmon Data W arehousing The process of constructing and using data warehouses Data Warehouse – Subject-Oriented Organized around major subjects e.g., customers, products, sales Focusing on the modeling and analysis of data for decision makers, not on daily operations or transaction processing Provide a simple and concise view around particular subject issues Excluding data that are not useful in the decision support process 2
11/16/2017 Data Warehouse – Integrated I ntegrating multiple, heterogeneous data sources Relational databases, flat files, on-line transaction records Apply data cleaning and data integration techniques Ensures consistency in naming conventions, encoding structures, attribute measures, etc. among different data sources Data Warehouse – Time Variant The time horizon of data warehouses is significantly longer than that of operational systems Operational databases have current data values Data warehouses provide information from a historical perspective (e.g., 10-20 years) Time is a key structure in data warehouses Contain the attribute of time (explicitly or implicitly) 3
11/16/2017 Data Warehouse – Nonvolatile A physically separate storage of data transformed from operational databases Operational update of data does not occur Not require transaction processing, recovery, and concurrency control mechanisms Require only two operations, initial loading of data and access of data Data Integration Methods Methods (1) Process to provide uniform interface to multiple data sources → Tradition Database I ntegration (2) Process to combine multiple data sources into coherent storage → Data warehousing Traditional DB I ntegration A query-driven approach Wrappers / mediators on top of heterogeneous data sources Data W arehousing An update-driven approach Combined the heterogeneous data sources in advance Stored them in a warehouse for direct query and analysis 4
11/16/2017 OLTP vs. OLAP OLTP ( on-line transaction processing) Major task of traditional relational DBMS Day-to-day operations: e.g., purchasing, inventory, manufacturing, banking, payroll, registration, accounting, etc. OLAP ( on-line analytical processing) Major task of data warehouse system Data analysis and decision making Distinct Features ( OLTP vs. OLAP) User and system orientation (customers vs. market analysts) Data contents (current, detailed vs. historical, consolidated) Database design (ER + application vs. star + subject) View (current, local vs. evolutionary, integrated) OLTP vs. OLAP Feature OLTP OLAP Characteristic operational processing information processing Orientation transaction analysis Users clerk, DBA knowledge worker (CEO, analyst) Function day-to-day operations decision support DB Design application-oriented subject-oriented Data current, up-to-date historical, integrated, summarized Unit of work short, simple transaction complex query Access read/write/update read-only (lots of scans) 5
11/16/2017 Why Data Warehouse? Perform ance I ssue DBMS: tuned for OLTP e.g., access methods, indexing, concurrency control, recovery Data warehouse: tuned for OLAP e.g., complex queries, multidimensional view, consolidation Data I ssue Decision support requires historical data, consolidated and summarized data, consistent data CSI 4352, Introduction to Data Mining Chapter 4 , Data W arehouse & OLAP Operations Basic Concept of Data W arehouse Data W arehouse Modeling Data W arehouse Architecture Data W arehouse I m plem entation From Data W arehousing to Data Mining 6
11/16/2017 Data Format for Warehouse Dim ensions Multi-dimensional data model Data are stored in the form of a data cube Data Cube A view of multi-dimensions Dimension tables, such as item (item_name, brand, type), or time (day, week, month, quarter, year) Fact table contains measures (such as dollars_sold) and keys to each of the related dimension tables Cuboid Each combination of dimensional spaces in a data cube 0-D cuboid, 1-D cuboid, 2-D cuboid, … , n-D cuboid The Lattice of Cuboids all 0-D (apex) cuboid time item location supplier 1-D cuboids time,location item,location location,supplier time,item 2-D cuboids time,supplier item,supplier time,item,location time,location,supplier 3-D cuboids time,item,supplier item,location,supplier 4-D (base) cuboid time, item, location, supplier 7
11/16/2017 Conceptual Modeling Key of Modeling Data W arehouses Handling dimensions & measures Exam ples Star schema: A fact table in the middle connected to a set of dimension tables Snowflake schema: A refinement of star schema where some dimensional hierarchy is normalized into a set of smaller dimension tables, forming a shape similar to snowflake Fact constellations: Multiple fact tables share dimension tables, viewed as a collection of stars, therefore called galaxy schema or fact constellation Example of Star Schema time item time_key day item_key day_of_the_week Sales Fact Table item_name month brand time_key quarter type year supplier_type item_key branch_key branch location_key location branch_key units_sold location_key branch_name branch_type street dollars_sold city state avg_sales country Measures 8
11/16/2017 Example of Snowflake Schema time item time_key supplier day item_key day_of_the_week Sales Fact Table item_name supplier_key month brand supplier_type time_key quarter type year supplier_key item_key branch_key branch location_key location branch_key units_sold branch_name location_key branch_type street city dollars_sold city_key city_key avg_sales city state Measures country Example of Fact Constellation Shipping Fact Table time item time_key time_key day item_key day_of_the_week item_key Sales Fact Table item_name month brand shipper_key time_key quarter type year supplier_type from_location item_key to_location branch_key branch dollars_cost location_key location branch_key units_shipped units_sold location_key branch_name branch_type street dollars_sold shipper city state avg_sales shipper_key country shipper_name Measures location_key shipper_type 9
11/16/2017 Cube Definition in DMQL Cube Definition (Fact Table) define cube < cube_name> [ < dimension_list> ] : < measure_list> Dim ension Definition (Dimension Table) define dimension < dimension_name> as (< attribute_or_dimension_list> ) Special Case (Shared Dimension Table) define dimension < dimension_name> as < dimension_name_first> in cube < cube_name_first> Star Schema Definition in DMQL Exam ple define cube sales [ time, item, branch, location] : dollars_sold = sum(sales_in_dollars), avg_sales = avg(sales_in_dollars), units_sold = count(* ) define dimension time as (time_key, day, day_of_week, month, quarter, year) define dimension item as (item_key, item_name, brand, type, supplier_type) define dimension branch as (branch_key, branch_name, branch_type) define dimension location as (location_key, street, city, state, country) 10
Recommend
More recommend