chapter 4 data w arehouse and olap operations
play

Chapter 4 , Data W arehouse and OLAP Operations Young-Rae Cho - PDF document

11/16/2017 CSI 4352, Introduction to Data Mining Chapter 4 , Data W arehouse and OLAP Operations Young-Rae Cho Associate Professor Department of Computer Science Baylor University CSI 4352, Introduction to Data Mining Chapter 4 , Data W


  1. 11/16/2017 CSI 4352, Introduction to Data Mining Chapter 4 , Data W arehouse and OLAP Operations Young-Rae Cho Associate Professor Department of Computer Science Baylor University CSI 4352, Introduction to Data Mining Chapter 4 , Data W arehouse & OLAP Operations  Basic Concept of Data W arehouse  Data W arehouse Modeling  Data W arehouse Architecture  Data W arehouse I m plem entation  From Data W arehousing to Data Mining 1

  2. 11/16/2017 What is Data Warehouse?  Data W arehouse ( defined in many different ways )  A decision support database that is maintained separately from the organization’s operational database  The support of information processing by providing a solid platform of consolidated, historical data for analysis  “A data warehouse is a (1) subject-oriented , (2) integrated , (3) time-variant , and (4) nonvolatile collection of data in support of management’s decision-making process.” — W. H. I nmon  Data W arehousing  The process of constructing and using data warehouses Data Warehouse – Subject-Oriented  Organized around major subjects  e.g., customers, products, sales  Focusing on the modeling and analysis of data for decision makers, not on daily operations or transaction processing  Provide a simple and concise view around particular subject issues  Excluding data that are not useful in the decision support process 2

  3. 11/16/2017 Data Warehouse – Integrated  I ntegrating multiple, heterogeneous data sources  Relational databases, flat files, on-line transaction records  Apply data cleaning and data integration techniques  Ensures consistency in naming conventions, encoding structures, attribute measures, etc. among different data sources Data Warehouse – Time Variant  The time horizon of data warehouses is significantly longer than that of operational systems  Operational databases have current data values  Data warehouses provide information from a historical perspective (e.g., 10-20 years)  Time is a key structure in data warehouses  Contain the attribute of time (explicitly or implicitly) 3

  4. 11/16/2017 Data Warehouse – Nonvolatile  A physically separate storage of data transformed from operational databases  Operational update of data does not occur  Not require transaction processing, recovery, and concurrency control mechanisms  Require only two operations, initial loading of data and access of data Data Integration Methods  Methods (1) Process to provide uniform interface to multiple data sources → Tradition Database I ntegration (2) Process to combine multiple data sources into coherent storage → Data warehousing  Traditional DB I ntegration  A query-driven approach  Wrappers / mediators on top of heterogeneous data sources  Data W arehousing  An update-driven approach  Combined the heterogeneous data sources in advance  Stored them in a warehouse for direct query and analysis 4

  5. 11/16/2017 OLTP vs. OLAP  OLTP ( on-line transaction processing)  Major task of traditional relational DBMS  Day-to-day operations: e.g., purchasing, inventory, manufacturing, banking, payroll, registration, accounting, etc.  OLAP ( on-line analytical processing)  Major task of data warehouse system  Data analysis and decision making  Distinct Features ( OLTP vs. OLAP)  User and system orientation (customers vs. market analysts)  Data contents (current, detailed vs. historical, consolidated)  Database design (ER + application vs. star + subject)  View (current, local vs. evolutionary, integrated) OLTP vs. OLAP Feature OLTP OLAP Characteristic operational processing information processing Orientation transaction analysis Users clerk, DBA knowledge worker (CEO, analyst) Function day-to-day operations decision support DB Design application-oriented subject-oriented Data current, up-to-date historical, integrated, summarized Unit of work short, simple transaction complex query Access read/write/update read-only (lots of scans) 5

  6. 11/16/2017 Why Data Warehouse?  Perform ance I ssue  DBMS: tuned for OLTP e.g., access methods, indexing, concurrency control, recovery  Data warehouse: tuned for OLAP e.g., complex queries, multidimensional view, consolidation  Data I ssue  Decision support requires historical data, consolidated and summarized data, consistent data CSI 4352, Introduction to Data Mining Chapter 4 , Data W arehouse & OLAP Operations  Basic Concept of Data W arehouse  Data W arehouse Modeling  Data W arehouse Architecture  Data W arehouse I m plem entation  From Data W arehousing to Data Mining 6

  7. 11/16/2017 Data Format for Warehouse  Dim ensions  Multi-dimensional data model  Data are stored in the form of a data cube  Data Cube  A view of multi-dimensions  Dimension tables, such as item (item_name, brand, type), or time (day, week, month, quarter, year)  Fact table contains measures (such as dollars_sold) and keys to each of the related dimension tables  Cuboid  Each combination of dimensional spaces in a data cube  0-D cuboid, 1-D cuboid, 2-D cuboid, … , n-D cuboid The Lattice of Cuboids all 0-D (apex) cuboid time item location supplier 1-D cuboids time,location item,location location,supplier time,item 2-D cuboids time,supplier item,supplier time,item,location time,location,supplier 3-D cuboids time,item,supplier item,location,supplier 4-D (base) cuboid time, item, location, supplier 7

  8. 11/16/2017 Conceptual Modeling  Key of Modeling Data W arehouses  Handling dimensions & measures  Exam ples  Star schema: A fact table in the middle connected to a set of dimension tables  Snowflake schema: A refinement of star schema where some dimensional hierarchy is normalized into a set of smaller dimension tables, forming a shape similar to snowflake  Fact constellations: Multiple fact tables share dimension tables, viewed as a collection of stars, therefore called galaxy schema or fact constellation Example of Star Schema time item time_key day item_key day_of_the_week Sales Fact Table item_name month brand time_key quarter type year supplier_type item_key branch_key branch location_key location branch_key units_sold location_key branch_name branch_type street dollars_sold city state avg_sales country Measures 8

  9. 11/16/2017 Example of Snowflake Schema time item time_key supplier day item_key day_of_the_week Sales Fact Table item_name supplier_key month brand supplier_type time_key quarter type year supplier_key item_key branch_key branch location_key location branch_key units_sold branch_name location_key branch_type street city dollars_sold city_key city_key avg_sales city state Measures country Example of Fact Constellation Shipping Fact Table time item time_key time_key day item_key day_of_the_week item_key Sales Fact Table item_name month brand shipper_key time_key quarter type year supplier_type from_location item_key to_location branch_key branch dollars_cost location_key location branch_key units_shipped units_sold location_key branch_name branch_type street dollars_sold shipper city state avg_sales shipper_key country shipper_name Measures location_key shipper_type 9

  10. 11/16/2017 Cube Definition in DMQL  Cube Definition (Fact Table)  define cube < cube_name> [ < dimension_list> ] : < measure_list>  Dim ension Definition (Dimension Table)  define dimension < dimension_name> as (< attribute_or_dimension_list> )  Special Case (Shared Dimension Table)  define dimension < dimension_name> as < dimension_name_first> in cube < cube_name_first> Star Schema Definition in DMQL  Exam ple  define cube sales [ time, item, branch, location] : dollars_sold = sum(sales_in_dollars), avg_sales = avg(sales_in_dollars), units_sold = count(* )  define dimension time as (time_key, day, day_of_week, month, quarter, year)  define dimension item as (item_key, item_name, brand, type, supplier_type)  define dimension branch as (branch_key, branch_name, branch_type)  define dimension location as (location_key, street, city, state, country) 10

Recommend


More recommend