Data Warehouse Ali Kamandi Sharif University of Technology Spring 2007 kamandi@ce.sharif.edu
Part 1: Data Warehouse Concepts 2
Data, Data everywhere yet ... I can’t find the data I need � � data is scattered over the network many versions � I can’t understand the data I found � I can’t use the data I found � 3
Data Management � The Difficulties of managing Data : The amount of data increases exponentially with time • Data are scattered throughout organization • An ever-increasing amount of external data needs • Data security, quality, and integrity are critical • 4
Data Sources � Internal Data Sources � Personal Data � External Data Sources 5
What is a Data Warehouse? A single, complete and consistent store of data obtained from a variety of different sources made available to end users in a what they can understand and use in a business context. [Barry Devlin] 6
Why Data Warehousing? ������������� ������������� ���������������������� ���������������������� ����������� ����������� ��������������������� ��������������������� ������������������ ����������������� ������������������ ����������������� ���������������� ����������������������� ���������������� ����������������������� �������� �������� ������������������ ��������������� ������������������ ��������������� ������� ����������������� ���������������������� ������� ����������������� ���������������������� ������������������ �������������������� ������������������ �������������������� ����������������� ����������������� ���������������������� ���������������������� ���������������� ���������������� ������������ 7 ������������
Definition of data warehousing ������������� ��������� A data warehouse is a subject-oriented , integrated , time-variant and non-volatile collection of data in support of management’s decision making process. 8
Characteristics of a Data Warehouse Subject-Oriented. Data are organized by subject and contain information relevant for decision support only . Consistency. Data in different operational databases may be encoded differently . In the data warehouse, though, they will be coded in a consistent manner. Time variant. The data are kept for many years so that they can be used for trends, forecasting, and comparisons over time. Non-volatile. Data are not updated once entered into the warehouse. Multidimensional. Typically the data warehouse uses a multidimensional structure . 9
OLTP vs. OLAP OLTP OLAP Mostly updates � Mostly reads � Many small transactions � Queries long, complex � Mb-Tb of data � Gb-Tb of data � Raw data � Summarized, consolidated � data Clerical users � Decision-makers, analysts � Up-to-date data � as users Consistency, � recoverability critical 10
Part 2: Data Warehouse Design 11
Building a Data Warehouse 12
Relational and Multidimensional Database � Relational databases store data in two – dimensional tables. Multidimensional databases typically store data in arrays, which consist of at least three business dimension. 13
Relational data model based on a single structure of data values in a two � dimensional table CUSTOMER ORDER Ord_no Ord_date Cus_id … Cus_id Cus_name … 001 Robert … 01 02 Dec 02 002 … 002 Lyn … 02 03 Dec 02 Lyn … … … … … … … … 14
A Sample Data Cube Total annual sales Date of TV in U.S.A. Product 2Qtr 1Qtr 3Qtr 4Qtr sum TV U.S.A PC VCR Country sum Canada Mexico sum 15
Multidimensional Data Model � Composed of one fact table and a set of dimension tables . � Dimensional table : each dimension table has a simple table (non-composite) primary key that corresponds exactly to one of the components of the composite key in the fact table. � A multidimensional data model is typically organized around a central theme, like sales, for instance. 16
Conceptual Modeling of Data Warehouses Modeling data warehouses: dimensions & measures Star schema: A fact table in the middle connected to a set � of dimension tables � Snowflake schema: A refinement of star schema where some dimensional hierarchy is normalized into a set of smaller dimension tables, forming a shape similar to snowflake Fact constellations: Multiple fact tables share dimension � tables, viewed as a collection of stars, therefore called 17 galaxy schema or fact constellation
Example of Star Schema time item time_key day item_key day_of_the_week Sales Fact Table item_name month brand time_key quarter type year supplier_type item_key branch_key location branch location_key location_key branch_key street units_sold branch_name city branch_type dollars_sold province_or_street country avg_sales Measures 18
Example of Snowflake Schema time item time_key item_key day supplier Sales Fact Table day_of_the_week item_name supplier_key brand month supplier_type time_key type quarter supplier_key year item_key branch_key location branch location_key location_key branch_key street units_sold branch_name city_key city branch_type dollars_sold city_key avg_sales city province_or_street Measures country 19
Example of Fact Constellation time item Shipping Fact Table time_key day item_key day_of_the_week time_key Sales Fact Table item_name month brand item_key quarter time_key type year supplier_type shipper_key item_key from_location branch_key to_location branch location_key location branch_key dollars_cost location_key units_sold branch_name street units_shipped branch_type dollars_sold city province_or_street avg_sales country shipper Measures shipper_key shipper_name 20 location_key shipper_type
Typical OLAP Operations Roll up (drill-up): summarize data � � by climbing up hierarchy or by dimension reduction Drill down (roll down): reverse of roll-up � � from higher level summary to lower level summary or detailed data, or introducing new dimensions Slice and dice: � � project and select Pivot (rotate): � � reorient the cube, visualization, 3D to series of 2D planes. Other operations � 21
Operations � Rollup: summarize data � e.g., given sales data, summarize sales for last year by product category and region � Drill down: get more details � e.g., given summarized sales as above, find breakup of sales by city within each region, or within the Andhra region 22
More Cube Operations � Slice and dice: select and project � e.g.: Sales of soft-drinks in Andhra over the last quarter � Pivot: change the view of data 23
Design a Warehouse? � Design data warehouse Design data marts � � Design representation (Star schema, …) gathering data � Which data is needed? � Where does it come from? � cleansing, integrating, ... � � querying, reporting, analysis monitoring, administering warehouse � � data mining 24
Data Gathering � Periodic snapshots � Database triggers � Log shipping � Data shipping (replication service) � … 25
Loading Data � Incremental vs. refresh � Off-line vs. on-line � Frequency of loading � At night, 1x a week/month, continuously � Parallel/Partitioned load 26
Part 3: Data Mining 27
Data Mining Concepts Data mining: The process of searching for valuable business information in a large database, data warehouse, or data mart. 28
Text Mining The application of data mining to non- structured or less-structured text files. 29
Web Mining The application of data mining techniques to discover actionable and meaningful patterns form web resources. 30
Some basic operations � Regression � Classification � Clustering / similarity matching � Association rules and variants 31
Recommend
More recommend