Introduction to Data Warehousing and Business Intelligence Slides adapted from Torben Bach Pedersen
Course Structure • Business intelligence Extract knowledge from large amounts of data collected in a � modern enterprise Data warehousing, machine learning � • Purpose Acquire theoretical background in lectures and literature studies � Obtain practical experience on (industrial) tools in a mini-project � Data warehousing : construction of a database with only data analysis purpose Business Machine learning : find Intelligence (BI) patterns automatically in databases Aalborg University 2008 - DWML course 2
Contact Information • Data warehousing � Teacher: Ken (Man Lung YIU) � Office: 3.2.48 Email: mly@cs.aau.dk • Machine learning � Teacher: Thomas D. Nielsen � Office: 2.2.03 Email: tdn@cs.aau.dk • Course homepage: http://www.cs.aau.dk/~tdn/Teaching/DWML08/ � Lecture slides, mini-project, …… Aalborg University 2008 - DWML course 3
Literature for Data Warehousing • No textbook • Books (selected pages available in the class) � The Data Warehouse Lifecycle Toolkit , Kimball et. al., Wiley 1998 � Fundamentals of Data Warehousing , Jarke et. al., Springer Verlag 2003 • Additional references/articles: � To be posted at course homepage Aalborg University 2008 - DWML course 4
Mini-Project and Exam • Mini-project � Performed in groups of ~3 persons � Documented in report of 20 pages � Firm Deadline: April 20 ◆ The homepage also shows the soft deadline of each task • Exam (information from last year) � Individual oral exam, for 20 minutes ◆ 8 minutes of DW questions ◆ 8 minutes of ML questions � Mini-project report as the basis for discussion � Exam also covers theoretical background in lectures and literature � More details at the end of the course Aalborg University 2008 - DWML course 5
Overview • Why Business Intelligence? • Data analysis problems • Data Warehouse (DW) introduction • DW Topics � Multidimensional modeling � ETL � Performance optimization Aalborg University 2008 - DWML course 6
What is Business Intelligence (BI)? • BI is different from Artificial Intelligence (AI) � AI systems make decisions for the users � BI systems help the users make the right decisions, based on available data • Combination of technologies � Data Warehousing (DW) � On-Line Analytical Processing (OLAP) � Data Mining (DM) � …… Aalborg University 2008 - DWML course 7
Why BI Important? • Worldwide BI revenue in 2005 = US$ 5.7 billion � 10% growth each year • The Web makes BI more necessary � Customers do not appear “physically” in the store � Customers can change to other stores more easily • Thus: � Know your customers using data and BI! � Utilize Web logs, analyze customer behavior in more detail than before (e.g., what was not bought?) � Combine web data with traditional customer data Aalborg University 2008 - DWML course 8
Data Analysis Problems • The same data found in many different systems � Example: customer data across different departments � The same concept is defined differently • Heterogeneous sources � Relational DBMS, On-Line Transaction Processing (OLTP) � Unstructured data in files (e.g., MS Excel) and documents (e.g., MS Word) Aalborg University 2008 - DWML course 9
Data Analysis Problems (cont’) • Data is suited for operational systems � Accounting, billing, etc. � Do not support analysis across business functions • Data quality is bad � Missing data, imprecise data, different use of systems • Data are “volatile” � Data deleted in operational systems (6 months) � Data change over time – no historical information Aalborg University 2008 - DWML course 10
Data Warehousing • Solution: new analysis environment (DW) where data are � Subject oriented (versus function oriented) � Integrated (logically and physically) � Time variant (data can always be related to time) � Stable (data not deleted, several versions) � Supporting management decisions (different organization) • A good DW is a prerequisite for successful BI Aalborg University 2008 - DWML course 11
DW: Purpose and Definition • DW is a store of information organized in a unified data model • Data collected from a number of different sources � Finance, billing, web logs, personnel, … • Purpose of a data warehouse (DW): support decision making • Easy to perform advanced analysis � Ad-hoc analysis and reports ◆ We will cover this soon …… � Data mining: discovery of hidden patterns and trends Aalborg University 2008 - DWML course 12
DW Architecture – Data as Materialized Views Existing databases and systems (OLTP) New databases and systems (OLAP) Appl. OLAP DM DB Appl. Data DB DM mining Trans. DW Appl. DB (Global) Data Warehouse Appl. Visua- lization DM DB (Local) Appl. Data Marts DB Analogy: (data) suppliers ↔ warehouse ↔ (data) consumers Aalborg University 2008 - DWML course 13
Function- vs. Subject Orientation Function-oriented systems Subject-oriented systems Appl. D-Appl. DM DB Appl. D-Appl. DB DM Trans. DW Appl. DB All subjects, integrated Appl. D-Appl. DM DB Selected Appl. subjects DB Aalborg University 2008 - DWML course 14
Central DW Architecture Clients • All data in one, central DW • All client queries directly on the central DW • Pros Simplicity � Easy to manage Central � • Cons DW Bad performance due to no � redundancy/ workload distribution Source Source Aalborg University 2008 - DWML course 15
Federated DW Architecture Clients • Data stored in separate data marts, aimed at special departments • Logical DW (i.e., virtual) Finance Distr. Mrktng • Data marts contain detail data mart mart mart • Pros Performance due to distribution � • Cons Logical More complex � DW Source Source Aalborg University 2008 - DWML course 16
Tiered Architecture • Central DW is materialized • Data is distributed to data marts in one or more tiers Milk� 56� 67� Bread� • Only aggregated data in cube tiers Aalborg� 57� 45� 211� Copenhagen� 123� 127� • Data is aggregated/reduced as it 2000� 2001� Milk� 56� 67� Bread� moves through tiers Aalborg� 57� 45� 211� • Pros Copenhagen� 123� 127� Milk� 56� 67� Bread� 2000� 2001� Central Aalborg� 57� 45� 211� Best performance due to � Copenhagen� 123� 127� DW Milk� 56� 67� redundancy and distribution 2000� 2001� Bread� Milk� 56� 67� Aalborg� 57� 45� Bread� 211� • Cons Aalborg� 57� 45� 211� Copenhagen� 123� 127� Copenhagen� 123� 127� 2000� 2001� Most complex 2000� 2001� � Hard to manage � Aalborg University 2008 - DWML course 17
Queries Hard/Infeasible for OLTP • Business analysis � In the past five years , which product is the most profitable? � Which public holiday we have the largest sales? � Which week we have the largest sales? � Does the sales of dairy products increase over time? • Difficult to represent these queries by using SQL � 3 rd query: extract the “week” value using a function ◆ But the user has to learn many transformation functions … � 4 th query: use a “special” table to store IDs of all dairy products, in advance ◆ We have many other product types as well … • The need of multidimensional modeling Aalborg University 2008 - DWML course 18
Multidimensional Modeling • Example: sales of supermarkets • Facts and measures � Each sales record is a fact, and its sales value is a measure • Dimensions � Each sales record is associated with its values of Product , Store , Time � Correlated attributes grouped into the same dimension � easier for analysis tasks Product Type Category Store City County Day Month Year Sales Top Beer Beverage Trøjborg Århus Århus 25 Maj 1997 5.75 Product Store Time Aalborg University 2008 - DWML course 19
Multidimensional Modeling • How do we model the Time dimension? � A tree structure, with multiple levels � Attributes, e.g., holiday, event tid day week month year work … T day Year Week 1 1 1 1 2008 No … Month 2 2 1 1 2008 Yes … Day … … … … … … … • Advantage of this model? � Easy for query (more about this later) • Disadvantage? � Data redundancy (controlled redundancy is acceptable) Aalborg University 2008 - DWML course 20
OLTP vs. OLAP OLTP OLAP Target operational needs business analysis Data small, operational data large, historical data Model normalized denormalized/ multidimensional Query language SQL not unified Queries small large Updates frequent and small infrequent and batch Transactional recovery necessary not necessary Optimized for update operations query operations Aalborg University 2008 - DWML course 21
Recommend
More recommend