Principles of Knowledge Discovery in Data
Fall 2004
Chapter 3: Data Preprocessing
Dr. Osmar R. Zaïane
University of Alberta
Dr. Osmar R. Zaïane, 1999-2004

Summary of Last Chapter
• What is a data warehouse and what is it for?
• What is the multi-dimensional data model?
• What is the difference between OLAP and OLTP?
• What is the general architecture of a data warehouse?
• How can we implement a data warehouse?
• Are there issues related to data cube technology?
• Can we mine data warehouses?

Course Content
• Introduction to Data Mining
• Data warehousing and OLAP
• Data cleaning
• Data mining operations
• Data summarization
• Association analysis
• Classification and prediction
• Clustering
• Web Mining
• Similarity Search
• Other topics if time permits

Chapter 3 Objectives
Realize the importance of data preprocessing for real world data before data mining or construction of data warehouses.
Get an overview of some data preprocessing issues and techniques.

Data Preprocessing Outline
• What is the motivation behind data preprocessing?
• What is data cleaning and what is it for?
• What is data integration and what is it for?
• What is data transformation and what is it for?
• What is data reduction and what is it for?
• What is data discretization?
• How do we generate concept hierarchies?

Motivation
In real world applications data can be inconsistent, incomplete and/or noisy.
Errors can happen:
• Faulty data collection instruments
• Data entry problems
• Human misjudgment during data entry
• Data transmission problems
• Technology limitations
• Discrepancy in naming conventions
Results:
• Duplicated records
• Incomplete data
• Contradictions in data

Motivation (Cont'd)
What happens when the data cannot be trusted? Can the decision be trusted? Decision making is jeopardized.
Better chance to discover useful knowledge when data is clean.
[Figure labels: Data, Decision]

Data Preprocessing
[Figure labels: Data Cleaning, Data Integration, Data Transformation, Data Reduction, Data Warehouse, Data Mining]

Data Cleaning
Real-world application data can be incomplete, noisy, and inconsistent:
• No recorded values for some attributes
• Not considered at time of entry
• Random errors
• …
Data cleaning attempts to:
• Fill in missing values
• Smooth out noisy data
• Correct inconsistencies

Solving Missing Data
• Ignore the tuple with missing values;
• Fill in the missing values manually;
• Use a global constant to fill in missing values (NULL, unknown, etc.);
• Use the attribute mean to fill in missing values of that attribute;
• Use the attribute mean for all samples belonging to the same class to fill in the missing values;
• Infer the most probable value to fill in the missing value.

Smoothing Noisy Data
The purpose of data smoothing is to eliminate noise. This can be done by:
• Binning
• Clustering
• Regression
Data regression consists of fitting the data to a function. A linear regression, for instance, finds the line that fits 2 variables so that one variable can predict the other. More variables can be involved in a multiple linear regression.
[Figure: regression line y = x + 1 in the x-y plane, with a data point (X1, Y1) and the value Y1' on the line]
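
Two of the options on the Solving Missing Data slide (attribute mean, and attribute mean within the same class) can be sketched as follows. This is only an illustration: the function names `fill_with_mean` and `fill_with_class_mean`, the use of `None` to mark a missing value, and the sample data are assumptions, not part of the slides.

```python
from statistics import mean


def fill_with_mean(values):
    """Replace each None with the mean of the observed (non-None) values."""
    observed = [v for v in values if v is not None]
    if not observed:
        # Nothing to average over: return the data unchanged.
        return list(values)
    m = mean(observed)
    return [m if v is None else v for v in values]


def fill_with_class_mean(values, labels):
    """Replace each None with the mean of observed values of the same class."""
    by_class = {}
    for v, c in zip(values, labels):
        if v is not None:
            by_class.setdefault(c, []).append(v)
    class_means = {c: mean(vs) for c, vs in by_class.items()}
    # A class with no observed value at all keeps its None entries.
    return [class_means.get(c) if v is None else v
            for v, c in zip(values, labels)]


if __name__ == "__main__":
    # Hypothetical attribute values and class labels.
    incomes = [30, None, 50, 10, None, 20]
    classes = ["a", "a", "a", "b", "b", "b"]
    print(fill_with_mean(incomes))
    print(fill_with_class_mean(incomes, classes))
```

The class-based variant usually gives a closer estimate when the attribute differs strongly between classes, at the cost of needing class labels for every sample.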

Binning
Binning smooths the data by consulting the value’s neighbourhood.
First, the data is sorted to get the values “in their neighbourhoods”.
Second, the data is distributed in equi-depth bins:
Ex: 4, 8, 15, 21, 21, 24, 25, 28, 34
Bins of depth 3:
Bin1: 4, 8, 15
Bin2: 21, 21, 24
Bin3: 25, 28, 34
Third, process local smoothing (by bin means, bin medians, or bin boundaries).
Smoothing by bin means:
Bin1: 9, 9, 9
Bin2: 22, 22, 22
Bin3: 29, 29, 29
Smoothing by bin boundaries:
Bin1: 4, 4, 15
Bin2: 21, 21, 24
Bin3: 25, 25, 34

Clustering
Data is organized into groups of “similar” values.
Rare values that fall outside these groups are considered outliers and are discarded.

Data Integration
Data analysis may require a combination of data from multiple sources into a coherent data store.
There are many challenges:
• Schema integration: CID ≈ C_number ≈ Cust-id ≈ cust#
• Semantic heterogeneity
• Data value conflicts (different representations or scales, etc.)
• Redundant records
• Redundant attributes (an attribute is redundant if it can be derived from other attributes)
• Correlation analysis: P(A ∧ B) / (P(A) P(B))
  = 1: independent; > 1: positive correlation; < 1: negative correlation.
Metadata is often necessary.
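
The worked example on the Binning slide can be reproduced with a short sketch. The function names are hypothetical, and one detail is an assumption the slides do not state: in boundary smoothing, a value equally distant from both boundaries is sent to the lower one.

```python
def equi_depth_bins(values, depth):
    """Sort the values, then cut them into consecutive bins of `depth` items."""
    ordered = sorted(values)
    return [ordered[i:i + depth] for i in range(0, len(ordered), depth)]


def smooth_by_means(bins):
    """Replace every value in a bin by the mean of that bin."""
    return [[sum(b) / len(b)] * len(b) for b in bins]


def smooth_by_boundaries(bins):
    """Replace every value by the closer of its bin's min and max."""
    smoothed = []
    for b in bins:
        lo, hi = b[0], b[-1]  # bins are sorted, so these are the boundaries
        # Assumption: ties go to the lower boundary.
        smoothed.append([lo if v - lo <= hi - v else hi for v in b])
    return smoothed


if __name__ == "__main__":
    data = [4, 8, 15, 21, 21, 24, 25, 28, 34]
    bins = equi_depth_bins(data, depth=3)
    print(bins)
    print(smooth_by_means(bins))
    print(smooth_by_boundaries(bins))
```

Run on the slide's data, the sketch yields the same bin means (9, 22, 29) and the same boundary-smoothed bins as listed on the slide.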

Data Transformation
Data is sometimes in a form not appropriate for mining. Either the algorithm at hand cannot handle it, the form of the data is not regular, or the data itself is not specific enough.
• Normalization (to compare carrots with carrots)
• Smoothing
• Aggregation (summary operation applied to data)
• Generalization (low level data is replaced with higher level data, using a concept hierarchy)

Normalization
Min-max normalization: linear transformation from v to v'
  v' = (v - min) / (max - min) * (newmax - newmin) + newmin
  Ex: transform $30000 between [10000..45000] into [0..1] → (30 - 10) / 35 * (1 - 0) + 0 = 0.571
Z-score normalization: normalization of v into v' based on the attribute value mean and standard deviation
  v' = (v - Mean) / StandardDeviation
Normalization by decimal scaling: moves the decimal point of v by j positions, where j is the minimum number of positions the decimal point of the absolute maximum value must be moved to make it fall in [0..1]
  v' = v / 10^j
  Ex: if v ranges between -56 and 9976, j = 4 → v' ranges between -0.0056 and 0.9976
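
The three normalization formulas on the Normalization slide can be checked with a minimal sketch. The function names are hypothetical, and the z-score version assumes the population standard deviation (`statistics.pstdev`), since the slides do not say whether the population or the sample deviation is meant.

```python
from statistics import mean, pstdev


def min_max(v, vmin, vmax, new_min=0.0, new_max=1.0):
    """Linearly map v from [vmin, vmax] to [new_min, new_max]."""
    return (v - vmin) / (vmax - vmin) * (new_max - new_min) + new_min


def z_score(values):
    """Centre on the mean and scale by the standard deviation.

    Assumption: population standard deviation; the attribute must not
    be constant, otherwise the deviation is zero.
    """
    m = mean(values)
    sd = pstdev(values)
    return [(v - m) / sd for v in values]


def decimal_scaling(values):
    """Divide by 10^j, with j the smallest integer such that the largest
    absolute value falls in [0..1]. Returns (j, scaled values)."""
    largest = max(abs(v) for v in values)
    j = 0
    while largest / 10 ** j > 1:
        j += 1
    return j, [v / 10 ** j for v in values]


if __name__ == "__main__":
    print(round(min_max(30000, 10000, 45000), 3))   # slide example
    print(z_score([2, 4, 4, 4, 5, 5, 7, 9]))        # hypothetical data
    print(decimal_scaling([-56, 9976]))             # slide example
```

Min-max normalization needs the attribute's range in advance and is sensitive to outliers at either end; z-score normalization is the usual alternative when the true minimum and maximum are unknown.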