Summary of Last Chapter Principles of Knowledge Discovery in Data - PowerPoint PPT Presentation

Principles of Knowledge Discovery in Data
Fall 2004
Chapter 3: Data Preprocessing
Dr. Osmar R. Zaïane
University of Alberta
© Dr. Osmar R. Zaïane, 1999-2004

Summary of Last Chapter
• What is a data warehouse and what is it for?
• What is the multi-dimensional data model?
• What is the difference between OLAP and OLTP?
• What is the general architecture of a data warehouse?
• How can we implement a data warehouse?
• Are there issues related to data cube technology?
• Can we mine data warehouses?

Course Content
• Introduction to Data Mining
• Data warehousing and OLAP
• Data cleaning
• Data mining operations
• Data summarization
• Association analysis
• Classification and prediction
• Clustering
• Web Mining
• Similarity Search
• Other topics if time permits

Chapter 3 Objectives
Realize the importance of data preprocessing for real world data before data mining or construction of data warehouses.
Get an overview of some data preprocessing issues and techniques.

Data Preprocessing Outline
• What is the motivation behind data preprocessing?
• What is data cleaning and what is it for?
• What is data integration and what is it for?
• What is data transformation and what is it for?
• What is data reduction and what is it for?
• What is data discretization?
• How do we generate concept hierarchies?

Motivation
In real world applications data can be inconsistent, incomplete and/or noisy.
Errors can happen:
• Faulty data collection instruments
• Data entry problems
• Human misjudgment during data entry
• Data transmission problems
• Technology limitations
• Discrepancy in naming conventions
Results:
• Duplicated records
• Incomplete data
• Contradictions in data

Motivation (Cont'd)
[Figure labels: Data, Data Warehouse, Data Mining, Decision]
What happens when the data cannot be trusted? Can the decision be trusted? Decision making is jeopardized.
There is a better chance of discovering useful knowledge when the data is clean.

Data Preprocessing
• Data Cleaning
• Data Integration
• Data Transformation
• Data Reduction

Data Cleaning
Real-world application data can be incomplete, noisy, and inconsistent.
No recorded values for some attributes
Not considered at time of entry
Random errors
…
Data cleaning attempts to:
• Fill in missing values
• Smooth out noisy data
• Correct inconsistencies

Solving Missing Data
• Ignore the tuple with missing values;
• Fill in the missing values manually;
• Use a global constant to fill in missing values (NULL, unknown, etc.);
• Use the attribute mean to fill in the missing values of that attribute;
• Use the attribute mean for all samples belonging to the same class to fill in the missing values;
• Infer the most probable value to fill in the missing value.

Smoothing Noisy Data
The purpose of data smoothing is to eliminate noise. This can be done by:
• Binning
• Clustering
• Regression
Data regression consists of fitting the data to a function. A linear regression, for instance, finds the line that fits 2 variables so that one variable can predict the other. More variables can be involved in a multiple linear regression.
[Figure: the line y = x + 1 on x/y axes, with the points Y1 and Y1' marked at X1]
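The two mean-based strategies listed under Solving Missing Data can be sketched in a few lines. This is only an illustration under stated assumptions: missing values are represented as `None`, the function names `fill_with_mean` and `fill_with_class_mean` are hypothetical, and the sample values are made up for the example.

```python
from collections import defaultdict
from statistics import mean


def fill_with_mean(values):
    """Replace None with the mean of the known values of the attribute."""
    known = [v for v in values if v is not None]
    attribute_mean = mean(known)
    return [attribute_mean if v is None else v for v in values]


def fill_with_class_mean(values, labels):
    """Replace None with the mean of the known values in the same class.

    Assumption: every class has at least one known value; otherwise the
    lookup below raises KeyError.
    """
    known_by_class = defaultdict(list)
    for value, label in zip(values, labels):
        if value is not None:
            known_by_class[label].append(value)
    class_means = {label: mean(vs) for label, vs in known_by_class.items()}
    return [
        class_means[label] if value is None else value
        for value, label in zip(values, labels)
    ]


# Hypothetical attribute values with two missing entries, and class labels.
incomes = [30, None, 50, 10, None, 20]
classes = ["a", "a", "a", "b", "b", "b"]

print(fill_with_mean(incomes))
print(fill_with_class_mean(incomes, classes))
```

The class-based variant usually gives a closer estimate when the classes differ markedly in the attribute, at the cost of needing a class label for every tuple.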
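Smoothing by regression can be illustrated with an ordinary least-squares fit of a line to two variables, after which each observed y is replaced by the value on the fitted line. The slides do not say which fitting method is meant, so least squares is an assumption here, as are the function names `fit_line` and `smooth_by_regression` and the sample points.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx  # assumes the x values are not all identical
    intercept = mean_y - slope * mean_x
    return slope, intercept


def smooth_by_regression(xs, ys):
    """Replace each y with the value predicted by the fitted line."""
    slope, intercept = fit_line(xs, ys)
    return [slope * x + intercept for x in xs]


# Hypothetical noisy observations scattered around a straight line.
xs = [0, 1, 2, 3, 4]
ys = [1.2, 1.9, 3.1, 3.8, 5.0]

print(fit_line(xs, ys))
print(smooth_by_regression(xs, ys))
```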

Binning
Binning smooths the data by consulting each value's neighbourhood.
First, the data is sorted to get the values "in their neighbourhoods".
Second, the data is distributed into equi-depth bins (the same number of values in each bin):
Ex: 4, 8, 15, 21, 21, 24, 25, 28, 34
Bins of depth 3:
Bin1: 4, 8, 15
Bin2: 21, 21, 24
Bin3: 25, 28, 34
Third, process local smoothing. Three methods are named: smoothing by bin medians, by bin means, and by bin boundaries. Results are shown for the last two.
Smoothing by bin means:
Bin1: 9, 9, 9
Bin2: 22, 22, 22
Bin3: 29, 29, 29
Smoothing by bin boundaries:
Bin1: 4, 4, 15
Bin2: 21, 21, 24
Bin3: 25, 25, 34

Clustering
Data is organized into groups of "similar" values. Rare values that fall outside these groups are considered outliers and are discarded.

Data Integration
Data analysis may require a combination of data from multiple sources into a coherent data store.
There are many challenges:
• Schema integration: CID ≈ C_number ≈ Cust-id ≈ cust#
• Semantic heterogeneity
• Data value conflicts (different representations or scales, etc.)
• Redundant records
• Redundant attributes (an attribute is redundant if it can be derived from other attributes)
• Correlation analysis: $\frac{P(A \wedge B)}{P(A)\,P(B)}$, where a value of 1 means independent, > 1 means positive correlation, and < 1 means negative correlation.
Metadata is often necessary.
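The binning example on the nine values above can be reproduced with a short sketch. The helper names are hypothetical, and one detail is an assumption not stated on the slide: in smoothing by bin boundaries, a value equally distant from both boundaries is moved to the lower one.

```python
from statistics import mean, median


def equal_depth_bins(values, depth):
    """Sort the values and cut them into consecutive bins of `depth` items."""
    ordered = sorted(values)
    return [ordered[i:i + depth] for i in range(0, len(ordered), depth)]


def smooth_by_means(bins):
    """Replace every value in a bin with the bin mean."""
    return [[mean(b)] * len(b) for b in bins]


def smooth_by_medians(bins):
    """Replace every value in a bin with the bin median."""
    return [[median(b)] * len(b) for b in bins]


def smooth_by_boundaries(bins):
    """Replace every value with the closer of the bin's min and max.

    Assumption: ties go to the lower boundary.
    """
    smoothed = []
    for b in bins:
        low, high = b[0], b[-1]  # bins are sorted, so these are min and max
        smoothed.append([low if v - low <= high - v else high for v in b])
    return smoothed


data = [4, 8, 15, 21, 21, 24, 25, 28, 34]
bins = equal_depth_bins(data, 3)

print(bins)
print(smooth_by_means(bins))
print(smooth_by_medians(bins))
print(smooth_by_boundaries(bins))
```

On this data the bin means come out as 9, 22 and 29, and the boundary smoothing gives 4, 4, 15 / 21, 21, 24 / 25, 25, 34. The bin medians would be 8, 21 and 28.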
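The correlation measure P(A ∧ B)/(P(A)P(B)) listed under Data Integration can be estimated from record counts. The sketch below assumes each record is a set of the events observed in it, and estimates probabilities as relative frequencies; the function name `correlation_ratio` and the sample records are hypothetical.

```python
def correlation_ratio(records, a, b):
    """Estimate P(A and B) / (P(A) * P(B)) from a list of sets.

    Assumption: both a and b occur in at least one record; otherwise the
    division below fails with ZeroDivisionError.
    """
    n = len(records)
    p_a = sum(1 for r in records if a in r) / n
    p_b = sum(1 for r in records if b in r) / n
    p_ab = sum(1 for r in records if a in r and b in r) / n
    return p_ab / (p_a * p_b)


# Hypothetical records: A and B co-occur exactly as often as chance predicts.
independent = [{"A", "B"}, {"A"}, {"B"}, set()]
# Hypothetical records: B only ever appears together with A.
positive = [{"A", "B"}, {"A", "B"}, {"A"}, {"C"}]

print(correlation_ratio(independent, "A", "B"))
print(correlation_ratio(positive, "A", "B"))
```

A ratio close to 1 for two attributes suggests neither can be derived from the other, while a ratio far above 1 is a hint that one of them may be redundant.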

Data Transformation
Data is sometimes in a form not appropriate for mining. Either the algorithm at hand cannot handle it, the form of the data is not regular, or the data itself is not specific enough.
• Normalization (to compare carrots with carrots)
• Smoothing
• Aggregation (summary operation applied to data)
• Generalization (low level data is replaced with higher level data, using a concept hierarchy)

Normalization
Min-max normalization: linear transformation from v to v'
$v' = \frac{v - min}{max - min}\,(newmax - newmin) + newmin$
Ex: transform 30000 dollars from the range [10000..45000] into [0..1]:
$v' = \frac{30 - 10}{35}\,(1) + 0 \approx 0.571$
Z-score normalization: normalizes v into v' based on the attribute value mean and standard deviation
$v' = \frac{v - \mathit{Mean}}{\mathit{StandardDeviation}}$
Normalization by decimal scaling: moves the decimal point of v by j positions, where j is the minimum number of positions needed to make the absolute maximum value fall in [0..1].
$v' = \frac{v}{10^{j}}$
Ex: if v ranges between -56 and 9976, then j = 4 and v' ranges between -0.0056 and 0.9976.
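The three normalization schemes can be checked against the slide's examples with a small sketch. The function names are hypothetical. Two choices are assumptions rather than something the slides specify: the z-score uses the population standard deviation, and decimal scaling picks the smallest j for which the largest absolute value drops strictly below 1.

```python
from statistics import mean, pstdev


def min_max(v, old_min, old_max, new_min=0.0, new_max=1.0):
    """Linearly map v from [old_min, old_max] to [new_min, new_max]."""
    return (v - old_min) / (old_max - old_min) * (new_max - new_min) + new_min


def z_scores(values):
    """Centre on the mean and scale by the standard deviation.

    Assumption: population standard deviation (pstdev), not the sample one.
    """
    m = mean(values)
    sd = pstdev(values)
    return [(v - m) / sd for v in values]


def decimal_scaling(values):
    """Return (j, scaled values) with every |v / 10**j| below 1."""
    max_abs = max(abs(v) for v in values)
    j = 0
    while max_abs / 10 ** j >= 1:
        j += 1
    return j, [v / 10 ** j for v in values]


print(round(min_max(30000, 10000, 45000), 3))
print(decimal_scaling([-56, 9976]))
print(z_scores([2, 4, 4, 4, 5, 5, 7, 9]))
```

For the min-max example, (30000 - 10000) / 35000 is 20/35, which rounds to 0.571. For the decimal scaling example, the sketch finds j = 4, matching the slide.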
