

  1. Data Mining: Data Preprocessing
     Hamid Beigy, Sharif University of Technology, Fall 1396

  2. Table of contents
     1. Introduction
     2. Data preprocessing
     3. Data cleaning
     4. Data integration
     5. Data transformation
     6. Reading

  3. Outline: Introduction, Data preprocessing, Data cleaning, Data integration, Data transformation, Reading

  4. Data mining process
     Real-world databases are highly susceptible to noisy, missing, and inconsistent data due to their typically huge size and their likely origin from multiple, heterogeneous sources. Low-quality data will lead to low-quality mining results. Data have quality if they satisfy the requirements of the intended use. Factors comprising data quality are:
     - Accuracy (the data contain no errors)
     - Completeness (all attributes of interest are filled in)
     - Consistency
     - Timeliness
     - Believability
     - Interpretability

  5. Outline: Introduction, Data preprocessing, Data cleaning, Data integration, Data transformation, Reading

  6. Data preprocessing
     How can the data be preprocessed in order to improve the quality of the data and, consequently, of the mining results? How can the data be preprocessed so as to improve the efficiency and ease of the mining process? There are several data preprocessing techniques:
     - Data cleaning can be applied to remove noise and correct inconsistencies in the data.
     - Data integration merges data from multiple sources into a coherent data store, such as a data warehouse.
     - Data reduction can reduce data size by aggregating, eliminating redundant features, or clustering.
     - Data transformations (e.g., normalization) may be applied, where data are scaled to fall within a smaller range such as 0.0 to 1.0. This can improve the accuracy and efficiency of mining algorithms involving distance measurements.
     These techniques are not mutually exclusive; they may work together.

  7. Outline: Introduction, Data preprocessing, Data cleaning, Data integration, Data transformation, Reading

  8. Data cleaning
     Data cleaning routines attempt to clean the data by:
     - Filling in missing values
     - Smoothing out noisy data
     - Identifying or removing outliers
     - Correcting inconsistencies in the data (for example, the attribute for customer identification may be referred to as customer-id in one data store and cust-id in another)

  9. Filling missing values
     In real-world data, many tuples have no recorded value for several attributes. How can you go about filling in the missing values for an attribute?
     - Ignore the tuple.
     - Fill in the missing value manually.
     - Use a global constant, such as unknown or −∞, to fill in the missing value.
     - Use a measure of central tendency for the attribute (e.g., the mean or median) to fill in the missing value.
     - Use the attribute mean or median of all samples belonging to the same class as the given tuple.
     - Use the most probable value to fill in the missing value (using regression, inference-based tools such as a Bayesian formalism, or decision tree induction).
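The central-tendency strategy above can be sketched in a few lines of Python; this is a minimal illustration, and the function name is mine rather than from the slides:

```python
from statistics import mean, median

def fill_missing(values, strategy="mean"):
    """Replace None entries with the mean or median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed) if strategy == "mean" else median(observed)
    return [fill if v is None else v for v in values]

prices = [4, None, 8, 15, None, 21]
fill_missing(prices)  # None -> mean(4, 8, 15, 21) = 12
```

In practice one would compute the fill value per class (the fifth strategy above) rather than globally, since a single global mean can bias attributes whose distribution differs across classes.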

  10. Smoothing out noisy data
     What is noise? Noise is a random error or variance in a measured variable. Given a numeric attribute, how can we smooth out the data to remove the noise?
     Binning: binning methods smooth a sorted data value by consulting its neighborhood.
     - Data partitioning: equal-frequency versus equal-width
     - Smoothing methods: smoothing by bin means versus bin medians and bin boundaries
     Example. Sorted data for price (in dollars): 4, 8, 15, 21, 21, 24, 25, 28, 34
     Partition into (equal-frequency) bins:
       Bin 1: 4, 8, 15
       Bin 2: 21, 21, 24
       Bin 3: 25, 28, 34
     Smoothing by bin means:
       Bin 1: 9, 9, 9
       Bin 2: 22, 22, 22
       Bin 3: 29, 29, 29
     Smoothing by bin boundaries:
       Bin 1: 4, 4, 15
       Bin 2: 21, 21, 24
       Bin 3: 25, 25, 34
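The price example above can be reproduced with a short Python sketch; the function names are mine, and for simplicity the sketch assumes the bins divide the data into exactly equal sizes:

```python
from statistics import mean

def equal_frequency_bins(values, n_bins):
    """Partition sorted values into n_bins bins of equal size."""
    values = sorted(values)
    size = len(values) // n_bins
    return [values[i * size:(i + 1) * size] for i in range(n_bins)]

def smooth_by_means(bins):
    """Replace every value in a bin by the bin mean."""
    return [[round(mean(b))] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace every value by the closer of the bin's min and max."""
    out = []
    for b in bins:
        lo, hi = b[0], b[-1]
        out.append([lo if v - lo <= hi - v else hi for v in b])
    return out

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
bins = equal_frequency_bins(prices, 3)  # [[4, 8, 15], [21, 21, 24], [25, 28, 34]]
smooth_by_means(bins)       # [[9, 9, 9], [22, 22, 22], [29, 29, 29]]
smooth_by_boundaries(bins)  # [[4, 4, 15], [21, 21, 24], [25, 25, 34]]
```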

  11. Smoothing out noisy data (cont.)
     How else can we smooth out the data to remove the noise?
     - Regression: data smoothing can also be done by regression.
     - Outlier analysis: outliers may be detected by clustering; intuitively, values that fall outside of the set of clusters may be considered outliers.

  12. Data cleaning as a process
     Missing values, noise, and inconsistencies contribute to inaccurate data. The first step of the data cleaning process is discrepancy detection. Discrepancies can be caused by several factors, including:
     - Poorly designed data entry forms with many optional fields
     - Human error in data entry
     - Data decay (e.g., outdated addresses)
     - Inconsistent data representation
     - Inconsistent use of codes
     - Errors in instrumentation devices
     As a starting point, use any domain knowledge, for example the expected date format. The data should also be examined with respect to:
     - Unique rule: each value of the attribute must be different from all other values of that attribute.
     - Consecutive rule: there can be no missing values between the lowest and highest values for the attribute, and all values must also be unique.
     - Null rule: specifies the use of blanks, question marks, and special characters, and how such values should be handled.
     Some data inconsistencies may be corrected manually using external references (e.g., using a paper trace). Most errors, however, will require data transformations: defining and applying a series of transformations to correct the given attribute.
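As a sketch of discrepancy detection, the unique rule and null rule might be checked like this; the helper names and the set of null markers are illustrative assumptions, not part of any particular tool:

```python
def unique_rule_violations(values):
    """Unique rule: report each value that occurs more than once."""
    seen, violations = set(), []
    for v in values:
        if v in seen and v not in violations:
            violations.append(v)
        seen.add(v)
    return violations

def null_rule_positions(values, null_markers=("", "?", "N/A")):
    """Null rule: report positions holding an agreed-upon null placeholder."""
    return [i for i, v in enumerate(values) if v in null_markers]

customer_ids = ["c-101", "c-102", "c-101", "?"]
unique_rule_violations(customer_ids)  # ["c-101"]
null_rule_positions(customer_ids)     # [3]
```

Checks like these are the kind of transformation rules a commercial data-scrubbing or data-auditing tool applies at scale.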

  13. Outline: Introduction, Data preprocessing, Data cleaning, Data integration, Data transformation, Reading

  14. Data integration
     Data mining often requires data integration (the merging of data from multiple data stores). Careful integration can help reduce and avoid redundancies and inconsistencies in the resulting data set. This can help to improve the accuracy and speed of the subsequent data mining process. Issues in data integration:
     - Entity identification: schema integration and object matching can be tricky.
     - Redundancy and correlation analysis: an attribute (such as annual revenue, for instance) may be redundant if it can be derived from another attribute or set of attributes.
     - Tuple duplication: two or more records may refer to the same object.
     - Data value conflict detection and resolution: for the same real-world entity, attribute values from different sources may differ (e.g., telephone numbers). This may be due to differences in representation, scaling, or encoding.
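For numeric attributes, redundancy can be screened with a correlation coefficient: if two attributes are strongly correlated, one of them may be derivable from the other. A small pure-Python sketch (the helper name and sample data are mine):

```python
from statistics import mean, pstdev

def correlation(xs, ys):
    """Pearson correlation coefficient; |r| close to 1 suggests redundancy."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)
    return cov / (pstdev(xs) * pstdev(ys))

sales   = [10, 20, 30, 40]
revenue = [25, 50, 75, 100]   # exactly 2.5 * sales, so r is approximately 1
```

An attribute flagged this way is a candidate for removal, but the decision still needs domain judgment, since correlation does not by itself prove one attribute is derived from the other.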

  15. Data reduction
     The given dataset may be huge, and data analysis may take a long time. Data reduction techniques can be applied to obtain a reduced representation of the data set that is much smaller in volume, yet closely maintains the integrity of the original data.
     [Figure: data reduction shrinks a table of transactions T1, ..., T2000 over attributes A1, A2, A3, ..., A126 to a much smaller table of transactions T1, ..., T4 over attributes A1, A3, ..., A115.]

  16. Data reduction (cont.)
     Data reduction strategies:
     - Dimensionality reduction: the process of reducing the number of attributes under consideration.
       - Feature extraction (PCA, MDS, ...)
       - Feature selection
     - Numerosity reduction: these techniques replace the original data volume by an alternative, smaller form of data representation.
       - Linear regression
       - Histograms
       - Clustering
       - Sampling
       - Data cube aggregation
     - Data compression: transformations are applied so as to obtain a reduced or compressed representation of the original data.
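As one concrete numerosity-reduction sketch, an equal-width histogram replaces the raw values with per-bucket counts; the function name is mine, and the example reuses the price data from the binning slide:

```python
def equal_width_histogram(values, n_buckets):
    """Summarize values as counts over n_buckets equal-width buckets."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_buckets
    counts = [0] * n_buckets
    for v in values:
        # Clamp so the maximum value falls in the last bucket.
        idx = min(int((v - lo) / width), n_buckets - 1)
        counts[idx] += 1
    return counts

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
equal_width_histogram(prices, 3)  # buckets [4,14), [14,24), [24,34] -> [2, 3, 4]
```

Storing three counts instead of nine values is a small saving here, but the same summary scales to millions of tuples.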

  17. Outline: Introduction, Data preprocessing, Data cleaning, Data integration, Data transformation, Reading

  18. Data transformation
     In this step, the data are transformed or consolidated so that the resulting mining process may be more efficient, and the patterns found may be easier to understand. Data transformation strategies:
     - Smoothing (binning, regression, clustering)
     - Attribute construction (new attributes are constructed to help the mining process)
     - Aggregation
     - Normalization (min-max normalization, ...)
     - Discretization (binning, histogram, decision tree, clustering)
     - Concept hierarchy generation for nominal data
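Min-max normalization maps each value v to v' = (v − min) / (max − min) × (new_max − new_min) + new_min, scaling the attribute into a target range such as [0.0, 1.0]. A minimal sketch (the function name is mine):

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Linearly scale values into the range [new_min, new_max]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min
            for v in values]

min_max_normalize([4, 19, 34])  # -> [0.0, 0.5, 1.0]
```

This is the kind of scaling that helps distance-based mining algorithms, since it keeps one wide-ranged attribute from dominating the distance computation.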
