Data Preparation (Data preprocessing)

Data Preprocessing
• Why preprocess the data?
• Data cleaning
• Discretization
• Data integration and transformation
• Data reduction, Feature selection

Why Prepare Data?
• Some data preparation is needed for all mining tools
• The purpose of preparation is to transform data sets so that their information content is best exposed to the mining tool
• Error prediction rate should be lower (or the same) after the preparation as before it
• Preparing data also prepares the miner so that when using prepared data the miner produces better models, faster
• GIGO - good data is a prerequisite for producing effective models of any type
• Some techniques are based on theoretical considerations, while others are rules of thumb based on experience
Why Prepare Data?
• Data in the real world is dirty
  • incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
    • e.g., occupation=“”
  • noisy: containing errors or outliers
    • e.g., Salary=“-10”, Age=“222”
  • inconsistent: containing discrepancies in codes or names
    • e.g., Age=“42” Birthday=“03/07/1997”
    • e.g., Was rating “1,2,3”, now rating “A, B, C”
    • e.g., discrepancy between duplicate records

Major Tasks in Data Preprocessing
• Data cleaning
  • Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
• Data discretization
  • Part of data reduction but with particular importance, especially for numerical data
• Data integration
  • Integration of multiple databases, data cubes, or files
• Data transformation
  • Normalization and aggregation
• Data reduction
  • Obtains reduced representation in volume but produces the same or similar analytical results

Data Preparation as a step in the Knowledge Discovery Process
[Figure: knowledge discovery process. Labels: DB, Cleaning and Integration, DW, Selection and Transformation, Data preprocessing, Data Mining, Evaluation and Presentation, Knowledge]

Types of Data Measurements
• Measurements differ in their nature and the amount of information they give
• Qualitative vs. Quantitative
Types of Measurements
• Nominal scale
  • Gives unique names to objects - no other information deducible
  • Names of people
• Categorical scale
  • Names categories of objects
  • Although maybe numerical, not ordered
  • ZIP codes
  • Hair color
  • Gender: Male, Female
  • Marital Status: Single, Married, Divorcee, Widower
• Ordinal scale
  • Measured values can be ordered naturally
  • Transitivity: (A > B) and (B > C) ⇒ (A > C)
  • “blind” tasting of wines
  • Classifying students as: Very Good, Good, Sufficient, ...
  • Temperature: Cool, Mild, Hot
• Interval scale
  • The scale has a means to indicate the distance that separates measured values
  • Temperature
Types of Measurements
• Nominal scale
• Categorical scale
• Ordinal scale
• Interval scale
• Ratio scale
  • Measurement values can be used to determine a meaningful ratio between them
  • Bank account balance
  • Weight
  • Salary
• Nominal, categorical, and ordinal scales are qualitative; interval and ratio scales are quantitative (discrete or continuous)
• Information content increases from the nominal scale to the ratio scale

Data Cleaning
• Data cleaning tasks
  • Deal with missing values
  • Identify outliers and smooth out noisy data
  • Correct inconsistent data
Definitions
• Missing value - not captured in the data set: errors in data feeding, transmission, ...
• Empty value - no value in the population
• Outlier - out-of-range value

Missing Data
• Data is not always available
  • E.g., many tuples have no recorded value for several attributes, such as customer income in sales data
• Missing data may be due to
  • equipment malfunction
  • data that was inconsistent with other recorded data and thus deleted
  • data not entered due to misunderstanding
  • certain data not being considered important at the time of entry
  • history or changes of the data not being registered
• Missing data may need to be inferred.
• Missing values may carry some information content: e.g. a credit application may carry information by noting which field the applicant did not complete

Missing Values
• There are always missing values (MVs) in a real dataset
• MVs may have an impact on modelling; in fact, they can destroy it!
• Some tools ignore missing values, others use some metric to fill in replacements
• The modeller should avoid default automated replacement techniques
  • It is difficult to know their limitations, problems and introduced bias
• Replacing missing values without elsewhere capturing that information removes information from the dataset

How to Handle Missing Data?
• Ignore records (use only cases with all values)
  • Usually done when the class label is missing, as most prediction methods do not handle missing data well
  • Not effective when the percentage of missing values per attribute varies considerably, as it can lead to insufficient and/or biased sample sizes
• Ignore attributes with missing values
  • Use only features (attributes) with all values (may leave out important features)
• Fill in the missing value manually
  • tedious + infeasible?
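The first two options above (ignore records, ignore attributes) and the earlier warning about losing the information carried by a missing value can be sketched in a few lines of Python. This is a minimal illustration, not a recommended pipeline: the toy `records`, the use of `None` as the missing-value marker, and the helper names `ignore_records`, `ignore_attributes`, and `add_missing_flags` are all assumptions made for this example.

```python
# Hypothetical toy data; None marks a missing value (an assumption of this sketch).
records = [
    {"age": 34, "income": 52000, "label": "yes"},
    {"age": None, "income": 48000, "label": "no"},
    {"age": 45, "income": 39000, "label": "yes"},
    {"age": 29, "income": 61000, "label": None},
]


def ignore_records(rows):
    """Keep only the cases in which every attribute has a value."""
    return [r for r in rows if all(v is not None for v in r.values())]


def ignore_attributes(rows):
    """Keep only the attributes that have a value in every record."""
    complete = [k for k in rows[0] if all(r[k] is not None for r in rows)]
    return [{k: r[k] for k in complete} for r in rows]


def add_missing_flags(rows, attributes):
    """Record which fields were missing, so that information survives
    any later replacement of the missing values."""
    return [
        {**r, **{f"{a}_missing": r[a] is None for a in attributes}}
        for r in rows
    ]


complete_cases = ignore_records(records)      # drops the 2nd and 4th record
complete_attrs = ignore_attributes(records)   # only "income" survives
flagged = add_missing_flags(records, ["age", "income"])
```

On this toy data, ignoring records halves the sample and ignoring attributes discards both `age` and `label`, which illustrates why each option is only safe when little is missing.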
How to Handle Missing Data?
• Use a global constant to fill in the missing value
  • e.g., “unknown”. (May create a new class!)
• Use the attribute mean to fill in the missing value
  • It does the least harm to the mean of the existing data
  • Appropriate if the mean is to be unbiased
  • What if the standard deviation is to be unbiased?
• Use the attribute mean for all samples belonging to the same class to fill in the missing value

How to Handle Missing Data?
• Use the most probable value to fill in the missing value
  • Inference-based, such as a Bayesian formula or a decision tree
  • Identify relationships among variables
    • Linear regression, Multiple linear regression, Nonlinear regression
  • Nearest-Neighbour estimator
    • Find the k neighbours nearest to the point and fill in the most frequent value or the average value
    • Finding neighbours in a large dataset may be slow

How to Handle Missing Data?
• Note that it is as important to avoid adding bias and distortion to the data as it is to make the information available.
  • Bias is added when a wrong value is filled in
• No matter what techniques you use to conquer the problem, it comes at a price. The more guessing you have to do, the further away from the real data the database becomes. This, in turn, can affect the accuracy and validation of the mining results.

Outliers
• Outliers are values thought to be out of range.
• Approaches:
  • do nothing
  • enforce upper and lower bounds
  • let binning handle the problem (in the following slides)
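The attribute-mean and class-mean fill-in strategies, and the "enforce upper and lower bounds" approach to outliers, can be sketched as follows. This is an illustrative sketch under stated assumptions: `None` marks a missing value, every class has at least one observed value, and the function names, the toy `incomes`, and the bounds 0 and 120 are hypothetical choices for this example.

```python
from statistics import mean, pstdev


def fill_with_mean(values):
    """Replace None with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    m = mean(observed)
    return [m if v is None else v for v in values]


def fill_with_class_mean(values, labels):
    """Replace None with the mean of the observed values of the same class.

    Assumes every class has at least one observed value.
    """
    by_class = {}
    for v, c in zip(values, labels):
        if v is not None:
            by_class.setdefault(c, []).append(v)
    class_means = {c: mean(vs) for c, vs in by_class.items()}
    return [class_means[c] if v is None else v for v, c in zip(values, labels)]


def enforce_bounds(values, lower, upper):
    """Clamp every value into the interval [lower, upper]."""
    return [min(max(v, lower), upper) for v in values]


incomes = [30, None, 50, 70, None, 90]           # hypothetical values
classes = ["a", "a", "a", "b", "b", "b"]         # hypothetical class labels

observed = [v for v in incomes if v is not None]
mean_filled = fill_with_mean(incomes)
class_filled = fill_with_class_mean(incomes, classes)

# The mean is preserved, but the spread shrinks: mean fill-in biases
# the standard deviation downwards.
print(mean(observed), mean(mean_filled))
print(pstdev(observed), pstdev(mean_filled))

# Out-of-range values such as Age=222 or -10 are pulled back to the bounds.
print(enforce_bounds([25, 222, -10, 40], 0, 120))
```

The comparison of `pstdev` before and after filling addresses the question raised on the slide: filling with the mean keeps the mean unbiased but does not keep the standard deviation unbiased.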
Discretization
• Divide the range of a continuous attribute into intervals
  • Some classification algorithms only accept discrete attributes.
  • Reduce data size by discretization
  • Prepare for further analysis
• Discretization is very useful for generating a summary of data
• Also called “binning”

Equal-width Binning
• It divides the range into N intervals of equal size (range): uniform grid
• If A and B are the lowest and highest values of the attribute, the width of intervals will be: W = (B - A) / N.
• The most straightforward method
• Outliers may dominate presentation
• Skewed data is not handled well.
• Advantage
  (a) simple and easy to implement
  (b) produce a reasonable abstraction of data
• Disadvantage
  (a) Unsupervised
  (b) Where does N come from?
  (c) Sensitive to outliers

Equal-depth Binning
• It divides the range into N intervals, each containing approximately same number of samples
• Generally preferred because avoids clumping
• In practice, “almost-equal” height binning is used to give more intuitive breakpoints
• Additional considerations:
  • don’t split frequent values across bins
  • create separate bins for special values (e.g. 0)
  • readable breakpoints (e.g. round breakpoints)
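The two binning schemes can be compared on a small example. The sketch below is a minimal illustration, assuming numeric input and a caller-supplied N; the function names and the sample `values` are hypothetical. The equal-depth version is deliberately naive: it splits by rank only, so it does not implement the additional considerations listed above (it may split tied values across bins and does not round breakpoints).

```python
def equal_width_bins(values, n):
    """Assign each value to one of n intervals of width W = (B - A) / n."""
    a, b = min(values), max(values)
    width = (b - a) / n
    if width == 0:
        # All values are identical, so everything falls into one bin.
        return [0 for _ in values]
    # The maximum value would fall into bin n, so clamp it into the last bin.
    return [min(int((v - a) / width), n - 1) for v in values]


def equal_depth_bins(values, n):
    """Assign each value to one of n bins holding roughly the same
    number of samples (naive rank-based split)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * n // len(values)
    return bins


values = [1, 2, 3, 4, 5, 6, 7, 8, 100]   # hypothetical data with one outlier

width_bins = equal_width_bins(values, 3)
depth_bins = equal_depth_bins(values, 3)

print(width_bins)   # the outlier stretches the range, so the rest share one bin
print(depth_bins)   # each bin receives three samples
```

With the single outlier 100, equal-width binning puts the eight small values into the first bin and leaves the middle bin empty, which is the sensitivity to outliers noted above; equal-depth binning spreads the same values evenly.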