Data Preparation (Data Preprocessing)

Data Preprocessing
• Why preprocess the data?
• Data cleaning
• Data integration and transformation
• Data reduction, feature selection
• Discretization and concept hierarchy generation

Why Prepare Data?
• Some data preparation is needed for all mining tools
• Preparing data also prepares the miner, so that when using prepared data the miner produces better models, faster
• The purpose of preparation is to transform data sets so that their information content is best exposed to the mining tool
• GIGO (garbage in, garbage out): good data is a prerequisite for producing effective models of any type
• The error prediction rate should be lower (or the same) after the preparation as before it
• Some techniques are based on theoretical considerations, while others are rules of thumb based on experience
Why Prepare Data?
• Data in the real world is dirty
  • incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
    • e.g., occupation=“”
  • noisy: containing errors or outliers
    • e.g., Salary=“-10”, Age=“222”
  • inconsistent: containing discrepancies in codes or names
    • e.g., Age=“42”, Birthday=“03/07/1997”
    • e.g., was rating “1, 2, 3”, now rating “A, B, C”
    • e.g., discrepancy between duplicate records

Why Data is Dirty?
• Incomplete data comes from
  • value not available when the data was collected
  • different considerations between the time the data was collected and the time it is analyzed
  • human/hardware/software problems
• Noisy data comes from the process of data
  • collection
  • entry
  • transmission
• Inconsistent data comes from
  • different data sources
  • functional dependency violations

Major Tasks in Data Preprocessing
• Data cleaning
  • fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
• Data integration
  • integration of multiple databases, data cubes, or files
• Data transformation
  • normalization and aggregation
• Data reduction
  • obtains a reduced representation in volume that produces the same or similar analytical results
• Data discretization
  • part of data reduction, but of particular importance, especially for numerical data

Why Is Data Preprocessing Important?
• “Data extraction, cleaning, and transformation comprises the majority of the work of building a data warehouse.” (Bill Inmon)
• Data preparation takes more than 80% of the data mining project time
• Data preparation requires knowledge of the “business”
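The data transformation task above mentions normalization. A minimal sketch of min-max normalization, one common normalization technique; the attribute name and sample values are invented for illustration:

```python
def min_max_normalize(values):
    """Rescale a list of numbers to the [0, 1] range (assumes max > min)."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical salary attribute, before and after normalization.
salary = [30_000, 45_000, 60_000, 90_000]
print(min_max_normalize(salary))  # [0.0, 0.25, 0.5, 1.0]
```

After normalization, attributes measured on very different scales (e.g., salary vs. age) contribute comparably to distance-based mining methods.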
Forms of data preprocessing
[Figure: forms of data preprocessing]

Data Preparation as a step in the Knowledge Discovery Process
[Figure: DB feeds Cleaning and Integration into a DW; Selection and Transformation produce the data for Data Mining; Evaluation and Presentation yield Knowledge. Data preprocessing covers the cleaning, integration, selection, and transformation steps.]

Types of Data Measurements
• Measurements differ in their nature and in the amount of information they give
• Qualitative vs. Quantitative

Types of Measurements
• Nominal scale
  • gives unique names to objects; no other information deducible
  • e.g., names of people
Types of Measurements
• Nominal scale
• Categorical scale
  • names categories of objects
  • although possibly numerical, the values are not ordered
    • e.g., ZIP codes
    • Hair color
    • Gender: Male, Female
    • Marital status: Single, Married, Divorced, Widowed

Types of Measurements
• Nominal scale
• Categorical scale
• Ordinal scale
  • measured values can be ordered naturally
  • transitivity: (A > B) and (B > C) ⇒ (A > C)
    • e.g., “blind” tasting of wines
    • classifying students as: Very Good, Good, Sufficient, ...
    • Temperature: Cool, Mild, Hot

Types of Measurements
• Nominal scale
• Categorical scale
• Ordinal scale
• Interval scale
  • the scale has a means to indicate the distance that separates measured values
    • e.g., temperature

Types of Measurements
• Nominal scale
• Categorical scale
• Ordinal scale
• Interval scale
• Ratio scale
  • measurement values can be used to determine a meaningful ratio between them
    • e.g., bank account balance, weight, salary
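The ordinal scale's defining property, ordering with transitivity but without meaningful distances, can be sketched in code. The integer ranks below are invented purely to encode the order of the Temperature example; the numbers themselves carry no interval information:

```python
# Hypothetical ranks encoding only the order Cool < Mild < Hot.
TEMPERATURE_ORDER = {"Cool": 0, "Mild": 1, "Hot": 2}

def hotter_than(a, b):
    """True if label a ranks above label b on the ordinal scale."""
    return TEMPERATURE_ORDER[a] > TEMPERATURE_ORDER[b]

print(hotter_than("Hot", "Mild"))   # True
print(hotter_than("Cool", "Mild"))  # False
# Transitivity: Hot > Mild and Mild > Cool imply Hot > Cool.
print(hotter_than("Hot", "Cool"))   # True
```

Note that subtracting the ranks (e.g., 2 - 0) would be meaningless: that would require at least an interval scale.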
Types of Measurements
• Qualitative: nominal scale, categorical scale, ordinal scale
• Quantitative: interval scale, ratio scale (discrete or continuous)
• Information content increases from the nominal scale to the ratio scale

Data Preprocessing
• Why preprocess the data?
• Data cleaning
• Data integration and transformation
• Data reduction
• Discretization and concept hierarchy generation

Data Cleaning
• Data cleaning tasks
  • deal with missing values
  • identify outliers and smooth out noisy data
  • correct inconsistent data

Definitions
• Missing value: not captured in the data set (errors in feeding, transmission, ...)
• Empty value: no value exists in the population
• Outlier: an out-of-range value
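An "out-of-range" value can be flagged mechanically. A sketch using Tukey's interquartile-range rule, one common convention not specific to these slides; the k=1.5 factor and the sample ages are illustrative:

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR] (Tukey's rule)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lo or v > hi]

# Hypothetical age attribute; 222 is the kind of entry error seen earlier.
ages = [22, 25, 27, 30, 31, 33, 35, 222]
print(iqr_outliers(ages))  # [222]
```

Whether a flagged value is an error or a genuinely interesting case (e.g., fraud) still requires domain judgment.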
Missing Data
• Data is not always available
  • e.g., many tuples have no recorded value for several attributes, such as customer income in sales data
• Missing data may be due to
  • equipment malfunction
  • inconsistency with other recorded data, leading to deletion
  • data not entered due to misunderstanding
  • certain data not being considered important at the time of entry
  • history or changes of the data not being registered
• Missing data may need to be inferred

Missing Values
• There are always missing values (MVs) in a real dataset
• MVs may have an impact on modeling; in fact, they can destroy it!
• Some tools ignore missing values; others use some metric to fill in replacements
• The modeler should avoid default automated replacement techniques
  • it is difficult to know their limitations, problems, and introduced bias
• Missing values may carry some information content: e.g., a credit application may carry information in which fields the applicant did not complete
  • replacing missing values without capturing that information elsewhere removes information from the dataset

How to Handle Missing Data?
• Ignore records (use only cases with all values)
  • usually done when the class label is missing, as most prediction methods do not handle missing data well
  • not effective when the percentage of missing values per attribute varies considerably, as it can lead to insufficient and/or biased sample sizes
• Ignore attributes with missing values
  • use only features (attributes) with all values (may leave out important features)
• Fill in the missing value manually
  • tedious + infeasible?

How to Handle Missing Data?
• Use a global constant to fill in the missing value
  • e.g., “unknown” (may create a new class!)
• Use the attribute mean to fill in the missing value
  • it will do the least harm to the mean of the existing data
  • if the mean is to be unbiased
  • what if the standard deviation is to be unbiased?
• Use the attribute mean for all samples belonging to the same class to fill in the missing value
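The last two strategies above can be sketched as follows. This assumes missing entries are encoded as `None`; the income values and class labels are invented for illustration:

```python
import statistics

def impute_mean(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    mean = statistics.mean(observed)
    return [mean if v is None else v for v in values]

def impute_class_mean(values, labels):
    """Replace None with the mean of observed values sharing the same class label."""
    by_class = {}
    for v, c in zip(values, labels):
        if v is not None:
            by_class.setdefault(c, []).append(v)
    means = {c: statistics.mean(vs) for c, vs in by_class.items()}
    return [means[c] if v is None else v for v, c in zip(values, labels)]

incomes = [40.0, None, 60.0, 50.0, None]   # hypothetical attribute
labels = ["a", "a", "b", "b", "b"]         # hypothetical class labels
print(impute_mean(incomes))                # [40.0, 50.0, 60.0, 50.0, 50.0]
print(impute_class_mean(incomes, labels))  # [40.0, 40.0, 60.0, 50.0, 55.0]
```

As the slides warn, both variants preserve the (per-group) mean but shrink the variance, which is one way imputation introduces bias.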
How to Handle Missing Data?
• Use the most probable value to fill in the missing value
  • inference-based, e.g., Bayesian formula or decision tree
  • identify relationships among variables
    • linear regression, multiple linear regression, nonlinear regression
  • nearest-neighbor estimator
    • find the k neighbors nearest to the point and fill in the most frequent value or the average value
    • finding neighbors in a large dataset may be slow

How to Handle Missing Data?
• Note that it is as important to avoid adding bias and distortion to the data as it is to make the information available
  • bias is added when a wrong value is filled in
• No matter what technique you use to conquer the problem, it comes at a price. The more guessing you have to do, the further away from the real data the database becomes; this, in turn, can affect the accuracy and validity of the mining results.

Noisy Data
• Noise: random error or variance in a measured variable
• Incorrect attribute values may be due to
  • faulty data collection instruments
  • data entry problems
  • data transmission problems
  • technology limitations (e.g., limited buffer size)
  • inconsistency in naming conventions

How to Handle Noisy Data?
• Binning (smooth a value by consulting its neighborhood)
  • first sort the data and partition it into a number of bins
  • then smooth by bin means, bin medians, or bin boundaries
• Clustering
  • detect and remove outliers
  • (outliers may be what we are looking for, e.g., in fraud detection)
• Regression
  • smooth by fitting the data to regression functions
• Combined computer and human inspection
  • detect suspicious values (using measures that reflect the degree of surprise, or by comparing the predicted class label with the known label) and have a human check them
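The binning method above, sort, partition into equal-depth bins, then smooth by bin means, can be sketched as follows; the price values are illustrative:

```python
import statistics

def smooth_by_bin_means(values, n_bins):
    """Equal-depth binning: sort, split into n_bins bins, replace each value by its bin mean."""
    s = sorted(values)
    size = len(s) // n_bins
    out = []
    for i in range(n_bins):
        # Last bin absorbs any leftover values when len(s) is not divisible by n_bins.
        bin_vals = s[i * size:(i + 1) * size] if i < n_bins - 1 else s[i * size:]
        out.extend([statistics.mean(bin_vals)] * len(bin_vals))
    return out

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]  # hypothetical sorted attribute
print(smooth_by_bin_means(prices, 3))
# [9, 9, 9, 22, 22, 22, 29, 29, 29]
```

Each bin's values collapse to the bin mean, removing local noise at the cost of some resolution; smoothing by bin medians or bin boundaries follows the same partitioning step.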