Data Preparation: Discretization and Data Cleaning (Data Pre-processing)


1. Data Preparation
• Why prepare the data?
• Discretization
• Data cleaning (data pre-processing)
• Data integration and transformation
• Data reduction, feature selection

Why Prepare Data?
• Some data preparation is needed for all mining tools.
• The purpose of preparation is to transform data sets so that their information content is best exposed to the mining tool.
• Preparing data also prepares the miner, so that when using prepared data the miner produces better models, faster.
• The error (prediction) rate should be lower (or the same) after the preparation as before it.
• GIGO: good data is a prerequisite for producing effective models of any type.

2. Major Tasks in Data Preparation
• Data discretization: part of data reduction, but of particular importance, especially for numerical data.
• Data cleaning: fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies (a code sketch follows below).
• Data integration: integration of multiple databases, data cubes, or files.
• Data transformation: normalization and aggregation.
• Data reduction: obtains a representation reduced in volume that produces the same or similar analytical results.

Why Prepare Data?
• Data need to be formatted for a given software tool.
• Data in the real world is dirty:
• incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data; e.g., occupation=""
• noisy: containing errors or outliers; e.g., Salary="-10", Age="222"
• inconsistent: containing discrepancies in codes or names; e.g., Age="42" but Birthday="03/07/1997"; a rating that was "1, 2, 3" is now "A, B, C"; discrepancies between duplicate records

Data Preparation as a Step in the Knowledge Discovery Process
[Figure: the knowledge discovery pipeline, from DB/DW through cleaning and integration, selection and transformation, data mining, and evaluation and presentation, up to knowledge; data preparation covers the stages before data mining.]

Types of Data Measurements
• Measurements differ in their nature and the amount of information they give.
• Qualitative vs. quantitative.
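To make the "dirty data" examples above concrete, here is a minimal cleaning sketch in pandas. The DataFrame, column names, and plausibility thresholds are illustrative assumptions, not taken from the slides:

```python
# A minimal cleaning sketch on a hypothetical DataFrame;
# column names and plausibility thresholds are illustrative.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "occupation": ["engineer", "", "teacher", "nurse"],
    "salary":     [52000, -10, 61000, 58000],
    "age":        [34, 222, 41, 29],
})

# Incomplete data: treat empty strings as missing, then fill them in.
df["occupation"] = df["occupation"].replace("", np.nan).fillna("unknown")

# Noisy data: mark values outside a plausible range as missing ...
df.loc[~df["salary"].between(0, 1_000_000), "salary"] = np.nan
df.loc[~df["age"].between(0, 120), "age"] = np.nan

# ... and smooth them, here by imputing the column median.
for col in ("salary", "age"):
    df[col] = df[col].fillna(df[col].median())

print(df)
```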

3. Types of Measurements
• Nominal scale
• Gives unique names to objects; no other information is deducible.
• Examples: names of people, ZIP codes, Gender: Male or Female.
• Categorical scale
• Names categories of objects; the values may be numerical, but they are not ordered.
• Examples: hair color; Marital Status: Single, Married, Divorcee, Widower.
• Ordinal scale
• Measured values can be ordered naturally.
• Transitivity: (A > B) and (B > C) ⇒ (A > C).
• Examples: "blind" tasting of wines; classifying students as Very Good, Good, Sufficient, ...; Temperature: Cool, Mild, Hot.
• Interval scale
• The scale has a means to indicate the distance that separates measured values.
• Example: temperature.

4. Types of Measurements (continued)
• Ratio scale
• Measurement values can be used to determine a meaningful ratio between them.
• Examples: bank account balance, weight, salary.
• Summary: information content increases from nominal through categorical, ordinal, and interval to ratio scale. Nominal, categorical, and ordinal scales are qualitative; interval and ratio scales are quantitative and may be discrete or continuous.

Data Conversion
• Some tools can deal with nominal values, but others need fields to be numeric.
• Convert ordinal fields to numeric to be able to use ">" and "<" comparisons on such fields:
• A → 4.0, A- → 3.7, B+ → 3.3, B → 3.0
• For multi-valued, unordered attributes with a small number of values, create binary flag fields for selected values:
• e.g. Color = Red, Orange, Yellow, ..., Violet: for each value v create a binary "flag" variable C_v, which is 1 if Color = v and 0 otherwise.

Conversion: Nominal, Many Values
• Examples: US State Code (50 values); Profession Code (7,000 values, but only a few frequent).
• Ignore ID-like fields whose values are unique for each record.
• For other fields, group values "naturally":
• e.g. 50 US states → 3 or 5 regions
• Profession: select the most frequent values and group the rest.
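A short pandas sketch of the three conversions above: mapping grades to points, one-hot flag fields, and grouping a many-valued nominal. The sample data and the choice to keep only the two most frequent professions are illustrative assumptions:

```python
# A sketch of the three conversions; the sample data and the choice
# to keep only the two most frequent professions are assumptions.
import pandas as pd

df = pd.DataFrame({
    "grade":      ["A", "B+", "A-", "B"],
    "color":      ["Red", "Violet", "Red", "Orange"],
    "profession": ["teacher", "teacher", "falconer", "nurse"],
})

# Ordinal -> numeric, so ">" and "<" comparisons become meaningful.
grade_points = {"A": 4.0, "A-": 3.7, "B+": 3.3, "B": 3.0}
df["grade_num"] = df["grade"].map(grade_points)

# Unordered attribute with few values -> one binary flag per value
# (the C_v variables from the slide).
df = pd.concat([df, pd.get_dummies(df["color"], prefix="color")], axis=1)

# Many-valued nominal: keep the most frequent values, group the rest.
top = df["profession"].value_counts().nlargest(2).index
df["profession_grp"] = df["profession"].where(df["profession"].isin(top),
                                              other="other")
print(df)
```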

5. Discretization
• Divides the range of a continuous attribute into intervals.
• Some methods require discrete values, e.g. most versions of Naïve Bayes, CHAID.
• Reduces data size.
• Prepares the data for further analysis.
• Discretization is very useful for generating a summary of data.
• Also called "binning".

Top-down (Splitting) versus Bottom-up (Merging)
• Top-down methods start with an empty list of cut-points (or split-points) and keep adding new ones to the list by 'splitting' intervals as the discretization progresses.
• Bottom-up methods start with the complete list of all the continuous values of the feature as cut-points and remove some of them by 'merging' intervals as the discretization progresses.

Equal-width Binning
• Divides the range into N intervals of equal size: a uniform grid.
• If A and B are the lowest and highest values of the attribute, the width of the intervals is W = (B - A) / N.
• Example (N = 7, bins are low <= value < high): the temperature values 64 65 68 69 70 71 72 72 75 75 80 81 83 85 give counts of 2, 2, 4, 2, 0, 2, 2 over the bins [64,67), [67,70), [70,73), [73,76), [76,79), [79,82), [82,85].
• Outlier example: salaries in a corporation cluster in [0 – 200,000) while a few values fall in [1,800,000 – 2,000,000], so equal-width bins leave most of the range almost empty.
• Advantages: (a) simple and easy to implement; (b) produces a reasonable abstraction of the data.
• Disadvantages: (a) unsupervised; (b) where does N come from?; (c) sensitive to outliers.
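The temperature example above can be reproduced with pandas' cut. The explicit bin edges are a choice made here to match the slide's low <= value < high convention, with the top edge padded so the maximum value is included:

```python
# Equal-width binning of the slide's temperature values with pandas;
# explicit edges reproduce the low <= value < high convention, and
# the top edge is padded to 86 so the maximum (85) is included.
import pandas as pd

temps = pd.Series([64, 65, 68, 69, 70, 71, 72, 72, 75, 75, 80, 81, 83, 85])

# N = 7 bins of width W = (85 - 64) / 7 = 3.
edges = [64, 67, 70, 73, 76, 79, 82, 86]
bins = pd.cut(temps, bins=edges, right=False)
print(bins.value_counts().sort_index())   # counts: 2, 2, 4, 2, 0, 2, 2
```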

6. Equal-depth (or Height) Binning
• Divides the range into N intervals, each containing approximately the same number of samples.
• Generally preferred, because it avoids clumping.
• In practice, "almost-equal" height binning is used to give more intuitive breakpoints.
• Additional considerations:
• don't split frequent values across bins;
• create separate bins for special values (e.g. 0);
• use readable breakpoints (e.g. round breakpoints).
• Example (see the sketch after this slide group): the temperature values 64 65 68 69 70 71 72 72 75 75 80 81 83 85 are binned as [64 .. 69]: 4, [70 .. 72]: 4, [73 .. 81]: 4, [83 .. 85]: 2, i.e. an equal height of 4 except for the last bin.

Discretization Considerations
• Class-independent methods:
• Equal Width is simpler and good for many classes, but can fail miserably for unequal distributions.
• Equal Height gives better results.
• Class-dependent methods can be better for classification.
• Note: decision tree methods build the discretization on the fly, whereas Naïve Bayes requires an initial discretization.
• Many other methods exist.

Method 1R
• Developed by Holte (1993); a supervised discretization method using binning.
• After sorting the data, the range of continuous values is divided into a number of disjoint intervals, and the boundaries of those intervals are adjusted based on the class labels associated with the values of the feature.
• Each interval should contain a given minimum of instances (6 by default), with the exception of the last one.
• The adjustment of a boundary continues until the next value belongs to a class different from the majority class in the adjacent interval.
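A sketch of equal-depth binning with pandas' qcut on the same temperature values. Note that qcut picks its own quantile breakpoints, so they differ slightly from the hand-chosen bins on the slide:

```python
# Equal-depth binning of the same temperature values with pandas' qcut.
# qcut chooses quantile breakpoints itself, so the bins differ slightly
# from the hand-picked ones on the slide and counts are only roughly equal.
import pandas as pd

temps = pd.Series([64, 65, 68, 69, 70, 71, 72, 72, 75, 75, 80, 81, 83, 85])

bins = pd.qcut(temps, q=4)                # 4 bins, ~14/4 samples each
print(bins.value_counts().sort_index())   # counts: 4, 4, 2, 4
```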

7. 1R Example
[Figure: a worked 1R discretization example from the slides; its numeric details are not recoverable from the extracted text.]

Entropy-Based Discretization
Class-dependent (for classification):
1. Sort the examples in increasing order.
2. Each value forms an interval ('m' intervals).
3. Calculate the entropy measure of this discretization.
4. Find the binary split boundary T that minimizes the entropy function over all possible boundaries; the split is selected as a binary discretization:
E(S, T) = (|S1| / |S|) · Ent(S1) + (|S2| / |S|) · Ent(S2)
5. Apply the process recursively until some stopping criterion is met, e.g., Ent(S) − E(T, S) > δ.

Entropy / Impurity
• S is the training set with classes C_1, ..., C_N.
• Entropy Ent(S) is a measure of the impurity in a group of examples.
• p_c is the proportion of class C_c in S:
Impurity(S) = Ent(S) = −Σ (c = 1 to N) p_c · log2(p_c)
• Example values for two classes (maximum: log2(2) = 1):

p      1 − p   Ent
0.2    0.8     0.72
0.4    0.6     0.97
0.5    0.5     1.00
0.6    0.4     0.97
0.8    0.2     0.72

• Example values for three classes (maximum: log2(3) ≈ 1.58):

p1     p2     p3     Ent
0.1    0.1    0.8    0.92
0.2    0.2    0.6    1.37
0.1    0.45   0.45   1.37
0.2    0.4    0.4    1.52
0.3    0.3    0.4    1.57
0.33   0.33   0.33   1.58
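A minimal NumPy sketch of steps 1-4: it evaluates E(S, T) at every candidate boundary and returns the minimizing cut-point. The class labels paired with the temperature values are hypothetical, chosen only to exercise the code:

```python
# A sketch of steps 1-4 of entropy-based discretization: evaluate
# E(S, T) at every candidate boundary and return the best cut-point.
# The class labels below are hypothetical, chosen only to run the code.
import numpy as np

def entropy(labels):
    """Ent(S) = -sum_c p_c * log2(p_c) over the class proportions."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def best_binary_split(values, labels):
    """Boundary T minimizing E(S,T) = |S1|/|S|*Ent(S1) + |S2|/|S|*Ent(S2)."""
    order = np.argsort(values)                  # step 1: sort
    values = np.asarray(values)[order]
    labels = np.asarray(labels)[order]
    n = len(values)
    best_t, best_e = None, np.inf
    for i in range(1, n):                       # steps 2-4: scan boundaries
        if values[i] == values[i - 1]:
            continue                            # no cut between equal values
        t = (values[i - 1] + values[i]) / 2
        e = i / n * entropy(labels[:i]) + (n - i) / n * entropy(labels[i:])
        if e < best_e:
            best_t, best_e = t, e
    return best_t, best_e

temps = [64, 65, 68, 69, 70, 71, 72, 72, 75, 75, 80, 81, 83, 85]
play  = ["n", "n", "y", "y", "y", "n", "y", "y", "y", "y", "n", "y", "y", "n"]
t, e = best_binary_split(temps, play)
print(f"cut-point T = {t}, E(S,T) = {e:.3f}, Ent(S) = {entropy(play):.3f}")
```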
