Data Warehousing and Machine Learning Preprocessing Thomas D. Nielsen Aalborg University Department of Computer Science Spring 2008 DWML Spring 2008 1 / 35
Preprocessing Before you can start on the actual data mining, the data may require some preprocessing: • Attributes may be redundant. • Values may be missing. • The data contains outliers. • The data is not in a suitable format. • The values appear inconsistent. Garbage in, garbage out DWML Spring 2008 2 / 35
Preprocessing Data Cleaning ID Zip Gander Income Age Marital status Transaction amount 1001 10048 M 75000 C M 5000 1002 J2S7K7 F -40000 40 W 4000 1003 90210 10000000 45 S 7000 1004 6269 M 50000 0 S 1000 1005 55101 F 99999 30 D 3000 DWML Spring 2008 3 / 35
Preprocessing Data Cleaning ID Zip Gander Income Age Marital status Transaction amount 1001 10048 M 75000 C M 5000 1002 J2S7K7 F -40000 40 W 4000 1003 90210 10000000 45 S 7000 1004 6269 M 50000 0 S 1000 1005 55101 F 99999 30 D 3000 – Correct zip code? DWML Spring 2008 3 / 35
Preprocessing Data Cleaning ID Zip Gander Income Age Marital status Transaction amount 1001 10048 M 75000 C M 5000 1002 J2S7K7 F -40000 40 W 4000 1003 90210 10000000 45 S 7000 1004 6269 M 50000 0 S 1000 1005 55101 F 99999 30 D 3000 – Correct zip code? DWML Spring 2008 3 / 35
Preprocessing Data Cleaning ID Zip Gander Income Age Marital status Transaction amount 1001 10048 M 75000 C M 5000 1002 J2S7K7 F -40000 40 W 4000 1003 90210 ?? 10000000 45 S 7000 1004 6269 M 50000 0 S 1000 1005 55101 F 99999 30 D 3000 – Missing value! DWML Spring 2008 3 / 35
Preprocessing Data Cleaning ID Zip Gander Income Age Marital status Transaction amount 1001 10048 M 75000 C M 5000 1002 J2S7K7 F -40000 40 W 4000 1003 90210 10000000 45 S 7000 1004 6269 M 50000 0 S 1000 1005 55101 F 99999 30 D 3000 – Error/outlier! DWML Spring 2008 3 / 35
Preprocessing Data Cleaning ID Zip Gander Income Age Marital status Transaction amount 1001 10048 M 75000 C M 5000 1002 J2S7K7 F -40000 40 W 4000 1003 90210 10000000 45 S 7000 1004 6269 M 50000 0 S 1000 1005 55101 F 99999 30 D 3000 – Error! DWML Spring 2008 3 / 35
Preprocessing Data Cleaning ID Zip Gander Income Age Marital status Transaction amount 1001 10048 M 75000 C M 5000 1002 J2S7K7 F -40000 40 W 4000 1003 90210 10000000 45 S 7000 1004 6269 M 50000 0 S 1000 1005 55101 F 99999 30 D 3000 – Unexpected precision. DWML Spring 2008 3 / 35
Preprocessing Data Cleaning ID Zip Gander Income Age Marital status Transaction amount 1001 10048 M 75000 C M 5000 1002 J2S7K7 F -40000 40 W 4000 1003 90210 10000000 45 S 7000 1004 6269 M 50000 0 S 1000 1005 55101 F 99999 30 D 3000 – Categorical value? DWML Spring 2008 3 / 35
Preprocessing Data Cleaning ID Zip Gander Income Age Marital status Transaction amount 1001 10048 M 75000 C M 5000 1002 J2S7K7 F -40000 40 W 4000 1003 90210 10000000 45 S 7000 1004 6269 M 50000 0 S 1000 1005 55101 F 99999 30 D 3000 – Error/missing value? DWML Spring 2008 3 / 35
Preprocessing Data Cleaning ID Zip Gander Income Age Marital status Transaction amount 1001 10048 M 75000 C M 5000 1002 J2S7K7 F -40000 40 W 4000 1003 90210 10000000 45 S 7000 1004 6269 M 50000 0 S 1000 1005 55101 F 99999 30 D 3000 Other issues: • What are the semantics of the marital status? DWML Spring 2008 3 / 35
Preprocessing Data Cleaning ID Zip Gander Income Age Marital status Transaction amount 1001 10048 M 75000 C M 5000 1002 J2S7K7 F -40000 40 W 4000 1003 90210 10000000 45 S 7000 1004 6269 M 50000 0 S 1000 1005 55101 F 99999 30 D 3000 Other issues: • What are the semantics of the marital status? • What is the unit of measure for the transaction field? DWML Spring 2008 3 / 35
Preprocessing Missing Values In many real world data bases you will be faced with the problem of missing data: Id. Savings Assets Income Credit Risk ($ 1000s) 1 Medium High 75 Good 2 Low Low 50 Bad 3 25 Bad 4 Medium Medium Good 5 Low Medium 100 Good 6 High High 25 Good 7 Low 25 Bad 8 Medium Medium 75 Good By simply discarding the records with missing data we might unintentionally bias the data. DWML Spring 2008 4 / 35
Preprocessing Missing Values Possible strategies for handling missing data: • Use a predefined constant. • Use the mean (for numerical variables) or the mode (for categorical values). • Use a value drawn randomly form the observed distribution. Id. Savings Assets Income Credit Risk ($ 1000s) 1 Medium High 75 Good 2 Low Low 50 Bad 3 25 Bad 4 Medium Medium Good 5 Low Medium 100 Good 6 High High 25 Good 7 Low 25 Bad 8 Medium Medium 75 Good DWML Spring 2008 5 / 35
Preprocessing Missing Values Possible strategies for handling missing data: • Use a predefined constant. • Use the mean (for numerical variables) or the mode (for categorical values). • Use a value drawn randomly form the observed distribution. Id. Savings Assets Income Credit Risk ($ 1000s) 1 Medium High 75 Good 2 Low Low 50 Bad 3 Low 25 Bad 4 Medium Medium Good 5 Low Medium 100 Good 6 High High 25 Good 7 Low 25 Bad 8 Medium Medium 75 Good Both Low and Medium are ’modes’ for savings. DWML Spring 2008 5 / 35
Preprocessing Missing Values Possible strategies for handling missing data: • Use a predefined constant. • Use the mean (for numerical variables) or the mode (for categorical values). • Use a value drawn randomly form the observed distribution. Id. Savings Assets Income Credit Risk ($ 1000s) 1 Medium High 75 Good 2 Low Low 50 Bad 3 Low High 25 Bad 4 Medium Medium Good 5 Low Medium 100 Good 6 High High 25 Good 7 Low Medium 25 Bad 8 Medium Medium 75 Good High and Medium are drawn randomly from the observed distribution for Assets. DWML Spring 2008 5 / 35
Preprocessing Missing Values Possible strategies for handling missing data: • Use a predefined constant. • Use the mean (for numerical variables) or the mode (for categorical values). • Use a value drawn randomly form the observed distribution. Id. Savings Assets Income Credit Risk ($ 1000s) 1 Medium High 75 Good 2 Low Low 50 Bad 3 Low High 25 Bad 4 Medium Medium 54 Good 5 Low Medium 100 Good 6 High High 25 Good 7 Low Medium 25 Bad 8 Medium Medium 75 Good 54 ≈ 75 + 50 + 25 + 100 + 25 + 25 + 75 . 7 DWML Spring 2008 5 / 35
Preprocessing Discretization Some data mining algorithms can only handle discrete attributes. Possible solution: Divide the continuous range into intervals. Example: ( Income , Risk ) = � ( 25 , B ) , ( 25 , B ) , ( 50 , G ) , ( 51 , B ) , ( 54 , G ) , ( 75 , G ) , ( 75 , G )( 100 , G ) , ( 100 , G ) � Unsupervised discretization Equal width binning (width 25): Bin 1: 25, 25 [ 25 , 50 ) Bin 2: 50, 51, 54 [ 50 , 75 ) Bin 3: 75, 75, 100, 100 [ 75 , 100 ] Equal frequency binning (bin density 3): Bin 1: 25, 25, 50 [ 25 , 50 . 5 ) Bin 2: 51, 54, 75, 75 [ 50 . 5 , 87 . 5 ) Bin 3: 100, 100 [ 87 . 5 , 100 ] DWML Spring 2008 6 / 35
Preprocessing Supervised discretization Take the class distribution into account when selecting the intervals. For example, recursively bisect the interval by selecting the split point giving the highest information gain: » | S ≤ v | | S | Ent ( S ≤ v ) + | S > v | – Gain ( S , v ) = Ent ( S ) − | S | Ent ( S > v ) Until some stopping criteria is met. ( Income , Risk ) = � ( 25 , B ) , ( 25 , B ) , ( 50 , G ) , ( 51 , B ) , ( 54 , G ) , ( 75 , G ) , ( 75 , G )( 100 , G ) , ( 100 , G ) � „ 3 3 9 + 6 6 « Ent ( S ) = − 9 log 2 9 log 2 = 0 . 9183 9 Split E-Ent Interval 25 0 . 4602 ( −∞ , 25 ] , ( 25 , ∞ ) 50 0 . 7395 ( −∞ , 50 ] , ( 50 , ∞ ) 51 0 . 3606 ( −∞ , 51 ] , ( 51 , ∞ ) 54 0 . 5394 ( −∞ , 54 ] , ( 54 , ∞ ) 75 0 . 7663 ( −∞ , 75 ] , ( 75 , ∞ ) DWML Spring 2008 7 / 35
Preprocessing Data Transformation Some data mining tools tends to give variables with a large range a higher significance than variables with a smaller range. For example, • Age versus income. DWML Spring 2008 8 / 35
Preprocessing Data Transformation Some data mining tools tends to give variables with a large range a higher significance than variables with a smaller range. For example, • Age versus income. The typical approach is to standardize the scales: 1 0.8 Min-Max Normalization: normalized values 0.6 X − min ( X ) X ∗ = max ( X ) − min ( X ) . 0.4 0.2 A1 A2 0 -20 0 20 40 60 80 100 120 original values DWML Spring 2008 8 / 35
Recommend
More recommend