data warehousing and machine learning
play

Data Warehousing and Machine Learning Preprocessing Thomas D. - PowerPoint PPT Presentation

Data Warehousing and Machine Learning Preprocessing Thomas D. Nielsen Aalborg University Department of Computer Science Spring 2008 DWML Spring 2008 1 / 35 Preprocessing Before you can start on the actual data mining, the data may require


  1. Data Warehousing and Machine Learning Preprocessing Thomas D. Nielsen Aalborg University Department of Computer Science Spring 2008 DWML Spring 2008 1 / 35

  2. Preprocessing Before you can start on the actual data mining, the data may require some preprocessing: • Attributes may be redundant. • Values may be missing. • The data contains outliers. • The data is not in a suitable format. • The values appear inconsistent. Garbage in, garbage out DWML Spring 2008 2 / 35

  3. Preprocessing Data Cleaning ID Zip Gander Income Age Marital status Transaction amount 1001 10048 M 75000 C M 5000 1002 J2S7K7 F -40000 40 W 4000 1003 90210 10000000 45 S 7000 1004 6269 M 50000 0 S 1000 1005 55101 F 99999 30 D 3000 DWML Spring 2008 3 / 35

  4. Preprocessing Data Cleaning ID Zip Gander Income Age Marital status Transaction amount 1001 10048 M 75000 C M 5000 1002 J2S7K7 F -40000 40 W 4000 1003 90210 10000000 45 S 7000 1004 6269 M 50000 0 S 1000 1005 55101 F 99999 30 D 3000 – Correct zip code? DWML Spring 2008 3 / 35

  5. Preprocessing Data Cleaning ID Zip Gander Income Age Marital status Transaction amount 1001 10048 M 75000 C M 5000 1002 J2S7K7 F -40000 40 W 4000 1003 90210 10000000 45 S 7000 1004 6269 M 50000 0 S 1000 1005 55101 F 99999 30 D 3000 – Correct zip code? DWML Spring 2008 3 / 35

  6. Preprocessing Data Cleaning ID Zip Gander Income Age Marital status Transaction amount 1001 10048 M 75000 C M 5000 1002 J2S7K7 F -40000 40 W 4000 1003 90210 ?? 10000000 45 S 7000 1004 6269 M 50000 0 S 1000 1005 55101 F 99999 30 D 3000 – Missing value! DWML Spring 2008 3 / 35

  7. Preprocessing Data Cleaning ID Zip Gander Income Age Marital status Transaction amount 1001 10048 M 75000 C M 5000 1002 J2S7K7 F -40000 40 W 4000 1003 90210 10000000 45 S 7000 1004 6269 M 50000 0 S 1000 1005 55101 F 99999 30 D 3000 – Error/outlier! DWML Spring 2008 3 / 35

  8. Preprocessing Data Cleaning ID Zip Gander Income Age Marital status Transaction amount 1001 10048 M 75000 C M 5000 1002 J2S7K7 F -40000 40 W 4000 1003 90210 10000000 45 S 7000 1004 6269 M 50000 0 S 1000 1005 55101 F 99999 30 D 3000 – Error! DWML Spring 2008 3 / 35

  9. Preprocessing Data Cleaning ID Zip Gander Income Age Marital status Transaction amount 1001 10048 M 75000 C M 5000 1002 J2S7K7 F -40000 40 W 4000 1003 90210 10000000 45 S 7000 1004 6269 M 50000 0 S 1000 1005 55101 F 99999 30 D 3000 – Unexpected precision. DWML Spring 2008 3 / 35

  10. Preprocessing Data Cleaning ID Zip Gander Income Age Marital status Transaction amount 1001 10048 M 75000 C M 5000 1002 J2S7K7 F -40000 40 W 4000 1003 90210 10000000 45 S 7000 1004 6269 M 50000 0 S 1000 1005 55101 F 99999 30 D 3000 – Categorical value? DWML Spring 2008 3 / 35

  11. Preprocessing Data Cleaning ID Zip Gander Income Age Marital status Transaction amount 1001 10048 M 75000 C M 5000 1002 J2S7K7 F -40000 40 W 4000 1003 90210 10000000 45 S 7000 1004 6269 M 50000 0 S 1000 1005 55101 F 99999 30 D 3000 – Error/missing value? DWML Spring 2008 3 / 35

  12. Preprocessing Data Cleaning ID Zip Gander Income Age Marital status Transaction amount 1001 10048 M 75000 C M 5000 1002 J2S7K7 F -40000 40 W 4000 1003 90210 10000000 45 S 7000 1004 6269 M 50000 0 S 1000 1005 55101 F 99999 30 D 3000 Other issues: • What are the semantics of the marital status? DWML Spring 2008 3 / 35

  13. Preprocessing Data Cleaning ID Zip Gander Income Age Marital status Transaction amount 1001 10048 M 75000 C M 5000 1002 J2S7K7 F -40000 40 W 4000 1003 90210 10000000 45 S 7000 1004 6269 M 50000 0 S 1000 1005 55101 F 99999 30 D 3000 Other issues: • What are the semantics of the marital status? • What is the unit of measure for the transaction field? DWML Spring 2008 3 / 35

  14. Preprocessing Missing Values In many real world data bases you will be faced with the problem of missing data: Id. Savings Assets Income Credit Risk ($ 1000s) 1 Medium High 75 Good 2 Low Low 50 Bad 3 25 Bad 4 Medium Medium Good 5 Low Medium 100 Good 6 High High 25 Good 7 Low 25 Bad 8 Medium Medium 75 Good By simply discarding the records with missing data we might unintentionally bias the data. DWML Spring 2008 4 / 35

  15. Preprocessing Missing Values Possible strategies for handling missing data: • Use a predefined constant. • Use the mean (for numerical variables) or the mode (for categorical values). • Use a value drawn randomly form the observed distribution. Id. Savings Assets Income Credit Risk ($ 1000s) 1 Medium High 75 Good 2 Low Low 50 Bad 3 25 Bad 4 Medium Medium Good 5 Low Medium 100 Good 6 High High 25 Good 7 Low 25 Bad 8 Medium Medium 75 Good DWML Spring 2008 5 / 35

  16. Preprocessing Missing Values Possible strategies for handling missing data: • Use a predefined constant. • Use the mean (for numerical variables) or the mode (for categorical values). • Use a value drawn randomly form the observed distribution. Id. Savings Assets Income Credit Risk ($ 1000s) 1 Medium High 75 Good 2 Low Low 50 Bad 3 Low 25 Bad 4 Medium Medium Good 5 Low Medium 100 Good 6 High High 25 Good 7 Low 25 Bad 8 Medium Medium 75 Good Both Low and Medium are ’modes’ for savings. DWML Spring 2008 5 / 35

  17. Preprocessing Missing Values Possible strategies for handling missing data: • Use a predefined constant. • Use the mean (for numerical variables) or the mode (for categorical values). • Use a value drawn randomly form the observed distribution. Id. Savings Assets Income Credit Risk ($ 1000s) 1 Medium High 75 Good 2 Low Low 50 Bad 3 Low High 25 Bad 4 Medium Medium Good 5 Low Medium 100 Good 6 High High 25 Good 7 Low Medium 25 Bad 8 Medium Medium 75 Good High and Medium are drawn randomly from the observed distribution for Assets. DWML Spring 2008 5 / 35

  18. Preprocessing Missing Values Possible strategies for handling missing data: • Use a predefined constant. • Use the mean (for numerical variables) or the mode (for categorical values). • Use a value drawn randomly form the observed distribution. Id. Savings Assets Income Credit Risk ($ 1000s) 1 Medium High 75 Good 2 Low Low 50 Bad 3 Low High 25 Bad 4 Medium Medium 54 Good 5 Low Medium 100 Good 6 High High 25 Good 7 Low Medium 25 Bad 8 Medium Medium 75 Good 54 ≈ 75 + 50 + 25 + 100 + 25 + 25 + 75 . 7 DWML Spring 2008 5 / 35

  19. Preprocessing Discretization Some data mining algorithms can only handle discrete attributes. Possible solution: Divide the continuous range into intervals. Example: ( Income , Risk ) = � ( 25 , B ) , ( 25 , B ) , ( 50 , G ) , ( 51 , B ) , ( 54 , G ) , ( 75 , G ) , ( 75 , G )( 100 , G ) , ( 100 , G ) � Unsupervised discretization Equal width binning (width 25): Bin 1: 25, 25 [ 25 , 50 ) Bin 2: 50, 51, 54 [ 50 , 75 ) Bin 3: 75, 75, 100, 100 [ 75 , 100 ] Equal frequency binning (bin density 3): Bin 1: 25, 25, 50 [ 25 , 50 . 5 ) Bin 2: 51, 54, 75, 75 [ 50 . 5 , 87 . 5 ) Bin 3: 100, 100 [ 87 . 5 , 100 ] DWML Spring 2008 6 / 35

  20. Preprocessing Supervised discretization Take the class distribution into account when selecting the intervals. For example, recursively bisect the interval by selecting the split point giving the highest information gain: » | S ≤ v | | S | Ent ( S ≤ v ) + | S > v | – Gain ( S , v ) = Ent ( S ) − | S | Ent ( S > v ) Until some stopping criteria is met. ( Income , Risk ) = � ( 25 , B ) , ( 25 , B ) , ( 50 , G ) , ( 51 , B ) , ( 54 , G ) , ( 75 , G ) , ( 75 , G )( 100 , G ) , ( 100 , G ) � „ 3 3 9 + 6 6 « Ent ( S ) = − 9 log 2 9 log 2 = 0 . 9183 9 Split E-Ent Interval 25 0 . 4602 ( −∞ , 25 ] , ( 25 , ∞ ) 50 0 . 7395 ( −∞ , 50 ] , ( 50 , ∞ ) 51 0 . 3606 ( −∞ , 51 ] , ( 51 , ∞ ) 54 0 . 5394 ( −∞ , 54 ] , ( 54 , ∞ ) 75 0 . 7663 ( −∞ , 75 ] , ( 75 , ∞ ) DWML Spring 2008 7 / 35

  21. Preprocessing Data Transformation Some data mining tools tends to give variables with a large range a higher significance than variables with a smaller range. For example, • Age versus income. DWML Spring 2008 8 / 35

  22. Preprocessing Data Transformation Some data mining tools tends to give variables with a large range a higher significance than variables with a smaller range. For example, • Age versus income. The typical approach is to standardize the scales: 1 0.8 Min-Max Normalization: normalized values 0.6 X − min ( X ) X ∗ = max ( X ) − min ( X ) . 0.4 0.2 A1 A2 0 -20 0 20 40 60 80 100 120 original values DWML Spring 2008 8 / 35

Recommend


More recommend