Data Warehousing and Machine Learning Preprocessing Thomas D. - PowerPoint PPT Presentation

Data Warehousing and Machine Learning Preprocessing Thomas D. Nielsen Aalborg University Department of Computer Science Spring 2008 DWML Spring 2008 1 / 35

Preprocessing Before you can start on the actual data mining, the data may require some preprocessing: • Attributes may be redundant. • Values may be missing. • The data contains outliers. • The data is not in a suitable format. • The values appear inconsistent. Garbage in, garbage out DWML Spring 2008 2 / 35

Preprocessing Data Cleaning ID Zip Gander Income Age Marital status Transaction amount 1001 10048 M 75000 C M 5000 1002 J2S7K7 F -40000 40 W 4000 1003 90210 10000000 45 S 7000 1004 6269 M 50000 0 S 1000 1005 55101 F 99999 30 D 3000 DWML Spring 2008 3 / 35

Preprocessing Data Cleaning ID Zip Gander Income Age Marital status Transaction amount 1001 10048 M 75000 C M 5000 1002 J2S7K7 F -40000 40 W 4000 1003 90210 10000000 45 S 7000 1004 6269 M 50000 0 S 1000 1005 55101 F 99999 30 D 3000 – Correct zip code? DWML Spring 2008 3 / 35

Preprocessing Data Cleaning ID Zip Gander Income Age Marital status Transaction amount 1001 10048 M 75000 C M 5000 1002 J2S7K7 F -40000 40 W 4000 1003 90210 ?? 10000000 45 S 7000 1004 6269 M 50000 0 S 1000 1005 55101 F 99999 30 D 3000 – Missing value! DWML Spring 2008 3 / 35

Preprocessing Data Cleaning ID Zip Gander Income Age Marital status Transaction amount 1001 10048 M 75000 C M 5000 1002 J2S7K7 F -40000 40 W 4000 1003 90210 10000000 45 S 7000 1004 6269 M 50000 0 S 1000 1005 55101 F 99999 30 D 3000 – Error/outlier! DWML Spring 2008 3 / 35

Preprocessing Data Cleaning ID Zip Gander Income Age Marital status Transaction amount 1001 10048 M 75000 C M 5000 1002 J2S7K7 F -40000 40 W 4000 1003 90210 10000000 45 S 7000 1004 6269 M 50000 0 S 1000 1005 55101 F 99999 30 D 3000 – Error! DWML Spring 2008 3 / 35

Preprocessing Data Cleaning ID Zip Gander Income Age Marital status Transaction amount 1001 10048 M 75000 C M 5000 1002 J2S7K7 F -40000 40 W 4000 1003 90210 10000000 45 S 7000 1004 6269 M 50000 0 S 1000 1005 55101 F 99999 30 D 3000 – Unexpected precision. DWML Spring 2008 3 / 35

Preprocessing Data Cleaning ID Zip Gander Income Age Marital status Transaction amount 1001 10048 M 75000 C M 5000 1002 J2S7K7 F -40000 40 W 4000 1003 90210 10000000 45 S 7000 1004 6269 M 50000 0 S 1000 1005 55101 F 99999 30 D 3000 – Categorical value? DWML Spring 2008 3 / 35

Preprocessing Data Cleaning ID Zip Gander Income Age Marital status Transaction amount 1001 10048 M 75000 C M 5000 1002 J2S7K7 F -40000 40 W 4000 1003 90210 10000000 45 S 7000 1004 6269 M 50000 0 S 1000 1005 55101 F 99999 30 D 3000 – Error/missing value? DWML Spring 2008 3 / 35

Preprocessing Data Cleaning ID Zip Gander Income Age Marital status Transaction amount 1001 10048 M 75000 C M 5000 1002 J2S7K7 F -40000 40 W 4000 1003 90210 10000000 45 S 7000 1004 6269 M 50000 0 S 1000 1005 55101 F 99999 30 D 3000 Other issues: • What are the semantics of the marital status? DWML Spring 2008 3 / 35

Preprocessing Data Cleaning ID Zip Gander Income Age Marital status Transaction amount 1001 10048 M 75000 C M 5000 1002 J2S7K7 F -40000 40 W 4000 1003 90210 10000000 45 S 7000 1004 6269 M 50000 0 S 1000 1005 55101 F 99999 30 D 3000 Other issues: • What are the semantics of the marital status? • What is the unit of measure for the transaction field? DWML Spring 2008 3 / 35

Preprocessing Missing Values In many real world data bases you will be faced with the problem of missing data: Id. Savings Assets Income Credit Risk ($ 1000s) 1 Medium High 75 Good 2 Low Low 50 Bad 3 25 Bad 4 Medium Medium Good 5 Low Medium 100 Good 6 High High 25 Good 7 Low 25 Bad 8 Medium Medium 75 Good By simply discarding the records with missing data we might unintentionally bias the data. DWML Spring 2008 4 / 35

Preprocessing Missing Values Possible strategies for handling missing data: • Use a predefined constant. • Use the mean (for numerical variables) or the mode (for categorical values). • Use a value drawn randomly form the observed distribution. Id. Savings Assets Income Credit Risk ($ 1000s) 1 Medium High 75 Good 2 Low Low 50 Bad 3 25 Bad 4 Medium Medium Good 5 Low Medium 100 Good 6 High High 25 Good 7 Low 25 Bad 8 Medium Medium 75 Good DWML Spring 2008 5 / 35

Preprocessing Missing Values Possible strategies for handling missing data: • Use a predefined constant. • Use the mean (for numerical variables) or the mode (for categorical values). • Use a value drawn randomly form the observed distribution. Id. Savings Assets Income Credit Risk ($ 1000s) 1 Medium High 75 Good 2 Low Low 50 Bad 3 Low 25 Bad 4 Medium Medium Good 5 Low Medium 100 Good 6 High High 25 Good 7 Low 25 Bad 8 Medium Medium 75 Good Both Low and Medium are ’modes’ for savings. DWML Spring 2008 5 / 35

Preprocessing Missing Values Possible strategies for handling missing data: • Use a predefined constant. • Use the mean (for numerical variables) or the mode (for categorical values). • Use a value drawn randomly form the observed distribution. Id. Savings Assets Income Credit Risk ($ 1000s) 1 Medium High 75 Good 2 Low Low 50 Bad 3 Low High 25 Bad 4 Medium Medium Good 5 Low Medium 100 Good 6 High High 25 Good 7 Low Medium 25 Bad 8 Medium Medium 75 Good High and Medium are drawn randomly from the observed distribution for Assets. DWML Spring 2008 5 / 35

Preprocessing Missing Values Possible strategies for handling missing data: • Use a predefined constant. • Use the mean (for numerical variables) or the mode (for categorical values). • Use a value drawn randomly form the observed distribution. Id. Savings Assets Income Credit Risk ($ 1000s) 1 Medium High 75 Good 2 Low Low 50 Bad 3 Low High 25 Bad 4 Medium Medium 54 Good 5 Low Medium 100 Good 6 High High 25 Good 7 Low Medium 25 Bad 8 Medium Medium 75 Good 54 ≈ 75 + 50 + 25 + 100 + 25 + 25 + 75 . 7 DWML Spring 2008 5 / 35

Preprocessing Discretization Some data mining algorithms can only handle discrete attributes. Possible solution: Divide the continuous range into intervals. Example: ( Income , Risk ) = � ( 25 , B ) , ( 25 , B ) , ( 50 , G ) , ( 51 , B ) , ( 54 , G ) , ( 75 , G ) , ( 75 , G )( 100 , G ) , ( 100 , G ) � Unsupervised discretization Equal width binning (width 25): Bin 1: 25, 25 [ 25 , 50 ) Bin 2: 50, 51, 54 [ 50 , 75 ) Bin 3: 75, 75, 100, 100 [ 75 , 100 ] Equal frequency binning (bin density 3): Bin 1: 25, 25, 50 [ 25 , 50 . 5 ) Bin 2: 51, 54, 75, 75 [ 50 . 5 , 87 . 5 ) Bin 3: 100, 100 [ 87 . 5 , 100 ] DWML Spring 2008 6 / 35

Preprocessing Supervised discretization Take the class distribution into account when selecting the intervals. For example, recursively bisect the interval by selecting the split point giving the highest information gain: » | S ≤ v | | S | Ent ( S ≤ v ) + | S > v | – Gain ( S , v ) = Ent ( S ) − | S | Ent ( S > v ) Until some stopping criteria is met. ( Income , Risk ) = � ( 25 , B ) , ( 25 , B ) , ( 50 , G ) , ( 51 , B ) , ( 54 , G ) , ( 75 , G ) , ( 75 , G )( 100 , G ) , ( 100 , G ) � „ 3 3 9 + 6 6 « Ent ( S ) = − 9 log 2 9 log 2 = 0 . 9183 9 Split E-Ent Interval 25 0 . 4602 ( −∞ , 25 ] , ( 25 , ∞ ) 50 0 . 7395 ( −∞ , 50 ] , ( 50 , ∞ ) 51 0 . 3606 ( −∞ , 51 ] , ( 51 , ∞ ) 54 0 . 5394 ( −∞ , 54 ] , ( 54 , ∞ ) 75 0 . 7663 ( −∞ , 75 ] , ( 75 , ∞ ) DWML Spring 2008 7 / 35

Preprocessing Data Transformation Some data mining tools tends to give variables with a large range a higher significance than variables with a smaller range. For example, • Age versus income. DWML Spring 2008 8 / 35

Preprocessing Data Transformation Some data mining tools tends to give variables with a large range a higher significance than variables with a smaller range. For example, • Age versus income. The typical approach is to standardize the scales: 1 0.8 Min-Max Normalization: normalized values 0.6 X − min ( X ) X ∗ = max ( X ) − min ( X ) . 0.4 0.2 A1 A2 0 -20 0 20 40 60 80 100 120 original values DWML Spring 2008 8 / 35

Data Warehousing and Machine Learning Preprocessing Thomas D. - PowerPoint PPT Presentation

Data Warehousing and Machine Learning Preprocessing Thomas D. Nielsen Aalborg University Department of Computer Science Spring 2008 DWML Spring 2008 1 / 35 Preprocessing Before you can start on the actual data mining, the data may require

Database Management Objectives of Lecture 5 Systems Data Warehousing and OLAP Data Warehousing

Data Warehousing Outline Overview of data warehousing Dimensional Modeling Online

CS520 Data Integration, Warehousing, and Provenance 6. Data Warehousing IIT DBGroup Boris

A Generic Solution for A Generic Solution for Warehousing Business Warehousing Business

Warehousing Warehousing are the activities involved in the design and operation of warehouses

Lifting your cargoes faster shorecranes up to 208 tons rhb stevedoring & warehousing Our

TAPPI Shipping, Receiving & Warehousing Workshop TAPPI Shipping, Receiving & Warehousing

4/14/20 Outline 0) Course Info CS520 1) Introduction Data Integration, Warehousing, and 2)

Introduction to Machine Learning Introduction to Machine Learning Introduction to Machine

5/5/16 Outline 0) Course Info CS520 1) Introduction Data Integration, Warehousing, and 2)

MACHINE LEARNING kernels 1 MACHINE LEARNING 2012 MACHINE LEARNING Kernels: Intuition How

Data Management and Analysis with Business Applications Data Warehousing Andrea Brunello

Data Mining Data warehousing Hamid Beigy Sharif University of Technology Fall 1396 Hamid Beigy

Quantum Machine Learning Adam Brown, HEP-AI Quantum Computing Machine Learning Quantum

MICROSOFT AZURE MACHINE LEARNING Oscar Naim Microsoft Microsoft Azure Machine Learning What is

MACHINE LEARNING Overview 1 1 APPLIED MACHINE LEARNING 2011-2012 APPLIED MACHINE LEARNING

G RAVITATIONAL C LUSTERING OF THE S ELF -O RGANIZING M AP Nejc Ilc Andrej Dobnikar University of

Supporting Information Management in Digital Libraries with Map-based Interfaces Rudolf Mayer ,

A. Morrison, G. Ross,

Pattern Analysis and Machine Intelligence Lecture Notes on Clustering (II) 2011-2012 Davide

Machine Learning in Reservoir Production Simulation and Forecast Serge A. Terekhov NeurOK

Machine Translation Research in META-NET Jan Haji Institute of Formal and Applied Linguistics

Modeling the cognitive spatio-temporal operations using associative memories and multiplicative

18.175: Lecture 11 Independent sums and large deviations Scott Sheffield MIT 1 18.175 Lecture 11

Data Warehousing and Machine Learning Preprocessing Thomas D. - PowerPoint PPT Presentation

Data Warehousing and Machine Learning Preprocessing Thomas D. Nielsen Aalborg University Department of Computer Science Spring 2008 DWML Spring 2008 1 / 35 Preprocessing Before you can start on the actual data mining, the data may require

Database Management Objectives of Lecture 5 Systems Data Warehousing and OLAP Data Warehousing

Data Warehousing Outline Overview of data warehousing Dimensional Modeling Online

CS520 Data Integration, Warehousing, and Provenance 6. Data Warehousing IIT DBGroup Boris

A Generic Solution for A Generic Solution for Warehousing Business Warehousing Business

Warehousing Warehousing are the activities involved in the design and operation of warehouses

Lifting your cargoes faster shorecranes up to 208 tons rhb stevedoring &amp; warehousing Our

TAPPI Shipping, Receiving &amp; Warehousing Workshop TAPPI Shipping, Receiving &amp; Warehousing

4/14/20 Outline 0) Course Info CS520 1) Introduction Data Integration, Warehousing, and 2)

Introduction to Machine Learning Introduction to Machine Learning Introduction to Machine

5/5/16 Outline 0) Course Info CS520 1) Introduction Data Integration, Warehousing, and 2)

MACHINE LEARNING kernels 1 MACHINE LEARNING 2012 MACHINE LEARNING Kernels: Intuition How

Data Management and Analysis with Business Applications Data Warehousing Andrea Brunello

Data Mining Data warehousing Hamid Beigy Sharif University of Technology Fall 1396 Hamid Beigy

Quantum Machine Learning Adam Brown, HEP-AI Quantum Computing Machine Learning Quantum

MICROSOFT AZURE MACHINE LEARNING Oscar Naim Microsoft Microsoft Azure Machine Learning What is

MACHINE LEARNING Overview 1 1 APPLIED MACHINE LEARNING 2011-2012 APPLIED MACHINE LEARNING

G RAVITATIONAL C LUSTERING OF THE S ELF -O RGANIZING M AP Nejc Ilc Andrej Dobnikar University of

Supporting Information Management in Digital Libraries with Map-based Interfaces Rudolf Mayer ,

A. Morrison, G. Ross,

Pattern Analysis and Machine Intelligence Lecture Notes on Clustering (II) 2011-2012 Davide

Machine Learning in Reservoir Production Simulation and Forecast Serge A. Terekhov NeurOK

Machine Translation Research in META-NET Jan Haji Institute of Formal and Applied Linguistics

Modeling the cognitive spatio-temporal operations using associative memories and multiplicative

18.175: Lecture 11 Independent sums and large deviations Scott Sheffield MIT 1 18.175 Lecture 11

Lifting your cargoes faster shorecranes up to 208 tons rhb stevedoring & warehousing Our

TAPPI Shipping, Receiving & Warehousing Workshop TAPPI Shipping, Receiving & Warehousing