Transforming new features



  1. Transforming new features. Feature Engineering in R. Jose Hernandez, Data Scientist, University of Washington

  2. Addressing skewed variables
     ggplot(online_retail, aes(x = Quantity)) + geom_density()

  3. Power transformations in statistics

  4. Using power transformations
     ggplot(online_retail, aes(x = Quantity)) + geom_density()

  5. Box-Cox transformations
     ggplot(transformed, aes(x = Quantity)) + geom_density()

  6. Yeo-Johnson transformation
     ggplot(online_retail, aes(x = Quantity)) + geom_histogram(stat = "count")

  7. Yeo-Johnson transformation
     ggplot(transformed, aes(x = Quantity)) + geom_density()

  8. Transforming with caret
     library(dplyr)   # select() and %>%
     library(caret)   # preProcess()
     retail_vars <- online_retail %>% select(Quantity)
     processed_vars <- preProcess(retail_vars, method = c("YeoJohnson"))
     Data frame: retail_vars
     Method: "BoxCox" or "YeoJohnson"
     preProcess() transforms all numeric variables in the data frame
     transformed <- predict(processed_vars, online_retail)
     transformed contains the transformed variables

  9. Plotting the transformation results
     ggplot(transformed, aes(x = Quantity)) + geom_density()

  10. Box-Cox vs. Yeo-Johnson
      Box-Cox: strictly positive numeric features
      Yeo-Johnson: numeric features with zero or negative values
      Both transform variables so that they better approximate a normal distribution
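
The two transformations compared on this slide can be written out directly. The following is a minimal base R sketch for a fixed, user-supplied lambda. The function names `box_cox()` and `yeo_johnson()` and the toy vector are illustrative assumptions, not part of the course code. caret's `preProcess()` estimates lambda from the data rather than taking it as an argument.

```r
# Box-Cox: defined only for strictly positive values.
box_cox <- function(x, lambda) {
  stopifnot(all(x > 0))
  if (lambda == 0) log(x) else (x^lambda - 1) / lambda
}

# Yeo-Johnson: also handles zero and negative values.
yeo_johnson <- function(x, lambda) {
  out <- numeric(length(x))
  pos <- x >= 0
  # Non-negative branch
  out[pos] <- if (lambda == 0) {
    log(x[pos] + 1)
  } else {
    ((x[pos] + 1)^lambda - 1) / lambda
  }
  # Negative branch
  out[!pos] <- if (lambda == 2) {
    -log(-x[!pos] + 1)
  } else {
    -((-x[!pos] + 1)^(2 - lambda) - 1) / (2 - lambda)
  }
  out
}

quantity <- c(1, 1, 3, 8, 99)                      # toy right-skewed values (made up)
bc <- box_cox(quantity, lambda = 0)                # identical to log(quantity)
yj <- yeo_johnson(c(-5, 0, quantity), lambda = 0)  # negatives and zero are allowed
```

Passing a negative value to `box_cox()` stops with an error, which is the practical reason to reach for Yeo-Johnson when a feature such as a returned-item quantity can go below zero.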

  11. Your turn!

  12. Normalization techniques: scaling and centering. Jose Hernandez, Data Scientist, University of Washington

  13. Scaling to a range
      raw range = 0 to 100
      new range = 0 to 1
      Useful when:
      The variable has known upper and lower bounds
      There are few outliers
      The data is approximately uniform across the range

  14. adult_incomes %>% select(age) %>% range()
      17 90

  15. adult_incomes <- adult_incomes %>%
        mutate(scaled_age = (age - min(age)) / (max(age) - min(age)))
      R finds the minimum for you with min()
      R also finds the maximum with max()
      mutate() creates your new column
      adult_incomes %>% select(age, scaled_age) %>% summary()
           age          scaled_age
      Min.   :17.00   Min.   :0.0000
      1st Qu.:28.00   1st Qu.:0.1507
      Median :37.00   Median :0.2740
      Mean   :38.58   Mean   :0.2956
      3rd Qu.:48.00   3rd Qu.:0.4247
      Max.   :90.00   Max.   :1.0000

  16. income_vars <- adult_incomes %>% select(age, educational_num)
      processed_vars <- preProcess(income_vars, method = c("range"))
      transformed <- predict(processed_vars, adult_incomes)
      transformed %>% select(age, educational_num) %>% summary()
           age          educational_num
      Min.   :0.0000   Min.   :0.0000
      1st Qu.:0.1507   1st Qu.:0.5333
      Median :0.2740   Median :0.6000
      Mean   :0.2956   Mean   :0.6054
      3rd Qu.:0.4247   3rd Qu.:0.7333
      Max.   :1.0000   Max.   :1.0000
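
The min-max formula from the previous slides can also be applied to several columns at once without caret. This is a small base R sketch under stated assumptions: `rescale01()` is a hypothetical helper name, and the `incomes` data frame is a made-up stand-in for `adult_incomes`.

```r
# Rescale a numeric vector to the range [0, 1].
rescale01 <- function(x) {
  rng <- range(x, na.rm = TRUE)
  (x - rng[1]) / (rng[2] - rng[1])
}

# Toy stand-in for adult_incomes; the values are made up.
incomes <- data.frame(
  age = c(17, 28, 37, 48, 90),
  educational_num = c(1, 9, 10, 12, 16)
)

# Apply the same rescaling to every column.
scaled <- as.data.frame(lapply(incomes, rescale01))
summary(scaled)
```

A column whose minimum equals its maximum would divide by zero here, so a constant feature needs separate handling before it is scaled this way.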

  17. Mean centering

  18. Coding example
      adult_incomes <- adult_incomes %>%
        mutate(mscale_age = age - mean(age))
      adult_incomes %>% select(age, mscale_age) %>% summary()
           age          mscale_age
      Min.   :17.00   Min.   :-21.582
      1st Qu.:28.00   1st Qu.:-10.582
      Median :37.00   Median : -1.582
      Mean   :38.58   Mean   :  0.000
      3rd Qu.:48.00   3rd Qu.:  9.418
      Max.   :90.00   Max.   : 51.418

  19. Using caret and centering
      adult_incomes %>% select(age, hours_per_week) %>% summary()
           age          hours_per_week
      Min.   :17.00   Min.   : 1.00
      Median :37.00   Median :40.00
      Mean   :38.58   Mean   :40.44
      3rd Qu.:48.00   3rd Qu.:45.00
      Max.   :90.00   Max.   :99.00
      processed_vars <- preProcess(adult_incomes %>% select(age, hours_per_week),
                                   method = c("center"))

  20. Using caret and centering
      transformed <- predict(processed_vars, adult_incomes)
      transformed %>% select(age, hours_per_week) %>% summary()
           age            hours_per_week
      Min.   :-21.582   Min.   :-39.4375
      Median : -1.582   Median : -0.4375
      Mean   :  0.000   Mean   :  0.0000
      3rd Qu.:  9.418   3rd Qu.:  4.5625
      Max.   : 51.418   Max.   : 58.5625

  21. Normalization techniques summary
      Scaling between 0 and 1:
      Well-defined upper and lower bounds
      Not a lot of outliers
      Centering around the mean:
      Helpful when you have outliers

  22. It's your turn!

  23. Z-score standardization. Jose Hernandez, Data Scientist, University of Washington

  24. Z-score standardization
      Useful when:
      You have some outliers
      Measurements are on different scales of magnitude

  25. Mean centering vs. z-score standardization
      Mean centering changes the values but not the scale of the variables
      Z-score standardization changes the scale to unit variance
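
The difference between the two can be checked numerically. The sketch below uses simulated, right-skewed values rather than the course data (an assumption made so the example is self-contained) and compares the mean and standard deviation after each transformation.

```r
set.seed(1)
quantity <- rexp(1000, rate = 1 / 7)   # simulated skewed values, not the course data

centered <- quantity - mean(quantity)                    # mean centering
z_score  <- (quantity - mean(quantity)) / sd(quantity)   # z-score standardization

# Centering shifts the mean to 0 but leaves the spread unchanged.
c(mean = mean(centered), sd = sd(centered), original_sd = sd(quantity))

# Standardizing shifts the mean to 0 and rescales the spread to 1.
c(mean = mean(z_score), sd = sd(z_score))
```

Neither transformation changes the shape of the distribution, so a skewed variable stays skewed after centering or standardizing.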

  26. online_retail <- online_retail %>%
        mutate(z_quantity = (Quantity - mean(Quantity)) / sd(Quantity))
      Use the mean() function and subtract the result from the original variable
      Use the sd() function to calculate the standard deviation
      online_retail %>% select(Quantity, z_quantity) %>% summary()
         Quantity         z_quantity
      Min.   : 1.000   Min.   :-0.53561
      1st Qu.: 1.000   1st Qu.:-0.53561
      Median : 3.000   Median :-0.35481
      Mean   : 6.925   Mean   : 0.00000
      3rd Qu.: 8.000   3rd Qu.: 0.09717
      Max.   :99.000   Max.   : 8.32327

  27. Standardizing multiple variables
      online_retail %>% select(Quantity, UnitPrice) %>% summary()
         Quantity         UnitPrice
      Min.   : 1.000   Min.   :  0.000
      1st Qu.: 1.000   1st Qu.:  1.250
      Median : 3.000   Median :  2.510
      Mean   : 6.925   Mean   :  4.137
      3rd Qu.: 8.000   3rd Qu.:  4.250
      Max.   :99.000   Max.   :950.990

  28. processed_vars <- preProcess(online_retail %>% select(Quantity, UnitPrice),
                                   method = c("center", "scale"))
      Use methods "center" and "scale"
      online_retail <- predict(processed_vars, online_retail)
      online_retail %>% select("Quantity", "UnitPrice") %>% summary()
      After predict(), Quantity and UnitPrice each have mean 0 and standard deviation 1; the Quantity summary matches z_quantity on slide 26
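
The `preProcess()` then `predict()` pattern separates estimating the transformation from applying it. A rough base R sketch of that idea follows. The helper names `fit_standardizer()` and `apply_standardizer()` are hypothetical and the `retail` data is made up; this illustrates the pattern and is not caret's implementation.

```r
# Step 1: estimate and store the column means and standard deviations.
fit_standardizer <- function(df) {
  list(center = sapply(df, mean), scale = sapply(df, sd))
}

# Step 2: apply the stored parameters to the matching columns of a data frame.
apply_standardizer <- function(fit, df) {
  for (col in names(fit$center)) {
    df[[col]] <- (df[[col]] - fit$center[[col]]) / fit$scale[[col]]
  }
  df
}

# Toy stand-in for online_retail; the values are made up.
retail <- data.frame(
  Quantity  = c(1, 1, 3, 8, 99),
  UnitPrice = c(0.5, 1.25, 2.51, 4.25, 12),
  Country   = c("UK", "UK", "FR", "DE", "UK")
)

fit <- fit_standardizer(retail[c("Quantity", "UnitPrice")])
standardized <- apply_standardizer(fit, retail)   # Country is left untouched
```

Keeping the fitted parameters in their own object means the same means and standard deviations can later be applied to a different data frame with the same columns.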

  29. Let's get standardizing!
