Transforming new features
Feature Engineering in R
Jose Hernandez, Data Scientist, University of Washington

Addressing skewed variables

    ggplot(online_retail, aes(x = Quantity)) +
      geom_density()

Power transformations in statistics

Using power transformations

    ggplot(online_retail, aes(x = Quantity)) +
      geom_density()

Box-Cox transformations

    ggplot(transformed, aes(x = Quantity)) +
      geom_density()

Yeo-Johnson transformation

    ggplot(online_retail, aes(x = Quantity)) +
      geom_histogram(stat = "count")

Yeo-Johnson transformation

    ggplot(transformed, aes(x = Quantity)) +
      geom_density()

Transforming with caret

    library(caret)   # preProcess()
    library(dplyr)   # %>% and select()

    # Transforming with caret
    retail_vars <- online_retail %>%
      select(Quantity)
    processed_vars <- preProcess(retail_vars, method = c("YeoJohnson"))

- Data frame: retail_vars
- Method: "BoxCox" or "YeoJohnson"
- Transforms all numeric variables in the data frame

    transformed <- predict(processed_vars, online_retail)

- transformed contains the transformed variables

Plotting the transformation results

    ggplot(transformed, aes(x = Quantity)) +
      geom_density()

Box-Cox vs. Yeo-Johnson

- Box-Cox: strictly positive numeric features
- Yeo-Johnson: numeric features that include zero or negative values
- Both transform variables so that they more closely fit a normal distribution

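The difference between the two transformations is easier to see when both are written out. The block below is a minimal base R sketch, not caret's implementation: the function names `box_cox()` and `yeo_johnson()`, the hand-picked `lambda`, and the `quantity` values are all made up for illustration. caret estimates lambda from the data, whereas here it is supplied by hand.

```r
# Sketch of both power transformations for a fixed, hand-picked lambda.
# Function names and data are hypothetical; caret estimates lambda itself.

box_cox <- function(y, lambda) {
  # Box-Cox is only defined for strictly positive values.
  stopifnot(all(y > 0))
  if (lambda == 0) log(y) else (y^lambda - 1) / lambda
}

yeo_johnson <- function(y, lambda) {
  out <- numeric(length(y))
  pos <- y >= 0
  # Non-negative values: Box-Cox applied to y + 1.
  if (lambda == 0) {
    out[pos] <- log(y[pos] + 1)
  } else {
    out[pos] <- ((y[pos] + 1)^lambda - 1) / lambda
  }
  # Negative values: mirrored transformation with exponent 2 - lambda.
  if (lambda == 2) {
    out[!pos] <- -log(1 - y[!pos])
  } else {
    out[!pos] <- -((1 - y[!pos])^(2 - lambda) - 1) / (2 - lambda)
  }
  out
}

quantity <- c(-5, -1, 0, 1, 3, 8, 99)   # made-up values, including negatives

bc <- box_cox(quantity[quantity > 0], lambda = 0)   # positives only
yj <- yeo_johnson(quantity, lambda = 0)             # handles every value
print(bc)
print(yj)
```

Calling `box_cox()` on the full `quantity` vector stops with an error because of the non-positive values, which is the situation where Yeo-Johnson is the option to reach for.
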
Your turn!

Normalization techniques: Scaling and Centering

Scaling to a range

- Raw range = 0 to 100
- New range = 0 to 1

Useful on variables where:

- The upper and lower bounds are known
- There are few outliers
- The data is approximately uniform across the range

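As a small worked example of the raw-to-new range mapping, the sketch below rescales a vector to any target interval. The helper name `rescale_range()` and the `raw` values are hypothetical, chosen only to mirror the 0 to 100 example.

```r
# Min-max scaling to an arbitrary target range (hypothetical helper).
rescale_range <- function(x, new_min = 0, new_max = 1) {
  old_min <- min(x, na.rm = TRUE)
  old_max <- max(x, na.rm = TRUE)
  # Map to [0, 1] first, then stretch and shift to the target range.
  (x - old_min) / (old_max - old_min) * (new_max - new_min) + new_min
}

raw <- c(0, 25, 50, 100)          # made-up values with range 0 to 100

unit_scaled <- rescale_range(raw)            # target range 0 to 1
sym_scaled  <- rescale_range(raw, -1, 1)     # target range -1 to 1
print(unit_scaled)
print(sym_scaled)
```

A single extreme value would become the new minimum or maximum and squeeze everything else into a narrow band, which is why this approach suits variables with few outliers.
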
    adult_incomes %>%
      select(age) %>%
      range()

    17 90

    adult_incomes <- adult_incomes %>%
      mutate(scaled_age = (age - min(age)) / (max(age) - min(age)))

- R finds the minimum for you with min()
- R also finds the maximum with max()
- mutate() creates your new column

    adult_incomes %>%
      select(age, scaled_age) %>%
      summary()

         age          scaled_age
    Min.   :17.00   Min.   :0.0000
    1st Qu.:28.00   1st Qu.:0.1507
    Median :37.00   Median :0.2740
    Mean   :38.58   Mean   :0.2956
    3rd Qu.:48.00   3rd Qu.:0.4247
    Max.   :90.00   Max.   :1.0000

    income_vars <- adult_incomes %>%
      select(age, educational_num)
    processed_vars <- preProcess(income_vars, method = c("range"))
    transformed <- predict(processed_vars, adult_incomes)

    transformed %>%
      select(age, educational_num) %>%
      summary()

         age          educational_num
    Min.   :0.0000   Min.   :0.0000
    1st Qu.:0.1507   1st Qu.:0.5333
    Median :0.2740   Median :0.6000
    Mean   :0.2956   Mean   :0.6054
    3rd Qu.:0.4247   3rd Qu.:0.7333
    Max.   :1.0000   Max.   :1.0000

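The preProcess() and predict() pair separates learning the scaling parameters from applying them. The base R sketch below illustrates that fit-then-apply pattern only; it is not caret's internal code, and `fit_range()`, `apply_range()`, and both data frames are hypothetical names and values introduced for the example.

```r
# Fit-then-apply range scaling (hypothetical helpers, not caret internals).

fit_range <- function(df) {
  # Learn the per-column minimum and maximum once.
  list(min = sapply(df, min), max = sapply(df, max))
}

apply_range <- function(params, df) {
  cols <- names(params$min)
  # Scale only the fitted columns; every other column is left untouched.
  df[cols] <- Map(function(x, lo, hi) (x - lo) / (hi - lo),
                  df[cols], params$min, params$max)
  df
}

# Made-up data standing in for adult_incomes.
train <- data.frame(age = c(17, 37, 90), educational_num = c(1, 10, 16))
new_rows <- data.frame(age = c(17, 53.5),
                       educational_num = c(16, 8.5),
                       workclass = c("Private", "State-gov"))

params <- fit_range(train[c("age", "educational_num")])
scaled_new <- apply_range(params, new_rows)
print(scaled_new)
```

Reusing the stored minimum and maximum keeps new data on the same scale as the data the parameters were learned from.
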
Mean centering

Coding example

    adult_incomes <- adult_incomes %>%
      mutate(mscale_age = age - mean(age))

    adult_incomes %>%
      select(age, mscale_age) %>%
      summary()

         age          mscale_age
    Min.   :17.00   Min.   :-21.582
    1st Qu.:28.00   1st Qu.:-10.582
    Median :37.00   Median : -1.582
    Mean   :38.58   Mean   :  0.000
    3rd Qu.:48.00   3rd Qu.:  9.418
    Max.   :90.00   Max.   : 51.418

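A quick way to check a centered column is to confirm that its mean is zero while the distance between its minimum and maximum is unchanged. The sketch below does this in base R on made-up ages; the vector is hypothetical, not the adult_incomes data.

```r
# Mean centering shifts the values but keeps their spread.
age <- c(17, 28, 37, 48, 90)            # made-up ages
centered_age <- age - mean(age)

print(centered_age)
print(mean(centered_age))               # centered mean
print(diff(range(age)))                 # width of the original range
print(diff(range(centered_age)))        # width after centering
```
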
Using caret and centering

    adult_incomes %>%
      select(age, hours_per_week) %>%
      summary()

         age          hours_per_week
    Min.   :17.00   Min.   : 1.00
    Median :37.00   Median :40.00
    Mean   :38.58   Mean   :40.44
    3rd Qu.:48.00   3rd Qu.:45.00
    Max.   :90.00   Max.   :99.00

    processed_vars <- preProcess(adult_incomes %>% select(age, hours_per_week),
                                 method = c("center"))

Using caret and centering

    transformed <- predict(processed_vars, adult_incomes)

    transformed %>%
      select(age, hours_per_week) %>%
      summary()

         age            hours_per_week
    Min.   :-21.582   Min.   :-39.4375
    Median : -1.582   Median : -0.4375
    Mean   :  0.000   Mean   :  0.0000
    3rd Qu.:  9.418   3rd Qu.:  4.5625
    Max.   : 51.418   Max.   : 58.5625

Normalization techniques summary

Scaling between 0 and 1:

- Well-defined upper and lower bounds
- Not a lot of outliers

Centering around the mean:

- Helpful when you have outliers

It's your turn!

Z-score standardization

Z-score standardization

Useful when:

- You have some outliers
- Measurements are on different scales of magnitude

Mean centering vs. z-score standardization

- Mean centering changes the values but not the scale of the variables
- Z-score standardization changes the scale to unit variance

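The contrast can be checked numerically by comparing standard deviations before and after each operation. This is a base R sketch on a made-up `quantity` vector, not the online_retail data.

```r
# Centering keeps the standard deviation; z-scoring rescales it to 1.
quantity <- c(1, 1, 3, 8, 99)           # made-up values with one large outlier

centered <- quantity - mean(quantity)
z_score  <- (quantity - mean(quantity)) / sd(quantity)

print(c(sd_raw      = sd(quantity),
        sd_centered = sd(centered),
        sd_z        = sd(z_score)))
```

Both versions have a mean of zero; only the z-score version is also expressed in units of standard deviations.
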
    online_retail <- online_retail %>%
      mutate(z_quantity = (Quantity - mean(Quantity)) / sd(Quantity))

- Use the mean() function and subtract it from the original variable
- Use the sd() function to calculate the standard deviation

    online_retail %>%
      select(Quantity, z_quantity) %>%
      summary()

       Quantity          z_quantity
    Min.   : 1.000   Min.   :-0.53561
    1st Qu.: 1.000   1st Qu.:-0.53561
    Median : 3.000   Median :-0.35481
    Mean   : 6.925   Mean   : 0.00000
    3rd Qu.: 8.000   3rd Qu.: 0.09717
    Max.   :99.000   Max.   : 8.32327

Standardizing multiple variables

    online_retail %>%
      select(Quantity, UnitPrice) %>%
      summary()

       Quantity          UnitPrice
    Min.   : 1.000   Min.   :  0.000
    1st Qu.: 1.000   1st Qu.:  1.250
    Median : 3.000   Median :  2.510
    Mean   : 6.925   Mean   :  4.137
    3rd Qu.: 8.000   3rd Qu.:  4.250
    Max.   :99.000   Max.   :950.990

    processed_vars <- preProcess(online_retail %>% select(Quantity, UnitPrice),
                                 method = c("center", "scale"))

- Use methods "center" and "scale"

    online_retail <- predict(processed_vars, online_retail)

    online_retail %>%
      select(Quantity, UnitPrice) %>%
      summary()

       Quantity          UnitPrice
    Min.   : 1.000   Min.   :  0.000
    1st Qu.: 1.000   1st Qu.:  1.250
    Median : 3.000   Median :  2.510
    Mean   : 6.925   Mean   :  4.137
    3rd Qu.: 8.000   3rd Qu.:  4.250
    Max.   :99.000   Max.   :950.990

Note: the summary printed on this slide is identical to the untransformed summary. After centering and scaling, each column should have a mean of 0 and a standard deviation of 1, and the Quantity column should match the z_quantity values computed by hand.

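To see what standardized columns should look like, the same arithmetic can be reproduced without caret. The sketch below uses base R's scale() on a made-up data frame; the `retail` data and its `Country` column are hypothetical stand-ins for online_retail, and scale() is a swapped-in technique, not the preProcess() call itself.

```r
# Standardizing several columns at once with base R scale().
retail <- data.frame(
  Quantity  = c(1, 1, 3, 8, 99),
  UnitPrice = c(0, 1.25, 2.51, 4.25, 950.99),
  Country   = c("UK", "UK", "France", "UK", "Spain")   # non-numeric, left as-is
)

num_cols <- c("Quantity", "UnitPrice")

standardized <- retail
# scale() centers each column on its mean and divides by its standard deviation.
standardized[num_cols] <- as.data.frame(scale(retail[num_cols]))

print(round(colMeans(standardized[num_cols]), 10))   # column means
print(sapply(standardized[num_cols], sd))            # column standard deviations
```
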
Let's get standardizing!