
  1. Feature crossing FEATURE ENGINEERING IN R Jose Hernandez, Data Scientist, University of Washington

  2. Examples of crossing features
     Housing price prediction [Location x bedroom number]
     Human attributes [gender x height]
     FEATURE ENGINEERING IN R

  3. Crossing categorical features

     discipline_logs %>%
       select(infraction) %>%
       table()

      academic dishonesty              alcohol
                     1294                  746
       disruptive conduct failure to cooperate
                     1031                 4072
                 fighting       minor incident
                     2135                  522
               plagiarism            vandalism
                      112                   88

     discipline_logs %>%
       select(gender) %>%
       table()

     Female   Male
       3055   6945

     FEATURE ENGINEERING IN R
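
Aside (not from the slides): if you want the crossed feature as a single column rather than a set of dummies, base R's interaction() builds it directly. A minimal sketch, assuming the same discipline_logs data with the gender and infraction columns shown above:

     # Sketch: one factor column holding the gender x infraction cross
     library(dplyr)
     discipline_logs <- discipline_logs %>%
       mutate(gender_infraction = interaction(gender, infraction, sep = "_"))

     # Count observations in each crossed level
     discipline_logs %>% count(gender_infraction)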

  4. Exploring visually

     discipline_logs %>%
       group_by(infraction, gender) %>%
       summarize(n = n()) %>%
       ggplot(., aes(infraction, n, fill = gender)) +
         geom_bar(stat = "identity", position = "dodge")

     FEATURE ENGINEERING IN R

  5. [Plot: dodged bar chart of infraction counts by gender, produced by the code on the previous slide] FEATURE ENGINEERING IN R

  6. Exploring crossed features

     discipline_logs %>%
       select(gender, infraction) %>%
       table()

             infraction
     gender   academic dishonesty alcohol disruptive conduct
       Female                 393     222                330
       Male                   901     524                701
             infraction
     gender   failure to cooperate fighting minor incident
       Female                 1258      638            150
       Male                   2814     1497            372
             infraction
     gender   plagiarism vandalism
       Female         39        25
       Male           73        63

     FEATURE ENGINEERING IN R

  7. dmy <- dummyVars( ~ gender:infraction, data = discipline_logs)
     out_df <- predict(dmy, newdata = discipline_logs)
     glimpse(out_df)

     Observations: 10,000
     Variables: 16
     $ genderFemale.infractionacademic.dishonesty  <dbl> 0, 0, 0, 0, 0...
     $ genderMale.infractionacademic.dishonesty    <dbl> 0, 0, 0, 0, 0...
     $ genderFemale.infractionalcohol              <dbl> 0, 0, 0, 0, 0...
     $ genderMale.infractionalcohol                <dbl> 0, 0, 0, 0, 0...
     $ genderFemale.infractiondisruptive.conduct   <dbl> 0, 1, 0, 0, 0...
     $ genderMale.infractiondisruptive.conduct     <dbl> 1, 0, 0, 0, 0...
     $ genderFemale.infractionfailure.to.cooperate <dbl> 0, 0, 0, 0, 0...
     $ genderMale.infractionfailure.to.cooperate   <dbl> 0, 0, 1, 1, 0...

     FEATURE ENGINEERING IN R
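
Follow-on sketch (my addition, not on the slide): the matrix returned by predict() can be bound back onto the original data before modeling. This assumes the out_df object created from caret's dummyVars above:

     # Sketch: attach the crossed dummy columns to the original data
     library(caret)
     library(dplyr)
     discipline_wide <- bind_cols(discipline_logs, as.data.frame(out_df))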

  8. Things to consider
     Many categories = possibly sparse features
     Prior knowledge of what might interact is needed in some regression contexts (a formula sketch follows below)
     Be sure to explore the different methods available to determine what to cross
     FEATURE ENGINEERING IN R
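
To make the regression point concrete, here is a minimal sketch (my addition) of declaring an interaction directly in a model formula; lm() and the : and * formula operators are base R. The response 'outcome' is hypothetical, not a column from the slides:

     # gender:infraction adds only the interaction term;
     # gender*infraction expands to gender + infraction + gender:infraction
     fit <- lm(outcome ~ gender * infraction, data = discipline_logs)
     summary(fit)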

  9. It's time for you to try! FEATURE ENGINEERING IN R

  10. Principal component analysis FEATURE ENGINEERING IN R Jose Hernandez, Data Scientist, University of Washington

  11. PCA for feature engineering [diagram] FEATURE ENGINEERING IN R

  12. PCA for feature engineering [diagram] FEATURE ENGINEERING IN R

  13. PCA with 2 variables [diagram] FEATURE ENGINEERING IN R

  14. PCA with 2 variables [diagram] FEATURE ENGINEERING IN R

  15. PCA with 2 variables [diagram] FEATURE ENGINEERING IN R

  16. PCA with 2 variables [diagram] FEATURE ENGINEERING IN R

  17. Performing PCA using prcomp

     glass_x <- glass_df %>%
       select(-ID, -glass_type)

     glass_pca <- prcomp(glass_x, center = TRUE, scale. = TRUE)

     center = TRUE: mean 0
     scale. = TRUE: unit variance

     FEATURE ENGINEERING IN R
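
Side note (not part of the slide): centering and scaling inside prcomp() is equivalent to standardizing the columns first with base R's scale() and then running prcomp() on the standardized matrix. A minimal sketch:

     # Both calls below should produce the same rotation
     glass_scaled <- scale(glass_x, center = TRUE, scale = TRUE)
     pca_a <- prcomp(glass_scaled)                           # data already standardized
     pca_b <- prcomp(glass_x, center = TRUE, scale. = TRUE)  # let prcomp standardize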

  18. print(glass_pca)

     Standard deviations (1, .., p=9):
     [1] 1.58466518 1.43180731 1.18526115 1.07604017 0.95603465 0.72638502
     [7] 0.60741950 0.25269141 0.04011007

     Rotation (n x k) = (9 x 9):
               PC1         PC2           PC3         PC4          PC5
     RI -0.5451766  0.28568318 -0.0869108293 -0.14738099  0.073542700
     Na  0.2581256  0.27035007  0.3849196197 -0.49124204 -0.153683304
     Mg -0.1108810 -0.59355826 -0.0084179590 -0.37878577 -0.123509124
     Al  0.4287086  0.29521154 -0.3292371183  0.13750592 -0.014108879
     Si  0.2288364 -0.15509891  0.4587088382  0.65253771 -0.008500117
     K   0.2193440 -0.15397013 -0.6625741197  0.03853544  0.307039842
     Ca -0.4923061  0.34537980  0.0009847321  0.27644322  0.188187742
     Ba  0.2503751  0.48470218 -0.0740547309 -0.13317545 -0.251334261
     Fe -0.1858415 -0.06203879 -0.2844505524  0.23049202 -0.873264047
                PC6         PC7         PC8         PC9
     RI -0.11528772 -0.08186724 -0.75221590 -0.02573194
     Na  0.55811757 -0.14858006 -0.12769315  0.31193718
     Mg -0.30818598  0.20604537 -0.07689061  0.57727335
     Al  0.01885731  0.69923557 -0.27444105  0.19222686
     Si -0.08609797 -0.21606658 -0.37992298  0.29807321
     K   0.24363237 -0.50412141 -0.10981168  0.26050863
     Ca  0.14866937  0.09913463  0.39870468  0.57932321
     Ba -0.65721884 -0.35178255  0.14493235  0.19822820
     Fe  0.24304431 -0.07372136 -0.01627141  0.01466944

     FEATURE ENGINEERING IN R
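
Aside (not on the slide): the component scores, i.e. the rows of glass_x projected onto the rotation above, are stored in glass_pca$x, and predict() applies the same projection to new data. A short sketch; new_glass is hypothetical data with the same columns as glass_x:

     scores <- glass_pca$x          # observations in PC coordinates
     head(scores[, 1:2])            # first two components

     new_scores <- predict(glass_pca, newdata = new_glass)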

  19. It's your turn! FEATURE ENGINEERING IN R

  20. Interpreting PCA output FEATURE ENGINEERING IN R Jose Hernandez, Data Scientist, University of Washington

  21. Determining the variation explained

     summary(glass_pca)

     Importance of components:
                              PC1    PC2    PC3    PC4    PC5     PC6    PC7
     Standard deviation     1.585 1.4318 1.1853 1.0760 0.9560 0.72639 0.6074
     Proportion of Variance 0.279 0.2278 0.1561 0.1286 0.1016 0.05863 0.0410
     Cumulative Proportion  0.279 0.5068 0.6629 0.7915 0.8931 0.95173 0.9927
                                PC8     PC9
     Standard deviation     0.25269 0.04011
     Proportion of Variance 0.00709 0.00018
     Cumulative Proportion  0.99982 1.00000

     PC1 and PC2 account for 50% of the variance
     PC1 through PC6 account for 95% of the variance

     FEATURE ENGINEERING IN R
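
If you prefer to choose the number of components programmatically rather than reading the table, a minimal sketch using only the prcomp object from the slides:

     # Smallest number of components explaining at least 95% of the variance
     var_explained <- glass_pca$sdev^2 / sum(glass_pca$sdev^2)
     n_comp_95 <- which(cumsum(var_explained) >= 0.95)[1]
     n_comp_95   # 6 for the glass data summarized above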

  22. Creating a tibble for plotting

     prop_var <- tibble(sdev = glass_pca$sdev) %>%
       mutate(pca_comp = 1:n())

     prop_var <- prop_var %>%
       mutate(pcVar = sdev^2,
              propVar_ex = pcVar / sum(pcVar),
              pca_comp = as.character(pca_comp))

     FEATURE ENGINEERING IN R

  23. Plotting the results

     ggplot(prop_var, aes(pca_comp, propVar_ex, group = 1)) +
       geom_line() +
       geom_point()

     FEATURE ENGINEERING IN R

  24. Exploring the outcome labels

     autoplot(glass_pca, data = glass_df, colour = 'glass_type')

     FEATURE ENGINEERING IN R
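
Note (my addition): autoplot() for prcomp objects comes from the ggfortify package, so it must be attached first. A roughly equivalent plot can also be built by hand from the scores with plain ggplot2:

     library(ggfortify)
     autoplot(glass_pca, data = glass_df, colour = 'glass_type')

     # Hand-rolled alternative using the first two component scores
     library(ggplot2)
     scores_df <- data.frame(glass_pca$x[, 1:2], glass_type = glass_df$glass_type)
     ggplot(scores_df, aes(PC1, PC2, colour = glass_type)) +
       geom_point()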

  25. PCA considerations
     Useful when you have a lot of correlated features
     Each component is uncorrelated with the others
     PCA is good when there is a linear relationship with the response
     Kernel PCA can account for non-linearity (see the sketch below)
     FEATURE ENGINEERING IN R
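
To make the last bullet concrete, a hedged sketch of kernel PCA with kpca() from the kernlab package (my example, not from the course); the RBF kernel and sigma value are illustrative choices, not recommendations:

     # Kernel PCA on the standardized glass features
     library(kernlab)
     glass_kpca <- kpca(as.matrix(scale(glass_x)),
                        kernel   = "rbfdot",
                        kpar     = list(sigma = 0.1),
                        features = 2)
     head(rotated(glass_kpca))   # observations in the two kernel components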

  26. Give it a try! FEATURE ENGINEERING IN R

  27. Wrap-up FEATURE ENGINEERING IN R Jose M Hernandez, Data Scientist, University of Washington

  28. Course wrap-up: Chapter 1
     Categorical data
     Numerical representations (0, 1)
     FEATURE ENGINEERING IN R

  29. Course wrap-up: Chapter 2
     Numerical features and how they are used
     Bucketing / binning
     Date stamps to features
     FEATURE ENGINEERING IN R

  30. Course wrap-up: Chapter 3
     Box-Cox and Yeo-Johnson
     Scaling features
     Mean centering
     Z-score standardization
     FEATURE ENGINEERING IN R

  31. Course wrap-up: Chapter 4
     Crossing features for better model performance
     PCA as a useful feature engineering method
     FEATURE ENGINEERING IN R

  32. Course wrap-up: Functions we used
     tidyverse packages like dplyr and ggplot2
     caret
     FEATURE ENGINEERING IN R

  33. Course wrap-up: Extensions to feature engineering
     Feature engineering for text and images
     FEATURE ENGINEERING IN R

  34. Congratulations! FEATURE ENGINEERING IN R
