The intuition behind tree-based methods
Supervised Learning in R: Regression
Nina Zumel and John Mount, Win-Vector, LLC
Example: Predict animal intelligence from Gestation Time and Litter Size
Decision Trees
Rules of the form: if a AND b AND c THEN y
Model non-linear concepts:
- intervals
- non-monotonic relationships
- non-additive interactions
AND: similar to multiplication
Decision Trees
IF Litter < 1.15 AND Gestation ≥ 268 → intelligence = 0.315
IF Litter IN [1.15, 4.3) → intelligence = 0.131
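Rules like these come directly from a fitted regression tree. A minimal sketch with rpart, assuming a hypothetical data frame animals with columns intelligence, Gestation, and Litter (names taken from the example above):

library(rpart)

# Fit a regression tree: intelligence as a function of
# gestation time and litter size ("anova" = regression)
tree_model <- rpart(intelligence ~ Gestation + Litter,
                    data = animals, method = "anova")

# Print the learned if/then rules, one leaf per line
print(tree_model)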
Decision Trees
Pro: Trees Have an Expressive Concept Space

Model   RMSE
linear  0.1200419
tree    0.1072732
Decision Trees
Con: Coarse-Grained Predictions
It's Hard for Trees to Express Linear Relationships
Trees predict axis-aligned regions.
It's Hard for Trees to Express Linear Relationships
It's hard to express lines with steps.
Other Issues with Trees
- Tree with too many splits (deep tree): too complex, danger of overfit
- Tree with too few splits (shallow tree): predictions too coarse-grained
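Both issues are easy to see by fitting trees of different depths to purely linear data. A minimal illustrative sketch (the data and control settings are made up for the demonstration):

library(rpart)

# Noisy linear data
df <- data.frame(x = seq(0, 10, by = 0.1))
df$y <- 2 * df$x + rnorm(nrow(df), sd = 0.5)

# Shallow tree: few splits, coarse step-like predictions
shallow <- rpart(y ~ x, data = df,
                 control = rpart.control(maxdepth = 2))

# Deep tree: many splits, closer fit but danger of overfit
deep <- rpart(y ~ x, data = df,
              control = rpart.control(maxdepth = 10, cp = 0.001))

# Both approximate the line with axis-aligned steps
plot(df$x, df$y)
lines(df$x, predict(shallow, df), type = "s", col = "red")
lines(df$x, predict(deep, df), type = "s", col = "blue")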
Ensembles of Trees
Ensembles Give Finer-Grained Predictions than Single Trees
Ensembles of Trees
Ensemble Model Fits Animal Intelligence Data Better than Single Tree

Model          RMSE
linear         0.1200419
tree           0.1072732
random forest  0.0901681
Let's practice!
Random forests
Supervised Learning in R: Regression
Nina Zumel and John Mount, Win-Vector, LLC
Random Forests
Multiple diverse decision trees averaged together:
- Reduces overfit
- Increases model expressiveness
- Finer-grained predictions
Building a Random Forest Model
1. Draw a bootstrapped sample from the training data.
2. For each sample, grow a tree:
   - At each node, pick the best variable to split on (from a random subset of all variables).
   - Continue until the tree is grown.
3. To score a datum, evaluate it with all the trees and average the results.
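To make the procedure concrete, here is a minimal hand-rolled sketch using rpart trees (illustrative only, not how ranger works internally; note rpart considers all variables at each split, while a true random forest also restricts each split to a random subset of variables):

library(rpart)

# Grow ntree trees, each on a bootstrapped sample of the training data
simple_forest <- function(fmla, data, ntree = 50) {
  lapply(seq_len(ntree), function(i) {
    boot <- data[sample(nrow(data), replace = TRUE), ]
    rpart(fmla, data = boot)
  })
}

# Score a datum: evaluate it with all the trees, average the results
predict_forest <- function(trees, newdata) {
  preds <- sapply(trees, predict, newdata = newdata)
  rowMeans(preds)
}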
Example: Bike Rental Data

cnt ~ hr + holiday + workingday +
  weathersit + temp + atemp + hum + windspeed
Random Forests with ranger()

model <- ranger(fmla, bikesJan,
                num.trees = 500,
                respect.unordered.factors = "order")

- formula, data
- num.trees (default 500) - use at least 200
- mtry - number of variables to try at each node
  - default: square root of the total number of variables
- respect.unordered.factors - recommended: set to "order"
  - "safe" hashing of categorical variables
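For reproducible runs you can also pass mtry and a seed explicitly; a hedged variant (the values here are illustrative, not the course's settings):

library(ranger)

model <- ranger(fmla, bikesJan,
                num.trees = 500,
                mtry = 3,                              # variables tried per split
                respect.unordered.factors = "order",
                seed = 423)                            # ranger's seed argument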
Random Forests with ranger()

model

Ranger result
...
OOB prediction error (MSE): 3103.623
R squared (OOB): 0.7837386

The random forest algorithm returns estimates of out-of-sample performance (the out-of-bag, or OOB, error).
Predicting with a ranger() model

bikesFeb$pred <- predict(model, bikesFeb)$predictions

predict() inputs:
- model
- data
Predictions can be accessed in the element predictions.
Evaluating the model

Calculate RMSE:

bikesFeb %>%
  mutate(residual = pred - cnt) %>%
  summarize(rmse = sqrt(mean(residual^2)))

      rmse
1 67.15169

Model           RMSE
Quasipoisson    69.3
Random forests  67.15
Evaluating the model
(plots)
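The original slides show evaluation plots at this point. A minimal sketch of one such plot, assuming ggplot2 and the bikesFeb frame with the pred and cnt columns created above:

library(ggplot2)

# Predictions vs. actual hourly counts; points near the line
# y = x indicate accurate predictions
ggplot(bikesFeb, aes(x = pred, y = cnt)) +
  geom_point() +
  geom_abline(color = "darkblue") +
  labs(x = "predicted bike rentals", y = "actual bike rentals")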
Let's practice!
One-Hot-Encoding Categorical Variables
Supervised Learning in R: Regression
Nina Zumel and John Mount, Win-Vector, LLC
Why Convert Categoricals Manually?
- Most R functions manage the conversion for you (via model.matrix())
- xgboost() does not
  - Must convert categorical variables to a numeric representation
  - Conversion to indicators: one-hot encoding
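A quick illustration of what the conversion produces, using base R on a hypothetical toy frame:

# A toy frame with one categorical and one numeric input
df <- data.frame(x = c("one", "two", "three"), u = c(44, 24, 66))

# model.matrix() expands factors into indicator (one-hot) columns;
# by default the first level is dropped as the reference level
model.matrix(~ x + u, data = df)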
One-hot-encoding and data cleaning with vtreat

Basic idea:
- designTreatmentsZ() to design a treatment plan from the training data
- then prepare() to create "clean" data:
  - all numerical
  - no missing values
- use prepare() with the treatment plan for all future data
A Small vtreat Example

Training Data:
  x      u          y
  one   44  0.4855671
  two   24  1.3683726
  three 66  2.0352837
  two   22  1.6396267

Test Data:
  x      u          y
  one    5  2.6488148
  three 12  1.5012938
  one   56  0.1993731
  two   28  1.2778516
Create the Treatment Plan

vars <- c("x", "u")
treatplan <- designTreatmentsZ(dframe, vars, verbose = FALSE)

Inputs to designTreatmentsZ():
- dframe: training data
- varlist: list of input variable names
- set verbose = FALSE to suppress progress messages
Get the New Variables

The scoreFrame describes the variable mapping and types:

(scoreFrame <- treatplan$scoreFrame %>%
   select(varName, origName, code))

        varName origName  code
1   x_lev_x.one        x   lev
2 x_lev_x.three        x   lev
3   x_lev_x.two        x   lev
4        x_catP        x  catP
5       u_clean        u clean

Get the names of the new lev and clean variables:

(newvars <- scoreFrame %>%
   filter(code %in% c("clean", "lev")) %>%
   use_series(varName))

"x_lev_x.one" "x_lev_x.three" "x_lev_x.two" "u_clean"
Prepare the Training Data for Modeling

training.treat <- prepare(treatplan, dframe, varRestriction = newvars)

Inputs to prepare():
- treatmentplan: treatment plan
- dframe: data frame
- varRestriction: list of variables to prepare (optional)
  - default: prepare all variables
Before and After Data Treatment

Training Data:
  x      u          y
  one   44  0.4855671
  two   24  1.3683726
  three 66  2.0352837
  two   22  1.6396267

Treated Training Data:
  x_lev_x.one x_lev_x.three x_lev_x.two u_clean
1           1             0           0      44
2           0             0           1      24
3           0             1           0      66
4           0             0           1      22
Prepare the Test Data Before Model Application

(test.treat <- prepare(treatplan, test, varRestriction = newvars))

  x_lev_x.one x_lev_x.three x_lev_x.two u_clean
1           1             0           0       5
2           0             1           0      12
3           1             0           0      56
4           0             0           1      28
vtreat Treatment is Robust

Previously unseen x level: four
- four encodes to (0, 0, 0)

prepare(treatplan, toomany, ...)

New data (toomany):
  x      u          y
  one    4  0.2331301
  two   14  1.9331760
  three 66  3.1251029
  four  25  4.0332491

Treated data:
  x_lev_x.one x_lev_x.three x_lev_x.two u_clean
1           1             0           0       4
2           0             0           1      14
3           0             1           0      66
4           0             0           0      25
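Putting the steps together, a self-contained sketch of the workflow on the toy data above (y values abbreviated; dplyr supplies filter(), magrittr supplies %>% and use_series()):

library(vtreat)
library(dplyr)
library(magrittr)

dframe <- data.frame(x = c("one", "two", "three", "two"),
                     u = c(44, 24, 66, 22),
                     y = c(0.486, 1.368, 2.035, 1.640))

# Design the treatment plan from the training data
treatplan <- designTreatmentsZ(dframe, c("x", "u"), verbose = FALSE)

# Keep the indicator (lev) and cleaned numeric (clean) variables
newvars <- treatplan$scoreFrame %>%
  filter(code %in% c("clean", "lev")) %>%
  use_series(varName)

# Apply the same plan to new data with an unseen level "four"
toomany <- data.frame(x = c("one", "two", "three", "four"),
                      u = c(4, 14, 66, 25))
prepare(treatplan, toomany, varRestriction = newvars)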
Let's practice!
Gradient boosting machines
Supervised Learning in R: Regression
Nina Zumel and John Mount, Win-Vector, LLC
How Gradient Boosting Works
1. Fit a shallow tree T_1 to the data: M_1 = T_1
How Gradient Boosting Works
1. Fit a shallow tree T_1 to the data: M_1 = T_1
2. Fit a tree T_2 to the residuals. Find γ such that M_2 = M_1 + γT_2 is the best fit to the data.
How Gradient Boosting Works
Regularization: learning rate η ∈ (0, 1)

M_2 = M_1 + ηγT_2

- Larger η: faster learning
- Smaller η: less risk of overfit
How Gradient Boosting Works
1. Fit a shallow tree T_1 to the data: M_1 = T_1
2. Fit a tree T_2 to the residuals: M_2 = M_1 + ηγ_2 T_2
3. Repeat (2) until the stopping condition is met.

Final Model: M = M_1 + η ∑ γ_i T_i
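To make the loop concrete, a minimal hand-rolled sketch of this procedure for squared-error regression with rpart trees (it starts from the mean rather than a first tree and fixes each γ_i = 1, so only η scales the trees; all names and settings are illustrative):

library(rpart)

boost <- function(fmla, data, yvar, eta = 0.1, nrounds = 50) {
  # Start from a trivial model: predict the mean
  pred <- rep(mean(data[[yvar]]), nrow(data))
  trees <- list()
  for (i in seq_len(nrounds)) {
    # Fit a shallow tree to the current residuals
    data$resid <- data[[yvar]] - pred
    tree <- rpart(update(fmla, resid ~ .), data = data,
                  control = rpart.control(maxdepth = 2))
    # Add a damped (eta-scaled) copy of the tree's predictions
    pred <- pred + eta * predict(tree, data)
    trees[[i]] <- tree
  }
  list(trees = trees, pred = pred)
}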
Cross-Validation to Guard Against Overfit
Training error keeps decreasing, but test error doesn't.
Best Practice (with xgboost())
1. Run xgb.cv() with a large number of rounds (trees).
Best Practice (with xgboost())
1. Run xgb.cv() with a large number of rounds (trees).
2. xgb.cv()$evaluation_log records estimated RMSE for each round. Find the number of trees that minimizes estimated RMSE: n_best
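A minimal sketch of this procedure (the input matrix, label vector, and parameter values are illustrative; test_rmse_mean is the column xgboost writes to the evaluation log for the RMSE metric):

library(xgboost)

# 1. Run xgb.cv() with a large number of rounds
cv <- xgb.cv(data = as.matrix(train_input), label = train_y,
             nrounds = 100, nfold = 5,
             objective = "reg:squarederror",
             eta = 0.3, max_depth = 6, verbose = FALSE)

# 2. Find the number of trees that minimizes estimated RMSE
elog <- cv$evaluation_log
nbest <- which.min(elog$test_rmse_mean)

# 3. Fit the final model with nbest rounds
model <- xgboost(data = as.matrix(train_input), label = train_y,
                 nrounds = nbest,
                 objective = "reg:squarederror",
                 eta = 0.3, max_depth = 6, verbose = FALSE)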