Categorical inputs
SUPERVISED LEARNING IN R: REGRESSION
Nina Zumel and John Mount, Win-Vector, LLC
Example: Effect of Diet on Weight Loss

WtLoss24 ~ Diet + Age + BMI

Diet      Age  BMI    WtLoss24
Med       59   30.67  -6.7
Low-Carb  48   29.59   8.4
Low-Fat   52   32.9    6.3
Med       53   28.92   8.3
Low-Fat   47   30.20   6.3
model.matrix()

model.matrix(WtLoss24 ~ Diet + Age + BMI, data = diet)

All numerical values.
Converts a categorical variable with N levels into N - 1 indicator variables.
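A minimal sketch of what model.matrix() produces, using a small hand-built stand-in for the diet data shown on the previous slide (the full course dataset is not reproduced here):

diet <- data.frame(
  Diet     = c("Med", "Low-Carb", "Low-Fat", "Med", "Low-Fat"),
  Age      = c(59, 48, 52, 53, 47),
  BMI      = c(30.67, 29.59, 32.9, 28.92, 30.20),
  WtLoss24 = c(-6.7, 8.4, 6.3, 8.3, 6.3)
)

# Diet has 3 levels, so it becomes 2 indicator columns
# (the alphabetically first level, "Low-Carb", is the reference)
model.matrix(WtLoss24 ~ Diet + Age + BMI, data = diet)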
Indicator Variables to Represent Categories

Original Data           Model Matrix
Diet      Age  ...      (Int)  DietLow-Fat  DietMed  ...
Med       59   ...      1      0            1        ...
Low-Carb  48   ...      1      0            0        ...
Low-Fat   52   ...      1      1            0        ...
Med       53   ...      1      0            1        ...
Low-Fat   47   ...      1      1            0        ...

Reference level: "Low-Carb"
Interpreting the Indicator Variables

Linear Model:

lm(WtLoss24 ~ Diet + Age + BMI, data = diet)

Coefficients:
(Intercept)  DietLow-Fat      DietMed
   -1.37149     -2.32130     -0.97883
        Age          BMI
    0.12648      0.01262

Each indicator coefficient is the expected difference in WtLoss24 relative to the reference level: holding Age and BMI fixed, the Low-Fat diet predicts a WtLoss24 value 2.32 lower than Low-Carb, and the Med diet a value 0.98 lower.
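A quick worked check of how to read these coefficients, using only the values printed above (the diet data itself is not needed):

# Predicted 24-month weight loss for a 50-year-old with BMI 30 on the Med diet:
# intercept + DietMed + Age * 50 + BMI * 30
-1.37149 + (-0.97883) + 0.12648 * 50 + 0.01262 * 30   # about 4.35

# The same person on the reference diet (Low-Carb) simply drops the DietMed term:
-1.37149 + 0.12648 * 50 + 0.01262 * 30                # about 5.33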
Issues with one-hot-encoding

Too many levels can be a problem.
Example: ZIP code (about 40,000 codes)
Don't hash with geometric methods!
Let's practice!
Interactions
SUPERVISED LEARNING IN R: REGRESSION
Nina Zumel and John Mount, Win-Vector, LLC
Additive relationships

Example of an additive relationship:

plant_height ~ bacteria + sun

Change in height is the sum of the effects of bacteria and sunlight.
A change in sunlight causes the same change in height, independent of bacteria.
A change in bacteria causes the same change in height, independent of sunlight.
What is an Interaction?

The simultaneous influence of two variables on the outcome is not additive.

plant_height ~ bacteria + sun + bacteria:sun

Change in height is more (or less) than the sum of the effects due to sun and bacteria.
At higher levels of sunlight, a 1-unit change in bacteria causes more change in height.
What is an Interaction?

The simultaneous influence of two variables on the outcome is not additive.

plant_height ~ bacteria + sun + bacteria:sun

sun: categorical, {"sun", "shade"}
In sun, a 1-unit change in bacteria causes m units of change in height.
In shade, a 1-unit change in bacteria causes n units of change in height.
Like two separate models: one for sun, one for shade (see the sketch below).
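A small simulated sketch of this "two separate models" view. The variable names and coefficients here are made up purely for illustration; this is not the course's plant data:

set.seed(1)
n <- 200
sun      <- sample(c("sun", "shade"), n, replace = TRUE)
bacteria <- runif(n, 0, 10)
# slope 3 in sun, slope 1 in shade: a non-additive (interaction) effect
height   <- ifelse(sun == "sun", 3, 1) * bacteria + rnorm(n)
df <- data.frame(height, bacteria, sun)

coef(lm(height ~ bacteria * sun, data = df))
# "bacteria" estimates the slope in shade (the reference level, about 1);
# "bacteria:sunsun" estimates the slope difference in sun (about 2)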
Example of no Interaction: Soybean Yield

yield ~ Stress + SO2 + O3
Example of an Interaction: Alcohol Metabolism

Metabol ~ Gastric + Sex
Expressing Interactions in Formulae

Interaction - colon (:)

y ~ a:b

Main effects and interaction - asterisk (*)

y ~ a*b
# means the same as
y ~ a + b + a:b

Expressing the product of two variables - I()

y ~ I(a*b)

same as y ∝ ab
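A minimal sketch showing how these formulas expand, using toy numeric variables just to inspect the resulting columns:

d <- data.frame(a = 1:3, b = c(2, 5, 7), y = 0)

colnames(model.matrix(y ~ a:b, data = d))      # "(Intercept)" "a:b"
colnames(model.matrix(y ~ a * b, data = d))    # "(Intercept)" "a" "b" "a:b"
colnames(model.matrix(y ~ I(a * b), data = d)) # "(Intercept)" "I(a * b)"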
Finding the Correct Interaction Pattern

Formula                            RMSE (cross-validation)
Metabol ~ Gastric + Sex            1.46
Metabol ~ Gastric * Sex            1.48
Metabol ~ Gastric + Gastric:Sex    1.39
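A sketch of the kind of cross-validation behind this table, assuming a data frame named alcohol with Metabol, Gastric, and Sex columns (the slide does not show the data or the exact cross-validation code):

# simple k-fold cross-validated RMSE for one formula
cv_rmse <- function(fmla, data, k = 5) {
  fold <- sample(rep(1:k, length.out = nrow(data)))
  pred <- numeric(nrow(data))
  for (i in 1:k) {
    model <- lm(fmla, data = data[fold != i, ])
    pred[fold == i] <- predict(model, newdata = data[fold == i, ])
  }
  sqrt(mean((data$Metabol - pred)^2))
}

cv_rmse(Metabol ~ Gastric + Sex, alcohol)
cv_rmse(Metabol ~ Gastric * Sex, alcohol)
cv_rmse(Metabol ~ Gastric + Gastric:Sex, alcohol)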
Let's practice!
Transforming the response before modeling
SUPERVISED LEARNING IN R: REGRESSION
Nina Zumel and John Mount, Win-Vector, LLC
The Log Transform for Monetary Data

Monetary values: lognormally distributed
Long tail, wide dynamic range (60 - 700K)
Lognormal Distributions

mean > median (~50K vs 39K)
Predicting the mean will overpredict typical values.
Back to the Normal Distribution

After taking the log, the data is approximately normal:
mean = median (here: 4.53 vs 4.59)
More reasonable dynamic range (1.8 - 5.8)
The Procedure

1. Log the outcome and fit a model

model <- lm(log(y) ~ x, data = train)

2. Make the predictions in log space

logpred <- predict(model, newdata = test)

3. Transform the predictions back to outcome space

pred <- exp(logpred)
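A self-contained sketch of the full round trip, using simulated lognormal data in place of the course data:

set.seed(42)
x <- runif(100, 1, 10)
y <- exp(0.3 * x + rnorm(100, sd = 0.2))   # lognormally distributed outcome
train <- data.frame(x = x[1:75],   y = y[1:75])
test  <- data.frame(x = x[76:100], y = y[76:100])

model   <- lm(log(y) ~ x, data = train)    # 1. fit in log space
logpred <- predict(model, newdata = test)  # 2. predict in log space
pred    <- exp(logpred)                    # 3. back to outcome space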
Predicting Log-transformed Outcomes: Multiplicative Error

log(a) + log(b) = log(ab)
log(a) - log(b) = log(a/b)

Multiplicative error: pred/y
Relative error: (pred - y)/y = pred/y - 1

Reducing multiplicative error reduces relative error.
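A tiny numeric illustration of this point: predictions with the same multiplicative error have the same relative error and the same error in log space, regardless of the scale of y:

# two predictions with the same multiplicative error (pred/y = 1.5)
y    <- c(100, 10000)
pred <- 1.5 * y

pred / y            # multiplicative error: 1.5 1.5
(pred - y) / y      # relative error: 0.5 0.5
log(pred) - log(y)  # error in log space: both equal log(1.5)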
Root Mean Squared Relative Error

RMS-relative error = sqrt(mean(((pred - y)/y)^2))

Predicting the log-outcome reduces RMS-relative error,
but the model will often have larger RMSE.
Example: Model Income Directly

modIncome <- lm(Income ~ AFQT + Educ, data = train)

AFQT: score on a proficiency test 25 years before the survey
Educ: years of education at time of survey
Income: income at time of survey
Model Performance

test %>%
  mutate(pred = predict(modIncome, newdata = test),
         err = pred - Income) %>%
  summarize(rmse = sqrt(mean(err^2)),
            rms.relerr = sqrt(mean((err/Income)^2)))

RMSE        RMS-relative error
36,819.39   3.295189
Model log(Income)

modLogIncome <- lm(log(Income) ~ AFQT + Educ, data = train)
Model Performance

test %>%
  mutate(predlog = predict(modLogIncome, newdata = test),
         pred = exp(predlog),
         err = pred - Income) %>%
  summarize(rmse = sqrt(mean(err^2)),
            rms.relerr = sqrt(mean((err/Income)^2)))

RMSE        RMS-relative error
38,906.61   2.276865
Compare Errors

The log(Income) model has smaller RMS-relative error, but larger RMSE.

Model            RMSE        RMS-relative error
On Income        36,819.39   3.295189
On log(Income)   38,906.61   2.276865
Let's practice!
Transforming inputs before modeling
SUPERVISED LEARNING IN R: REGRESSION
Nina Zumel and John Mount, Win-Vector, LLC
Why Transform Input Variables

Domain knowledge / synthetic variables

Intelligence ~ mass.brain / mass.body^(2/3)

Pragmatic reasons

Log transform to reduce dynamic range
Log transform because meaningful changes in the variable are multiplicative
y approximately linear in f(x) rather than in x
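Such transforms can be applied directly inside the formula. A minimal sketch; the data frame names (df, animals) and their columns are hypothetical:

# log-transformed input: y is modeled as linear in log(x), not in x
model <- lm(y ~ log(x), data = df)

# a synthetic domain-knowledge variable; inside I() the arithmetic
# (including ^) is taken literally rather than as formula syntax
model <- lm(Intelligence ~ I(mass.brain / mass.body^(2/3)), data = animals)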
Example: Predicting Anxiety

[Figure: anxiety (anx) plotted against the hassles variable]
Transforming the hassles variable

[Figure: candidate transforms of the hassles variable]
Different possible fits

Which is best?

anx ~ I(hassles^2)
anx ~ I(hassles^3)
anx ~ I(hassles^2) + I(hassles^3)
anx ~ exp(hassles)
...

I(): treat an expression literally (not as an interaction)
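The I() really matters here: inside a formula, ^ means formula crossing, not exponentiation. A quick sketch, assuming the hassleframe data used on the next slide:

# Without I(), hassles^2 expands via formula rules to just hassles
# (a single variable crossed with itself), so these fit the same model:
lm(anx ~ hassles^2, data = hassleframe)
lm(anx ~ hassles, data = hassleframe)

# With I(), the square is computed numerically before fitting:
lm(anx ~ I(hassles^2), data = hassleframe)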
Compare different models

Linear, quadratic, and cubic models:

mod_lin <- lm(anx ~ hassles, hassleframe)
summary(mod_lin)$r.squared
0.5334847

mod_quad <- lm(anx ~ I(hassles^2), hassleframe)
summary(mod_quad)$r.squared
0.6241029

mod_cubic <- lm(anx ~ I(hassles^3), hassleframe)
summary(mod_cubic)$r.squared
0.6474421
Compare different models

Use cross-validation to evaluate the models.

Model                   RMSE
Linear (hassles)        7.69
Quadratic (hassles^2)   6.89
Cubic (hassles^3)       6.70
Let's practice!