Evaluating a Model Graphically
SUPERVISED LEARNING IN R: REGRESSION
Nina Zumel and John Mount, Win-Vector LLC
Plotting Ground Truth vs. Predictions
A well-fitting model: the x = y line (the "line of perfect prediction") runs through the center of the points.
A poorly fitting model: the points fall to one side of the x = y line, indicating systematic errors.
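A minimal sketch of such a plot, assuming a frame houseprices with an outcome column price and a model output column prediction (the names used in the later slides):

library(ggplot2)

# Scatter predictions against actual outcomes,
# with the line of perfect prediction (y = x) for reference
ggplot(houseprices, aes(x = prediction, y = price)) +
  geom_point() +
  geom_abline(color = "darkblue") +
  ggtitle("price vs. prediction")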
The Residual Plot
Residual: actual outcome - prediction
A well-fitting model: good fit, no systematic errors.
A poorly fitting model: systematic errors.
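A similar sketch for the residual plot, under the same assumptions about houseprices:

library(ggplot2)

# Residual = actual outcome - prediction
houseprices$residuals <- houseprices$price - houseprices$prediction

# A well-fitting model scatters evenly around the zero line
ggplot(houseprices, aes(x = prediction, y = residuals)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = 3) +
  ggtitle("residuals vs. prediction")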
The Gain Curve
Measures how well the model sorts the outcome.
x-axis: houses in model-sorted order (decreasing)
y-axis: fraction of total accumulated home sales
Wizard curve: the gain curve of a perfect model
Reading the Gain Curve

library(WVPlots)
GainCurvePlot(houseprices, "prediction", "price", "Home price model")

Arguments: the data frame, the prediction column, the outcome column, and the plot title.
Let's practice!
Root Mean Squared Error (RMSE)
SUPERVISED LEARNING IN R: REGRESSION
Nina Zumel and John Mount, Win-Vector LLC
What is Root Mean Squared Error (RMSE)?

RMSE = √(mean((pred − y)²))

where
pred − y: the error, or residuals vector
mean((pred − y)²): the mean value of the squared error
RMSE of the Home Sales Price Model
price: column of actual sale prices (in thousands)
prediction: column of predicted sale prices (in thousands)

# Calculate error
err <- houseprices$prediction - houseprices$price

# Square the error vector
err2 <- err^2

# Take the mean, and sqrt it
(rmse <- sqrt(mean(err2)))
58.33908

RMSE ≈ 58.3
Is the RMSE Large or Small?

# Take the mean, and sqrt it
(rmse <- sqrt(mean(err2)))
58.33908

# The standard deviation of the outcome
(sdtemp <- sd(houseprices$price))
135.2694

RMSE ≈ 58.3
sd(price) ≈ 135
The RMSE is well under the standard deviation of the outcome, so the model predicts price better than simply guessing the average.
Let's practice!
R-Squared (R²)
SUPERVISED LEARNING IN R: REGRESSION
Nina Zumel and John Mount, Win-Vector LLC
What is R²?
A measure of how well the model fits or explains the data.
A value between 0 and 1:
near 1: model fits well
near 0: no better than guessing the average value
Calculating R²
R² is the variance explained by the model.

R² = 1 − RSS/SS_Tot

where
RSS = Σ(y − prediction)²: the residual sum of squares (variance of the data around the model)
SS_Tot = Σ(y − ȳ)²: the total sum of squares (variance of the data)
Calculate R² of the House Price Model: RSS
price: column of actual sale prices (in thousands)
prediction: column of predicted sale prices (in thousands)

# Calculate error
err <- houseprices$prediction - houseprices$price

# Square it and take the sum
rss <- sum(err^2)

RSS ≈ 136138
Calculate R² of the House Price Model: SS_Tot

# Take the difference of prices from the mean price
toterr <- houseprices$price - mean(houseprices$price)

# Square it and take the sum
sstot <- sum(toterr^2)

RSS ≈ 136138
SS_Tot ≈ 713615
Calculate R² of the House Price Model

(r_squared <- 1 - (rss/sstot))
0.8092278

RSS ≈ 136138
SS_Tot ≈ 713615
R² ≈ 0.809
Reading R² from the lm() model

# From summary()
summary(hmodel)
...
Residual standard error: 60.66 on 37 degrees of freedom
Multiple R-squared: 0.8092, Adjusted R-squared: 0.7989
F-statistic: 78.47 on 2 and 37 DF, p-value: 4.893e-14

summary(hmodel)$r.squared
0.8092278

# From glance() (broom package)
library(broom)
glance(hmodel)$r.squared
0.8092278
Correlation and R²

(rho <- cor(houseprices$prediction, houseprices$price))
0.8995709
rho^2
0.8092278

ρ = cor(prediction, price) = 0.8995709
ρ² = 0.8092278 = R²
Correlation and R²
R² = ρ² holds for models that minimize squared error:
Linear regression
GAM regression
Tree-based algorithms that minimize squared error
True on the training data; NOT true for future application data.
Let's practice!
Properly Training a Model
SUPERVISED LEARNING IN R: REGRESSION
Nina Zumel and John Mount, Win-Vector LLC
Models can perform much better on training data than they do on future data.
Example: training R² of 0.9 but test R² of 0.15 indicates overfitting.
Test/Train Split
Recommended method when data is plentiful.
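A minimal sketch of a random test/train split; the frame unemployment and the 75/25 split fraction are illustrative assumptions:

# Assign each row a uniform random number, then split on it
N <- nrow(unemployment)
gp <- runif(N)
train <- unemployment[gp < 0.75, ]   # about 75% of the rows
test  <- unemployment[gp >= 0.75, ]  # the remaining 25%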
Example: Model Female Unemployment
Train on 66 rows, test on 30 rows.
Model Performance: Train vs. Test
Training: RMSE 0.71, R² 0.8
Test: RMSE 0.93, R² 0.75
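A sketch of how such numbers could be produced, reusing the RMSE and R² calculations from the earlier chapters; the formula and column names are assumptions for illustration:

# Helper functions built from the earlier definitions
rmse      <- function(pred, y) sqrt(mean((pred - y)^2))
r_squared <- function(pred, y) 1 - sum((y - pred)^2) / sum((y - mean(y))^2)

fmla  <- female_unemployment ~ male_unemployment  # assumed formula
model <- lm(fmla, data = train)

pred_train <- predict(model, newdata = train)
pred_test  <- predict(model, newdata = test)

rmse(pred_train, train$female_unemployment)       # training RMSE
rmse(pred_test,  test$female_unemployment)        # test RMSE
r_squared(pred_train, train$female_unemployment)  # training R-squared
r_squared(pred_test,  test$female_unemployment)   # test R-squared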
Cross-Validation
Preferred when the data is not large enough to split off a test set.
[Figures: the cross-validation procedure, training on all but one fold and predicting on the held-out fold in turn]
Create a cross-validation plan

library(vtreat)
splitPlan <- kWayCrossValidation(nRows, nSplits, NULL, NULL)

nRows: number of rows in the training data
nSplits: number of folds (partitions) in the cross-validation
e.g., nSplits = 3 for 3-way cross-validation
The remaining two arguments are not needed here.
Create a cross-validation plan

library(vtreat)
splitPlan <- kWayCrossValidation(10, 3, NULL, NULL)

# First fold (A and B to train, C to test)
splitPlan[[1]]
$train
1 2 4 5 7 9 10
$app
3 6 8

# Train on A and B, test on C, etc...
split <- splitPlan[[1]]
model <- lm(fmla, data = df[split$train, ])
df$pred.cv[split$app] <- predict(model, newdata = df[split$app, ])
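The slide shows only the first fold; a sketch of the full loop over all three folds, assuming df, fmla, and splitPlan as above:

# Initialize a column for the out-of-fold predictions
df$pred.cv <- 0

# For each fold: fit on the training rows, predict on the held-out rows
nSplits <- 3
for (i in 1:nSplits) {
  split <- splitPlan[[i]]
  model <- lm(fmla, data = df[split$train, ])
  df$pred.cv[split$app] <- predict(model, newdata = df[split$app, ])
}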
Final Model
Cross-validation estimates how well the modeling process performs; the final model is then trained on all of the available data.
Example: Unemployment Model

Measure type        RMSE        R²
train               0.7082675   0.8029275
test                0.9349416   0.7451896
cross-validation    0.8175714   0.7635331
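The cross-validation row of this table comes from the out-of-fold predictions assembled by the loop shown earlier; a sketch, with the outcome column name y assumed:

# Performance estimated from the out-of-fold predictions
(cv_rmse <- sqrt(mean((df$pred.cv - df$y)^2)))

# Cross-validation R-squared, using the same definition as before
(cv_rsq <- 1 - sum((df$y - df$pred.cv)^2) / sum((df$y - mean(df$y))^2))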
Let's practice!