
Regression: Simple and Linear Introduction to Machine Learning - PowerPoint PPT Presentation



  1. INTRODUCTION TO MACHINE LEARNING Regression: Simple and Linear

  2. Introduction to Machine Learning Regression Principle REGRESSION PREDICTORS RESPONSE

  3. Introduction to Machine Learning Example Shop data: sales, competition, district size, ... The data analyst asks: what is the relationship? ● Predictors: competition, advertisement, … ● Response: sales The shopkeeper wants: predictions!

  4. Introduction to Machine Learning Simple Linear Regression ● Simple: one predictor to model the response ● Linear: approximately linear relationship Linearity plausible? Scatterplot!

  5. Introduction to Machine Learning Example ● Relationship: advertisement → sales ● Expectation: positively correlated

  6. Introduction to Machine Learning Example ● Observation: upward linear trend ● First step: simple linear regression [Scatterplot: sales (0–500) vs. advertisement (0–15)]

  7. Introduction to Machine Learning Model Fitting a line: y = β₀ + β₁·x + ε ● Predictor: x ● Response: y ● Intercept: β₀ ● Slope: β₁ ● Statistical error: ε

  8. Introduction to Machine Learning Estimating Coefficients Residuals: eᵢ = yᵢ − ŷᵢ (true response minus fitted response) Minimize the sum of squared residuals over the n observations! [Three scatterplots: sales vs. advertisement showing the data, the fitted line, and the residuals]
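For one predictor, the least-squares problem on this slide has a closed-form solution. A minimal pure-Python sketch of that formula (the course itself uses R; the advertisement/sales numbers below are invented for illustration):

```python
def fit_simple_ols(x, y):
    """Simple linear regression: choose intercept b0 and slope b1
    that minimize the sum of squared residuals (least squares)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Slope: covariance of x and y divided by variance of x
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b1 = sxy / sxx
    # Intercept: the fitted line passes through (mean_x, mean_y)
    b0 = mean_y - b1 * mean_x
    return b0, b1

# Hypothetical data in the spirit of the shop example
ads = [2, 5, 8, 11, 14]
sales = [120, 210, 260, 350, 420]
b0, b1 = fit_simple_ols(ads, sales)
```

With these estimates, a prediction for a new value is simply b0 + b1 * x_new, mirroring what R's predict() does for a one-predictor model.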

  9. Introduction to Machine Learning Estimating Coefficients In R, the model is specified as response ~ predictor: > my_lm <- lm(sales ~ ads, data = shop_data) > my_lm$coefficients returns the estimated coefficients [Scatterplot: sales vs. advertisement with the fitted line]

  10. Introduction to Machine Learning Prediction with Regression Predicting new outcomes: plug a new predictor instance into the estimated coefficients to get an estimated response Example: ads = $11,000 → estimated sales ≈ $380,000 > y_new <- predict(my_lm, x_new, interval = "confidence") x_new must be a data frame; interval = "confidence" also provides a confidence interval

  11. Introduction to Machine Learning Accuracy: RMSE Measure of accuracy: RMSE = √( (1/n) Σᵢ (yᵢ − ŷᵢ)² ), with yᵢ the true response, ŷᵢ the estimated response, and n the number of observations Example: RMSE ≈ $76,000 Meaning? RMSE carries the unit and scale of the response, so it is difficult to interpret on its own!
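As a hedged illustration of the slide's formula (dividing by the number of observations, as shown here), RMSE in pure Python:

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error: the square root of the average
    squared residual; same unit and scale as the response."""
    sq = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(sq) / len(sq))
```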

  12. Introduction to Machine Learning Accuracy: R-squared R² = 1 − RSS/TSS, where TSS = Σᵢ (yᵢ − ȳ)² is the total sum of squares around the sample mean response ȳ Interpretation: fraction of explained variance; close to 1 means a good fit! > summary(my_lm)$r.squared Example: 0.84
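The same quantity written out in pure Python (a sketch of the formula, not of R's implementation):

```python
def r_squared(y_true, y_pred):
    """R-squared = 1 - RSS/TSS: the fraction of the variance in the
    response that the regression explains; unit-free, unlike RMSE."""
    mean_y = sum(y_true) / len(y_true)
    rss = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual SS
    tss = sum((t - mean_y) ** 2 for t in y_true)             # total SS
    return 1 - rss / tss
```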

  13. INTRODUCTION TO MACHINE LEARNING Let’s practice!

  14. INTRODUCTION TO MACHINE LEARNING Multivariable Linear Regression

  15. Introduction to Machine Learning Example Simple linear regression, one predictor at a time: > lm(sales ~ ads, data = shop_data) > lm(sales ~ comp, data = shop_data) Loss of information! [Two scatterplots: sales vs. advertisement and sales vs. nearby competition]

  16. Introduction to Machine Learning Multi-Linear Model Solution: combine the predictors in a multi-linear model! ● Higher predictive power ● Higher accuracy ● Shows the individual effect of each predictor

  17. Introduction to Machine Learning Multi-Linear Regression Model y = β₀ + β₁x₁ + … + βₚxₚ + ε ● Predictors: x₁, …, xₚ ● Response: y ● Statistical error: ε ● Coefficients: β₀, β₁, …, βₚ

  18. Introduction to Machine Learning Estimating Coefficients Residuals: eᵢ = yᵢ − ŷᵢ (true response minus fitted response) Minimize the sum of squared residuals over the n observations!

  19. Introduction to Machine Learning Extending! More predictors: total inventory, district size, … Extend the methodology to p predictors (response ~ predictors): > my_lm <- lm(sales ~ ads + comp + ..., data = shop_data)
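Under the hood, fitting with several predictors is again a least-squares problem; one classical way to solve it is via the normal equations. A self-contained pure-Python sketch, purely illustrative (R's lm() uses a more numerically robust factorization):

```python
def fit_multi_ols(X, y):
    """Multiple linear regression via the normal equations
    (X'X) b = X'y, solved with Gaussian elimination.
    X is a list of rows of predictor values; a column of 1s
    for the intercept is prepended automatically."""
    rows = [[1.0] + list(r) for r in X]  # design matrix with intercept
    p = len(rows[0])
    # Build X'X and X'y
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(p)]
    # Gaussian elimination with partial pivoting on [X'X | X'y]
    a = [xtx[i] + [xty[i]] for i in range(p)]
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(col + 1, p):
            f = a[r][col] / a[col][col]
            for c in range(col, p + 1):
                a[r][c] -= f * a[col][c]
    # Back substitution
    b = [0.0] * p
    for i in reversed(range(p)):
        b[i] = (a[i][p] - sum(a[i][j] * b[j] for j in range(i + 1, p))) / a[i][i]
    return b  # [intercept, coef_1, ..., coef_p]
```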

  20. Introduction to Machine Learning RMSE & Adjusted R-Squared More predictors → higher complexity and cost, but also lower RMSE and higher R-squared on the training data Solution: adjusted R-squared ● Penalizes additional predictors ● Used to compare models > summary(my_lm)$adj.r.squared In the example: 0.819 → 0.906
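The adjustment itself is a one-line formula; a sketch in Python, with n observations and p predictors:

```python
def adjusted_r_squared(r2, n, p):
    """Adjusted R-squared: 1 - (1 - R^2) * (n - 1) / (n - p - 1).
    Adding a useless predictor raises plain R^2 slightly, but the
    penalty term here can push the adjusted value down instead."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)
```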

  21. Introduction to Machine Learning Influence of predictors ● p-value: indicator of the influence of a parameter ● Low p-value: more likely the parameter has a significant influence
> summary(my_lm)
Call:
lm(formula = sales ~ ads + comp, data = shop_data)
Residuals:
     Min       1Q   Median       3Q      Max
-131.920  -23.009   -4.448   33.978  146.486
Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)   228.740     80.592   2.838 0.009084 **
ads            25.521      5.900   4.325 0.000231 ***
comp          -19.234      4.549  -4.228 0.000296 ***

  22. Introduction to Machine Learning Example ● Want 95% confidence: p-value ≤ 0.05 ● Want 99% confidence: p-value ≤ 0.01
Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)   228.740     80.592   2.838 0.009084 **
ads            25.521      5.900   4.325 0.000231 ***
comp          -19.234      4.549  -4.228 0.000296 ***
Note: Do not mix up R-squared with p-values!

  23. Introduction to Machine Learning Assumptions ● Just make a model, make a summary and look at p-values? ● Not that simple! ● We made some implicit assumptions

  24. Introduction to Machine Learning Verifying Assumptions Residuals: ● Independent: no pattern in the residual plot? ● Identically normal: approximately a line in the normal Q-Q plot? [Residual plot: residuals vs. estimated sales; normal Q-Q plot: residual quantiles vs. theoretical quantiles] > plot(lm_shop$fitted.values, lm_shop$residuals) draws the residual plot > qqnorm(lm_shop$residuals) draws the normal Q-Q plot

  25. Introduction to Machine Learning Verifying Assumptions [Residual plot and normal Q-Q plot, as on the previous slide] ● Important to avoid mistakes! ● Alternative tests exist

  26. INTRODUCTION TO MACHINE LEARNING Let’s practice!

  27. INTRODUCTION TO MACHINE LEARNING k-Nearest Neighbors and Generalization

  28. Introduction to Machine Learning Non-Parametric Regression Problem: visible pattern, but not linear [Scatterplot: y vs. x with a clear non-linear pattern]

  29. Introduction to Machine Learning Non-Parametric Regression Problem: visible pattern, but not linear Solutions: ● Transformation: tedious ● Advanced multi-linear regression ● Non-parametric regression: doable

  30. Introduction to Machine Learning Non-Parametric Regression Problem: visible pattern, but not linear Techniques: ● k-Nearest Neighbors ● Kernel Regression ● Regression Trees ● … No parameter estimation required!

  31. Introduction to Machine Learning k-NN: Algorithm Given a training set and a new observation: [Scatterplot: training points with the new observation marked]

  32. Introduction to Machine Learning k-NN: Algorithm Given a training set and a new observation: 1. Calculate the distance in the predictors

  33. Introduction to Machine Learning k-NN: Algorithm Given a training set and a new observation: 2. Select the k nearest (here k = 4)

  34. Introduction to Machine Learning k-NN: Algorithm Given a training set and a new observation: 3. Aggregate the response of the k nearest (here: the mean of the 4 responses)

  35. Introduction to Machine Learning k-NN: Algorithm Given a training set and a new observation: 4. The outcome is your prediction
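The four steps above fit in a few lines. A pure-Python sketch for a single predictor (the toy numbers in the test are invented):

```python
def knn_regress(x_train, y_train, x_new, k):
    """k-NN regression for one predictor:
    1. compute the distance of each training point to x_new,
    2. keep the k nearest training points,
    3./4. the mean of their responses is the prediction."""
    pairs = sorted(zip(x_train, y_train), key=lambda p: abs(p[0] - x_new))
    nearest = pairs[:k]
    return sum(y for _, y in nearest) / k
```

With several predictors, the absolute difference would be replaced by a multivariate distance such as the Euclidean distance.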

  36. Introduction to Machine Learning Choosing k ● k = 1: perfect fit on the training set but poor predictions ● k = number of observations in the training set: just the mean, also poor predictions Bias-variance trade-off! Reasonable default: k ≈ 20% of the number of observations in the training set

  37. Introduction to Machine Learning Generalization in Regression ● You built your own regression model ● It worked on the training set ● Does it generalize well? ● Two techniques: ● Hold-out: simply split the dataset ● k-fold cross-validation

  38. Introduction to Machine Learning Hold-Out Method for Regression Split the data into a training set and a test set: 1. Build the regression model on the training set 2. Predict the outcome of the test set 3. Calculate the RMSE within the test set 4. Calculate the RMSE within the training set 5. Compare test RMSE and training RMSE
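The hold-out recipe above, sketched in pure Python; the fit and predict callables are passed in so any model (linear, k-NN, …) can be evaluated. All names here are illustrative, not from the course:

```python
import math
import random

def holdout_rmse(x, y, fit, predict, test_frac=0.3, seed=42):
    """Hold-out evaluation: shuffle, split into training and test
    sets, fit on the training part, return (train RMSE, test RMSE).
    A test RMSE far above the train RMSE suggests overfitting."""
    idx = list(range(len(x)))
    random.Random(seed).shuffle(idx)          # reproducible split
    cut = int(len(idx) * (1 - test_frac))
    train, test = idx[:cut], idx[cut:]
    model = fit([x[i] for i in train], [y[i] for i in train])
    def rmse(ids):
        sq = [(y[i] - predict(model, x[i])) ** 2 for i in ids]
        return math.sqrt(sum(sq) / len(sq))
    return rmse(train), rmse(test)
```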

  39. Introduction to Machine Learning Under- and Overfitting [Three fits of the same data: underfit, good fit, overfit]
              Underfit  Good fit  Overfit
  ● Fit:         ✘         ✔        ✔
  ● Generalize:  ✔         ✔        ✘
  ● Prediction:  ✘         ✔        ✘

  40. INTRODUCTION TO MACHINE LEARNING Let’s practice!
