Introduction to Regression
Myra O’Regan
Myra.ORegan@tcd.ie
Room 142, Lloyd Institute

Description of module
• Practical module on regression
• Focussing on the application of multiple regression
• Software
• Lots of computer output; will use R sometimes
• 2 labs
• Some mathematics but no linear algebra

Topics to be covered
• Revision of simple linear regression
• Introduction to multiple regression
• Use of logs and other transformations
• Regression diagnostics
• Use of indicator variables
• Polynomial regression
• Building a regression model
• Dealing with multicollinearity
• Introduction to logistic regression
• Other fun techniques

Notes and Books
• I use BlackBoard
• Sheather, S. J. A Modern Approach to Regression with R. New York: Springer, 2009
• Neter, J., Wasserman, W. & Kutner, M. H. Applied Linear Models, 2nd edition. Boston: Irwin, 1989
• Kutner, M. H., Nachtsheim, C. J., Neter, J. & Li, W. Applied Linear Statistical Models, 5th edition. Boston: McGraw-Hill, 2005

Purpose of regression
• To build a model for prediction purposes
  – Price of a diamond from number of carats
  – Price of a house
  – Time to process invoices
  – Measuring the volume of wood in trees
• To look at relationships
  – Factors relating to cot death

Netflix competition
• Variables were user, movie, date of grade, grade
• Grade was measured from 1 to 5
• 100,480,507 ratings
• 480,189 users
• 17,770 movies
• Movie, title and year of release

308 diamonds: price, colour, clarity and size

Initial examination of data
• Know the story behind the data
• Understand the background
• Understand meanings of variables
• Look at each variable separately
• Check the quality of data
• Summary statistics and graphs
• How much missing data?

Revision of simple linear regression
• The manager of a purchasing department of a large company would like to predict the average amount of time it takes to process a given number of invoices. Data were collected over a sample of 30 days on the number of invoices and the time taken in hours
• Three variables: Time, Number of Invoices and Day

Statistic   Invoices   Time
N           30         30
N*          0          0
Mean        130        2.11
SE Mean     13.7       0.165
StDev       74.8       0.905
Minimum     23         0.8
Q1          60         1.425
Median      127.5      2
Q3          190.8      2.8
Maximum     289        4.1

Model to fit
• $\text{Time}_i = \alpha + \beta \cdot \text{Invoices}_i + \varepsilon_i$
• Linear model
• Need estimates of α and β
• Need SE for estimates
• We use Minitab to calculate estimates of α and β
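The estimates of α and β that the software reports can be reproduced by hand with the usual least-squares formulas. The following is a minimal sketch in Python rather than Minitab; the function name and the five data points are hypothetical and are not the invoice data from the slides.

```python
from statistics import mean

def fit_simple_regression(x, y):
    """Return least-squares estimates (alpha, beta) for y = alpha + beta * x."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    beta = sxy / sxx                 # slope
    alpha = y_bar - beta * x_bar     # intercept
    return alpha, beta

# Hypothetical data for illustration only (not the data behind the slides).
invoices = [20, 60, 100, 140, 180]
time_hours = [0.9, 1.3, 1.8, 2.2, 2.8]

alpha, beta = fit_simple_regression(invoices, time_hours)
print(f"alpha = {alpha:.3f}, beta = {beta:.5f}")
```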

What is going on here? What are the lines? More importantly, what are the differences?

Prediction vs Confidence intervals
• Confidence interval
• For a given value $x_0$, this is an interval for the average value of the dependent variable
• Point estimate $\pm\ t \cdot s \cdot \sqrt{\text{Distance value}}$
• t has n-(k+1) df, where k = number of predictors
• s = 0.330: what does this measure?
• Distance value $= \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{\sum_i (x_i - \bar{x})^2}$

Prediction vs Confidence intervals
• Prediction interval
• For a given value $x_0$, this is an interval for a particular value of the dependent variable
• Point estimate $\pm\ t \cdot s \cdot \sqrt{1 + \text{Distance value}}$
• t has n-(k+1) df, where k = number of predictors
• s = 0.330: what does this measure?
• Distance value $= \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{\sum_i (x_i - \bar{x})^2}$

Approximate intervals for reasonably large samples
• Confidence interval ≈ point estimate $\pm\ 2 \cdot s \cdot \sqrt{\frac{1}{n}}$
• Prediction interval ≈ point estimate $\pm\ 2 \cdot s \cdot \sqrt{1 + \frac{1}{n}}$
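To see how the exact and approximate half-widths compare, the sketch below plugs in the summary statistics quoted for the invoice data (n = 30, s = 0.330, mean 130, StDev 74.8). The choice x0 = 50 and the t value 2.048 (two-sided 95%, 28 df, read from a t table) are assumptions made for illustration, and the mean is rounded in the source, so the results are approximate.

```python
import math

n = 30            # sample size
s = 0.330         # residual standard deviation quoted on the slides
x_bar = 130.0     # mean number of invoices (rounded in the source)
sd_x = 74.8       # standard deviation of invoices
x0 = 50.0         # assumed value of x at which to predict
t_crit = 2.048    # assumed: 97.5th percentile of t with n - 2 = 28 df

sxx = (n - 1) * sd_x ** 2                     # sum of (x_i - x_bar)^2
distance = 1 / n + (x0 - x_bar) ** 2 / sxx    # distance value

ci_half = t_crit * s * math.sqrt(distance)        # confidence interval half-width
pi_half = t_crit * s * math.sqrt(1 + distance)    # prediction interval half-width

approx_ci_half = 2 * s * math.sqrt(1 / n)         # large-sample approximation
approx_pi_half = 2 * s * math.sqrt(1 + 1 / n)

print(f"distance value = {distance:.4f}")
print(f"CI half-width: exact {ci_half:.3f}, approx {approx_ci_half:.3f}")
print(f"PI half-width: exact {pi_half:.3f}, approx {approx_pi_half:.3f}")
```

The approximation ignores the second term of the distance value, so it is closest when x0 is near the mean of x.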

Example
• Let number of invoices = 50
• Where do these numbers come from roughly?

ANOVA table…
• Total sums of squares (SS) $= \sum_i (Y_i - \bar{Y})^2$
• Regression SS $= \sum_i (\hat{Y}_i - \bar{Y})^2$
• Error SS $= \sum_i (Y_i - \hat{Y}_i)^2$
• What is $R^2$?
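A small numerical check can make the decomposition concrete: for a least-squares fit with an intercept, Total SS = Regression SS + Error SS, and R² is the share of Total SS accounted for by the regression. This is a sketch on hypothetical data; the function name is illustrative only.

```python
from statistics import mean

def anova_sums(x, y):
    """Return (total_ss, regression_ss, error_ss) for a simple linear fit."""
    x_bar, y_bar = mean(x), mean(y)
    beta = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
            / sum((xi - x_bar) ** 2 for xi in x))
    alpha = y_bar - beta * x_bar
    fitted = [alpha + beta * xi for xi in x]
    total_ss = sum((yi - y_bar) ** 2 for yi in y)
    regression_ss = sum((fi - y_bar) ** 2 for fi in fitted)
    error_ss = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    return total_ss, regression_ss, error_ss

# Hypothetical data for illustration only.
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

total_ss, regression_ss, error_ss = anova_sums(x, y)
r_squared = regression_ss / total_ss
print(f"Total {total_ss:.3f} = Regression {regression_ss:.3f} + Error {error_ss:.3f}")
print(f"R^2 = {r_squared:.4f}")
```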

What happens if we do the following?
• Let Invoices = X
• Subtract k from each case
• What will change?
• $\text{Time} = \alpha + \beta \cdot X + \varepsilon$ (original model)
• $\text{Time} = \alpha + \beta \cdot (X - k) + \varepsilon = (\alpha - \beta k) + \beta X + \varepsilon$
• Slope does not change but intercept does
• Intercept = expected value of Time when X = k
• Normally we use k = mean of the variable

The regression equation is Time = 2.11 + 0.0113 Centered invoices
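The effect of centering can be checked numerically: the slope is unchanged, and with k equal to the mean of X the intercept becomes the mean of the response (which matches the 2.11 above being the mean of Time). A minimal sketch on hypothetical data, with an illustrative `fit` helper:

```python
from statistics import mean

def fit(x, y):
    """Least-squares (intercept, slope) for y = intercept + slope * x."""
    x_bar, y_bar = mean(x), mean(y)
    slope = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
             / sum((xi - x_bar) ** 2 for xi in x))
    return y_bar - slope * x_bar, slope

# Hypothetical data for illustration only.
x = [20, 60, 100, 140, 180]
y = [0.9, 1.3, 1.8, 2.2, 2.8]

k = mean(x)                                   # centre at the mean of X
a_raw, b_raw = fit(x, y)                      # fit on X
a_cen, b_cen = fit([xi - k for xi in x], y)   # fit on X - k

print(f"raw:      intercept {a_raw:.3f}, slope {b_raw:.5f}")
print(f"centered: intercept {a_cen:.3f}, slope {b_cen:.5f}")
```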

Trees data
• Sample of 31 black cherry trees in the Allegheny National Forest in Pennsylvania
• Volume in cubic feet
• Height in feet
• Diameter in inches, measured 54 inches above ground

Variable   Diameter   Height   Volume
N          31         31       31
N*         0          0        0
Mean       13.248     76       30.17
SE Mean    0.564      1.14     2.95
StDev      3.138      6.37     16.44
Minimum    8.3        63       10.2
Q1         11         72       19.1
Median     12.9       76       24.2
Q3         16         80       38.3
Maximum    20.6       87       77

What does the F-test mean?
• Testing a hypothesis
• Null hypothesis $H_0: \beta_1 = \beta_2 = 0$
• Alternative hypothesis $H_1$: not all β's = 0
• F = 254.97, df = (2, 28), p < 0.001
• Enough evidence against the null hypothesis

Interpretation of coefficients
• $\text{Volume} = \beta_0 + \beta_1 \cdot \text{Height} + \beta_2 \cdot \text{Diameter} + \varepsilon$
• E(Volume), or Predicted(Volume), sometimes written as $\hat{Y}$, = -58.0 + 0.339*Height + 4.71*Diameter
• Constant (-58.0) is the mean response when Height = 0 and Diameter = 0
• β1: change in mean response per unit increase in Height when Diameter is held constant (at any value)
• Similarly, β2: change in mean response per unit increase in Diameter when Height is held constant (at any value)

And a little more
• Example: let Diameter = 12
• E(Volume) = -58.0 + 0.339*Height + 4.71*12 = -1.48 + 0.339*Height
• Intercept changes but β1 stays the same
• Effect of Height on the mean response does not depend on Diameter
• We say the effects are additive, or that they do not interact
• Partial regression coefficients
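The additivity can be checked directly from the fitted equation quoted on the slides. In this sketch the function name is illustrative; the coefficients are the ones given above.

```python
def predicted_volume(height, diameter):
    """Fitted equation from the slides: E(Volume) in cubic feet."""
    return -58.0 + 0.339 * height + 4.71 * diameter

# Fixing Diameter = 12 only shifts the intercept.
print(f"intercept at Diameter = 12: {predicted_volume(0, 12):.2f}")

# The effect of one extra foot of Height is the same at every Diameter.
for diameter in (10, 12, 16):
    effect = predicted_volume(77, diameter) - predicted_volume(76, diameter)
    print(f"Diameter {diameter}: effect of +1 ft Height = {effect:.3f}")
```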

Changing coefficients
• Height by itself: 1.54 (0.38)
• Diameter by itself: 5.07 (0.25)
Multiple regression
• Height | Diameter: 0.34 (0.13)
• Diameter | Height: 4.71 (0.26)

Sums of squares
• Same calculation as before
• Sequential sums of squares, Diameter then Height
  – Diameter 7581.8
  – Height 102.4
• Sequential sums of squares, Height then Diameter
  – Height 2901.1
  – Diameter 4783.0
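The sequential sums of squares add up to the same regression SS in either order, and together with the summary statistics they are enough to reconstruct the F statistic roughly. The sketch below recovers Total SS from the rounded StDev of Volume (16.44, n = 31), so the result should land near, but not exactly on, the reported F = 254.97. Variable names are illustrative.

```python
n, k = 31, 2            # trees, predictors
sd_volume = 16.44       # StDev of Volume from the summary table (rounded)

seq_diameter_first = [7581.8, 102.4]    # Diameter, then Height
seq_height_first = [2901.1, 4783.0]     # Height, then Diameter

regression_ss = sum(seq_diameter_first)
total_ss = (n - 1) * sd_volume ** 2     # approximate, because StDev is rounded
error_ss = total_ss - regression_ss

f_stat = (regression_ss / k) / (error_ss / (n - k - 1))
r_squared = regression_ss / total_ss

print(f"Regression SS: {sum(seq_diameter_first):.1f} vs {sum(seq_height_first):.1f}")
print(f"F is roughly {f_stat:.1f} on ({k}, {n - k - 1}) df, R^2 is roughly {r_squared:.3f}")
```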

Derived variables
• Create a new x from the given x-variables
• Could be a transformation or a combination
• Use background knowledge to create the new variable
• Tree crudely modeled by a cylinder
• $\text{cylinder vol} = \pi r^2 \times \text{ht} = \frac{\pi}{4} (\text{Diam})^2 \times \text{ht}$
• $\propto \text{ht} \cdot (\text{Diam})^2$
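As a rough sketch of the derived variable, the cylinder formula can be evaluated directly. Because Diameter is recorded in inches and Height in feet, the diameter is converted to feet first so the result is in cubic feet; the function name is hypothetical.

```python
import math

def cylinder_volume_cubic_feet(diameter_in, height_ft):
    """Volume of a cylinder, (pi / 4) * Diam^2 * ht, with Diam converted to feet."""
    diameter_ft = diameter_in / 12
    return math.pi / 4 * diameter_ft ** 2 * height_ft

# Evaluate at the mean Diameter and mean Height from the summary table.
print(f"{cylinder_volume_cubic_feet(13.248, 76):.1f} cubic feet")
```

A real tree tapers, so the cylinder is only a crude model; what matters for the regression is that volume should be roughly proportional to ht * Diam².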

Plot first

Transform using logs
• $y = \log_b a$ means $b^y = a$
• $2^3 = 8$, so $\log_2 8 = 3$
• b is called the base
• Typical bases are e and 10
• We are going to use base 10
• e is a mathematical constant, approximately 2.718
• Logs to the base e are called natural logs, often written as ln

Basic rules for logs using base 10
• $\log(10) = 1$
• $\log(10^a) = a$
• $\log(1) = 0$
• $\log(0)$ is not defined
• $\log(x^r) = r \log(x)$
• $10^{\log(a)} = a$
• The Richter scale for measuring earthquake strength is on a $\log_{10}$ scale

And some more
• $\log(ab) = \log(a) + \log(b)$
• $\log\left(\frac{a}{b}\right) = \log a - \log b$
• $10^{ab} = (10^a)^b$; $10^{a+b} = 10^a \cdot 10^b$; $10^{a-b} = \frac{10^a}{10^b}$
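These base-10 rules are easy to verify numerically. The sketch below checks each one for arbitrarily chosen values of a, b and r, using `math.isclose` to allow for floating-point rounding.

```python
import math

a, b, r = 8.0, 2.0, 3.0   # arbitrary positive values chosen for illustration
log = math.log10

checks = {
    "log(10) = 1": math.isclose(log(10), 1.0),
    "log(10^a) = a": math.isclose(log(10 ** a), a),
    "log(1) = 0": log(1) == 0.0,
    "log(x^r) = r log(x)": math.isclose(log(a ** r), r * log(a)),
    "10^log(a) = a": math.isclose(10 ** log(a), a),
    "log(ab) = log(a) + log(b)": math.isclose(log(a * b), log(a) + log(b)),
    "log(a/b) = log(a) - log(b)": math.isclose(log(a / b), log(a) - log(b)),
    "10^(a+b) = 10^a 10^b": math.isclose(10 ** (a + b), 10 ** a * 10 ** b),
    "10^(a-b) = 10^a / 10^b": math.isclose(10 ** (a - b), 10 ** a / 10 ** b),
}

for rule, holds in checks.items():
    print(f"{rule}: {holds}")
```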

What are we going to do with all this?
• Linear model
• We can take logs of X, of Y, or of both
• We are interested in how to interpret the coefficients on the original scale
• We will see later when each transformation is appropriate
• Let us start with the model
• $Y = \alpha + \beta \cdot \log(X) + \varepsilon$

Interpretation of coefficients
• A 1-unit increase in log(X) is associated with a β increase in Y
• $\log(X) + 1 = \log(X) + \log(10) = \log(10X)$
• Converting to a percentage
• Multiplying X by 10 is equivalent to a (10-1)*100% = 900% increase in X
• β: expected change in Y when X is multiplied by 10
• β: expected change in Y when X increases by 900%

And more
• For other percentage changes p
• A p% increase in X is associated with a $\beta \cdot \log\left(\frac{100 + p}{100}\right)$ increase in Y
• A 10% increase in X is associated with a $\beta \cdot \log\left(\frac{100 + 10}{100}\right)$ increase in Y
• $= \beta \cdot \log(1.1)$ increase in Y
• $= \beta \cdot 0.041$ increase in Y

What does this mean?
• Volume = -461 + 262 logheight
• An increase of 1 in logheight will increase Volume by 262
• Multiplying height by 10 will increase Volume by 262
• A 10% increase in height will increase Volume by $\beta \cdot \log\left(\frac{100 + p}{100}\right) = 262 \cdot \log(1.1) = 10.84$
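The percentage rule for a log-transformed X can be wrapped in a one-line helper. The sketch below uses the slope 262 from the fitted equation above; the function name is illustrative.

```python
import math

def change_in_y(beta, pct_increase_in_x):
    """Expected change in Y for a p% increase in X when Y = alpha + beta * log10(X)."""
    return beta * math.log10((100 + pct_increase_in_x) / 100)

beta = 262  # slope on logheight from the fitted equation
for pct in (10, 100, 900):
    print(f"{pct}% increase in height: Volume changes by {change_in_y(beta, pct):.2f}")
```

A 900% increase is the same as multiplying height by 10, so it recovers the full coefficient.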

Next situation
• $\log(Y) = \alpha + \beta \cdot X + \varepsilon$
• A 1-unit increase in X is associated with a β increase in log Y
• Adding β to log Y on the original scale: $10^{\log Y + \beta} = Y \cdot 10^{\beta}$
• Each 1-unit increase in X multiplies the expected value of Y by $10^{\beta}$
• The effect of a c-unit increase in X is to multiply the expected value of Y by $10^{c\beta}$

More
• Calculate ch = $10^{\beta}$, the factor in $Y \cdot 10^{\beta}$
• Calculate (ch-1)*100
• ch = 1.20 implies a 20% increase
• ch = 0.7 implies a 30% decrease
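The conversion from a coefficient on the log10(Y) scale to a percentage change in Y can be sketched as follows. The function name and the β values are hypothetical, chosen so that the multipliers match the 1.20 and 0.7 examples above.

```python
import math

def percent_change_in_y(beta, c=1.0):
    """Percent change in expected Y for a c-unit increase in X
    when log10(Y) = alpha + beta * X."""
    ch = 10 ** (c * beta)      # multiplicative change in Y
    return (ch - 1) * 100

# Hypothetical slopes chosen so that ch is 1.20 and 0.7 respectively.
beta_up = math.log10(1.20)
beta_down = math.log10(0.7)

print(f"beta = {beta_up:.4f}: {percent_change_in_y(beta_up):.1f}% change in Y")
print(f"beta = {beta_down:.4f}: {percent_change_in_y(beta_down):.1f}% change in Y")
```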
