Workshop 7.2a: Introduction to Linear Models
Murray Logan
19 Jul 2017
Section 1 Revision
Aims of statistical modelling
Use samples to:
• describe relationships
• test inferences about relationships and effects
• build predictive models
Mathematical models

[Figure: a straight line through x-y space, $y = \beta_0 + \beta_1 x$]
Statistical models

[Figure: data points scattered around the line $y = \beta_0 + \beta_1 x + \varepsilon$, where $\varepsilon \sim N(0, \sigma^2)$]
Linear models

[Figure: a straight-line fit to data, $y = \beta_0 + \beta_1 x + \varepsilon$]
Linear models

[Figure: a curved (quadratic) fit to data, $y = \beta_0 + \beta_1 x + \beta_2 x^2$ - still a linear model, because it is linear in the parameters]
Non-linear models

[Figure: an exponential curve through data, $y = \alpha \beta^x$ - non-linear in the parameters]
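As a quick illustration, the three mean functions above can be sketched in base R. This is a minimal sketch; the coefficient values are arbitrary, chosen only to make the shapes visible.

> b0 <- 2; b1 <- 1.5                                   # illustrative intercept and slope
> curve(b0 + b1 * x, from = 0, to = 6, ylab = "y")     # straight line: linear model
> curve(b0 + b1 * x + 0.3 * x^2, add = TRUE, lty = 2)  # quadratic: still linear in the parameters
> curve(2 * 3^x, add = TRUE, lty = 3)                  # exponential: non-linear in the parameters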
Linear models

$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$

response variable = population intercept + (population slope × predictor variable) + error

The intercept term and the slope term together form the systematic component of the model; the error term is the stochastic component.

In matrix terms: the response is a vector, the intercept and slope are single values, the predictor is a vector, and the error is a vector.
Vectors and Matrices

Vector (has length ONLY):

$\begin{pmatrix} 3.0 \\ 2.5 \\ 6.0 \\ 5.5 \\ 9.0 \\ 8.6 \\ 12.0 \end{pmatrix}$

Matrix (has length AND width):

$\begin{pmatrix} 1 & 0 \\ 1 & 1 \\ 1 & 2 \\ 1 & 3 \\ 1 & 4 \\ 1 & 5 \\ 1 & 6 \end{pmatrix}$
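In R, the response vector and model matrix for these data can be built directly; a minimal sketch using the values shown above:

> Y <- c(3, 2.5, 6, 5.5, 9, 8.6, 12)   # response vector
> X <- 0:6                             # predictor vector
> model.matrix(~X)                     # a column of 1s for the intercept plus the predictor column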
Estimation

[Figure: data points with the fitted line $y = \beta_0 + \beta_1 x + \varepsilon$]

Ordinary Least Squares
Estimation

 Y     X
 3.0   0
 2.5   1
 6.0   2
 5.5   3
 9.0   4
 8.6   5
12.0   6

$3.0 = \beta_0 \times 1 + \beta_1 \times 0 + \varepsilon_1$
$2.5 = \beta_0 \times 1 + \beta_1 \times 1 + \varepsilon_2$
$6.0 = \beta_0 \times 1 + \beta_1 \times 2 + \varepsilon_3$
$5.5 = \beta_0 \times 1 + \beta_1 \times 3 + \varepsilon_4$
Estimation

$3.0 = \beta_0 \times 1 + \beta_1 \times 0 + \varepsilon_1$
$2.5 = \beta_0 \times 1 + \beta_1 \times 1 + \varepsilon_2$
$6.0 = \beta_0 \times 1 + \beta_1 \times 2 + \varepsilon_3$
$5.5 = \beta_0 \times 1 + \beta_1 \times 3 + \varepsilon_4$
$9.0 = \beta_0 \times 1 + \beta_1 \times 4 + \varepsilon_5$
$8.6 = \beta_0 \times 1 + \beta_1 \times 5 + \varepsilon_6$
$12.0 = \beta_0 \times 1 + \beta_1 \times 6 + \varepsilon_7$

In matrix form (response values = model matrix × parameter vector + residual vector):

$\begin{pmatrix} 3.0 \\ 2.5 \\ 6.0 \\ 5.5 \\ 9.0 \\ 8.6 \\ 12.0 \end{pmatrix} = \begin{pmatrix} 1 & 0 \\ 1 & 1 \\ 1 & 2 \\ 1 & 3 \\ 1 & 4 \\ 1 & 5 \\ 1 & 6 \end{pmatrix} \begin{pmatrix} \beta_0 \\ \beta_1 \end{pmatrix} + \begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \varepsilon_3 \\ \varepsilon_4 \\ \varepsilon_5 \\ \varepsilon_6 \\ \varepsilon_7 \end{pmatrix}$
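Written this way, the ordinary least squares solution is just the normal equations, $\hat{\beta} = (X^\top X)^{-1} X^\top y$. A minimal sketch in R, checked against lm():

> x <- 0:6
> y <- c(3, 2.5, 6, 5.5, 9, 8.6, 12)
> X <- cbind(1, x)                 # model matrix: intercept column and predictor
> solve(t(X) %*% X, t(X) %*% y)    # beta = (X'X)^-1 X'y: the OLS parameter vector
> coef(lm(y ~ x))                  # the same estimates from R's built-in fit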
Inference testing

H0: $\beta_1 = 0$ (slope equals zero)

The t-statistic: $t = \dfrac{\text{param}}{SE_{\text{param}}}$, which here is $t = \dfrac{\beta_1}{SE_{\beta_1}}$
Inference testing

H0: $\beta_1 = 0$ (slope equals zero)

[Figure: the observed t-statistic compared against a t distribution centred on zero]
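In R, the t-statistic and its p-value come straight out of summary(); a minimal sketch using the toy data from above:

> x <- 0:6; y <- c(3, 2.5, 6, 5.5, 9, 8.6, 12)
> fit <- lm(y ~ x)
> summary(fit)$coefficients                     # estimate, SE, t value and Pr(>|t|) per parameter
> coef(fit)["x"] / sqrt(vcov(fit)["x", "x"])    # t = beta1 / SE(beta1), computed by hand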
Section 2 Linear model Assumptions
Assumptions

• Independence - unbiased, scale of treatment
• Normality - of residuals
• Homogeneity of variance - of residuals
• Linearity
Assumptions

[Figure: Normality - the residuals at each x should be drawn from a normal distribution]

[Figure: Homogeneity of variance - plots of y vs x and of residuals vs predicted values, contrasting constant spread with spread that increases with the predicted value]

[Figure: Linearity - scatterplots with a straight trendline, a loess (lowess) smoother, and a spline smoother]
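These assumptions are usually checked graphically from the residuals of the fitted model; a minimal sketch using base R's built-in lm diagnostics:

> x <- 0:6; y <- c(3, 2.5, 6, 5.5, 9, 8.6, 12)
> fit <- lm(y ~ x)          # any fitted linear model
> plot(fit, which = 1)      # residuals vs fitted: checks homogeneity of variance and linearity
> plot(fit, which = 2)      # normal Q-Q plot of residuals: checks normality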
Assumptions

$y_i = \beta_0 + \beta_1 \times x_i + \varepsilon_i$, where $\varepsilon_i \sim N(0, \sigma^2)$
Example

Make these data and call the data frame DATA:

 Y     X
 3.0   0
 2.5   1
 6.0   2
 5.5   3
 9.0   4
 8.6   5
12.0   6

• try this

> DATA <- data.frame(Y=c(3, 2.5, 6.0, 5.5, 9.0, 8.6, 12), X=0:6)
Worked Examples

Question: is there a relationship between fertilizer concentration and grass yield?

Linear model:

$Y_i = \beta_0 + \beta_1 F_i + \varepsilon_i$, where $\varepsilon_i \sim N(0, \sigma^2)$

> fert <- read.csv('../data/fertilizer.csv', strip.white=T)
> head(fert)
  FERTILIZER YIELD
1         25    84
2         50    80
3         75    90
4        100   154
5        125   148
6        150   169
> str(fert)
'data.frame': 10 obs. of 2 variables:
 $ FERTILIZER: int 25 50 75 100 125 150 175 200 225 250
 $ YIELD     : int 84 80 90 154 148 169 206 244 212 248
> summary(fert)
   FERTILIZER         YIELD
 Min.   : 25.00   Min.   : 80.0
 1st Qu.: 81.25   1st Qu.:104.5
 Median :137.50   Median :161.5
 Mean   :137.50   Mean   :163.5
 3rd Qu.:193.75   3rd Qu.:210.5
 Max.   :250.00   Max.   :248.0
> library(INLA)
> fert.inla <- inla(YIELD ~ FERTILIZER, data=fert)
> summary(fert.inla)
Call: inla(formula = YIELD ~ FERTILIZER, data = fert)

Time used:
 Pre-processing Running inla Post-processing  Total
         0.3043       0.0715          0.0217 0.3974

Fixed effects:
               mean      sd 0.025quant 0.5quant 0.975quant    mode kld
(Intercept) 51.9341 12.9747    25.9582  51.9335    77.8990 51.9339   0
FERTILIZER   0.8114  0.0836     0.6439   0.8114     0.9788  0.8114   0

The model has no random effects

Model hyperparameters:
                                          mean     sd 0.025quant 0.5quant 0.975quant   mode
Precision for the Gaussian observations 0.0035 0.0015     0.0012   0.0032      0.007 0.0028

Expected number of effective parameters(std dev): 2.00(0.00)
Number of equivalent replicates : 5.00
Marginal log-Likelihood: -61.65
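As a cross-check (not part of the original output), the same model can be fitted by ordinary least squares with lm(); with INLA's default vague priors the point estimates should closely match the posterior means above.

> fert.lm <- lm(YIELD ~ FERTILIZER, data=fert)
> summary(fert.lm)$coefficients    # compare with the INLA fixed-effects table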
Example - exploratory data analysis

> library(car)
> scatterplot(Y~X, data=DATA)

[Figure: scatterplot of Y against X]
Example - exploratory data analysis

> library(car)
> peake <- read.csv('../data/peake.csv')
> scatterplot(SPECIES ~ AREA, data=peake)

[Figure: scatterplot of SPECIES against AREA]
Example - exploratory data analysis

> scatterplot(SPECIES ~ AREA, data=peake, smoother=gamLine)

[Figure: scatterplot of SPECIES against AREA with a GAM smoother]
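If the smoother suggests curvature, one follow-up (a sketch, not part of the original slides) is to fit the straight-line model anyway and inspect its residuals; a curved residual pattern would confirm that the linearity assumption is doubtful.

> peake.lm <- lm(SPECIES ~ AREA, data=peake)
> plot(peake.lm, which = 1)    # curvature in residuals vs fitted indicates non-linearity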