Checking model assumptions with regression diagnostics Graeme L. - PowerPoint PPT Presentation

@graemeleehickey www.glhickey.com graeme.hickey@liverpool.ac.uk Checking model assumptions with regression diagnostics Graeme L. Hickey University of Liverpool

Conflicts of interest
• None
• Assistant Editor (Statistical Consultant) for EJCTS and ICVTS

Question : who routinely checks model assumptions when analyzing data? (raise your hand if the answer is Yes )

Outline
• Illustrate with multiple linear regression
• Plethora of residuals and diagnostics for other model types
• Focus is not on “what to do if you detect a problem”, but on “how to diagnose (potential) problems”

My personal experience*
• Reviewer for EJCTS and ICVTS for 5 years
• Authors almost never report whether they assessed model assumptions
• Example: only one paper submitted where authors considered sphericity in RM-ANOVA at first submission
• Usually one or more comments are sent to authors regarding model assumptions
* My views do not reflect those of the EJCTS, ICVTS, or of other statistical reviewers

Linear regression modelling
• Collect some data
  • $y_i$: the observed continuous outcome for subject $i$ (e.g. biomarker)
  • $x_{1i}, x_{2i}, \dots, x_{pi}$: $p$ covariates (e.g. age, male, …)
• Want to fit the model
  • $y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \cdots + \beta_p x_{pi} + \varepsilon_i$
• Estimate the regression coefficients
  • $\hat{\beta}_0, \hat{\beta}_1, \hat{\beta}_2, \dots, \hat{\beta}_p$
• Report the coefficients and make inference, e.g. report 95% CIs
• But we do not stop there…

Residuals
• For a linear regression model, the residual for the $i$-th observation is $r_i = y_i - \hat{y}_i$
• where $\hat{y}_i$ is the predicted value given by $\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_{1i} + \hat{\beta}_2 x_{2i} + \cdots + \hat{\beta}_p x_{pi}$
• Lots of useful diagnostics are based on residuals
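The fit-then-inspect workflow above can be sketched in a few lines. This is a minimal illustration, assuming a single covariate so that the least-squares estimates have a closed form; the data values and the function names `fit_simple_ols` and `residuals` are made up for the example and do not come from the slides.

```python
from statistics import mean


def fit_simple_ols(x, y):
    """Return (intercept, slope) of the least-squares line y = b0 + b1 * x."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1


def residuals(x, y, b0, b1):
    """Residual = observed outcome minus predicted value, per observation."""
    return [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]


# Made-up illustrative data (an assumption, not taken from the slides).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

b0, b1 = fit_simple_ols(x, y)
r = residuals(x, y, b0, b1)
print(f"intercept={b0:.3f}, slope={b1:.3f}")
print("residuals:", [round(ri, 3) for ri in r])
```

With an intercept in the model, least-squares residuals sum to zero and are uncorrelated with the covariate, which is why diagnostics look for patterns beyond those two properties.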

Linearity of functional form
• Assumption: scatterplot of $(x_i, r_i)$ should not show any systematic trends
• Trends imply that higher-order terms are required, e.g. quadratic, cubic, etc.

[Figure: four panels. A: Y against X, fitted model $Y = \beta_0 + \beta_1 X + \varepsilon$. B: residuals against X for that model. C: Y against X, fitted model $Y = \beta_0 + \beta_1 X + \beta_2 X^2 + \varepsilon$. D: residuals against X for that model.]
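A rough numerical counterpart to the residual-against-covariate plot is sketched below. It assumes simulated data with a quadratic trend, fits a straight line, and then averages the residuals within thirds of the covariate range. The simulated coefficients, the seed, and the helper names are illustrative assumptions; this is not a formal test.

```python
import random
from statistics import mean


def fit_line(x, y):
    """Least-squares intercept and slope for one covariate."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1


def mean_residual_by_thirds(x, r):
    """Mean residual in the lowest, middle and highest third of x."""
    ordered = [ri for _, ri in sorted(zip(x, r))]
    k = len(ordered) // 3
    return mean(ordered[:k]), mean(ordered[k:-k]), mean(ordered[-k:])


random.seed(1)
x = [0.5 * i for i in range(41)]  # 0, 0.5, ..., 20
# Simulated outcome with a quadratic trend (an assumption for illustration).
y = [1.0 + 0.2 * xi ** 2 + random.gauss(0, 1) for xi in x]

b0, b1 = fit_line(x, y)  # deliberately misspecified: no quadratic term
r = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
low, mid, high = mean_residual_by_thirds(x, r)
print(f"mean residual by thirds of x: {low:.2f}, {mid:.2f}, {high:.2f}")
```

When the straight-line model misses curvature, the outer thirds and the middle third tend to have mean residuals of opposite sign, which is the U-shaped trend a residual plot would show.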

Homogeneity
• We often assume that $\varepsilon_i \sim N(0, \sigma^2)$
• The assumption here is that the variance is constant, i.e. homogeneous
• Estimates and predictions are robust to violation, but not inferences (e.g. F-tests, confidence intervals)
• We should not see any pattern in a scatterplot of $(\hat{y}_i, r_i)$
• Residuals should be symmetric about 0

[Figure: residuals against fitted values. A: homoscedastic residuals. B: heteroscedastic residuals.]
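One crude way to put a number on the pattern in a residual-against-fitted plot is to compare the residual variance in the upper half of the fitted values with that in the lower half. The sketch below assumes simulated data, once with constant error variance and once with error spread growing with the covariate. The function name `variance_ratio` and all simulation settings are hypothetical, and the ratio is a descriptive aid, not a formal test.

```python
import random
from statistics import mean, variance


def fit_line(x, y):
    """Least-squares intercept and slope for one covariate."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1


def variance_ratio(x, y):
    """Residual variance for the upper half of fitted values over the lower half."""
    b0, b1 = fit_line(x, y)
    fitted = [b0 + b1 * xi for xi in x]
    resid = [yi - fi for yi, fi in zip(y, fitted)]
    ordered = [ri for _, ri in sorted(zip(fitted, resid))]
    half = len(ordered) // 2
    return variance(ordered[half:]) / variance(ordered[:half])


random.seed(2)
x = [0.25 * i for i in range(1, 101)]
# Constant error spread versus spread that grows with x (both simulated).
y_homo = [2.0 + 0.5 * xi + random.gauss(0, 2.0) for xi in x]
y_hetero = [2.0 + 0.5 * xi + random.gauss(0, 0.3 * xi) for xi in x]

ratio_homo = variance_ratio(x, y_homo)
ratio_hetero = variance_ratio(x, y_hetero)
print(f"variance ratio, constant spread: {ratio_homo:.2f}")
print(f"variance ratio, growing spread:  {ratio_hetero:.2f}")
```

A ratio far from 1 points to the fan shape seen in the heteroscedastic panel; a ratio near 1 is what constant variance would tend to produce.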

Normality
• If we want to make inferences, we generally assume $\varepsilon_i \sim N(0, \sigma^2)$
• Not always a critical assumption, e.g.:
  • Want to estimate the ‘best fit’ line
  • Want to make predictions
  • The sample size is quite large and the other assumptions are met
• We can assess graphically using a Q-Q plot or histogram
• Note: the assumption is about the errors, not the outcomes $y_i$

[Figure: normal Q-Q plots (sample quantiles against theoretical quantiles) and histograms (frequency against residuals), shown for normal residuals and for skewed residuals.]
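The coordinates of a normal Q-Q plot can be computed without a plotting library, as sketched below. The sketch assumes plotting positions of (k - 0.5)/n, which is one common choice among several, and uses simulated values as stand-ins for residuals. The correlation between the two sets of quantiles is added only as a rough summary of how straight the plot is; the function names are hypothetical.

```python
import random
from statistics import NormalDist, mean


def qq_points(residuals):
    """Theoretical standard-normal quantiles paired with sorted residuals."""
    n = len(residuals)
    std_normal = NormalDist()
    # Plotting positions (k - 0.5) / n are an assumed convention.
    theoretical = [std_normal.inv_cdf((k - 0.5) / n) for k in range(1, n + 1)]
    return theoretical, sorted(residuals)


def pearson(a, b):
    """Pearson correlation between two equal-length sequences."""
    a_bar, b_bar = mean(a), mean(b)
    num = sum((ai - a_bar) * (bi - b_bar) for ai, bi in zip(a, b))
    den = (sum((ai - a_bar) ** 2 for ai in a)
           * sum((bi - b_bar) ** 2 for bi in b)) ** 0.5
    return num / den


random.seed(3)
# Simulated stand-ins for residuals: one normal set, one right-skewed set.
normal_resid = [random.gauss(0, 2) for _ in range(200)]
skewed_resid = [random.expovariate(0.5) - 2.0 for _ in range(200)]

theo, samp_normal = qq_points(normal_resid)
_, samp_skewed = qq_points(skewed_resid)
corr_normal = pearson(theo, samp_normal)
corr_skewed = pearson(theo, samp_skewed)
print(f"Q-Q correlation, normal residuals: {corr_normal:.3f}")
print(f"Q-Q correlation, skewed residuals: {corr_skewed:.3f}")
```

Points that fall close to a straight line give a correlation near 1; the curved pattern produced by skewed residuals pulls it lower.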

Independence
• We assume the errors are independent
• We can usually judge whether this assumption holds from the study design and analysis plan
• E.g. if repeated measures, we should not treat each measurement as independent
• If independence holds, plotting the residuals against the time (or order of the observations) should show no pattern

[Figure: residuals against X. A: independent. B: non-independent.]
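A numerical companion to the residual-against-order plot is the Durbin-Watson statistic, which is not mentioned in the slides and is introduced here as one possible check. It is the sum of squared successive differences of the residuals divided by their sum of squares: values near 2 are consistent with no first-order autocorrelation, and values near 0 indicate positive autocorrelation. The sketch below uses simulated series as stand-ins for time-ordered residuals.

```python
import random


def durbin_watson(residuals):
    """Sum of squared successive differences over the sum of squared residuals."""
    num = sum(
        (residuals[t] - residuals[t - 1]) ** 2 for t in range(1, len(residuals))
    )
    den = sum(r ** 2 for r in residuals)
    return num / den


random.seed(4)
# Simulated stand-ins for residuals listed in time order.
independent = [random.gauss(0, 1) for _ in range(100)]

# First-order autoregressive series: each value carries over 90% of the last.
correlated = [0.0]
for _ in range(99):
    correlated.append(0.9 * correlated[-1] + random.gauss(0, 1))

dw_independent = durbin_watson(independent)
dw_correlated = durbin_watson(correlated)
print(f"Durbin-Watson, independent series: {dw_independent:.2f}")
print(f"Durbin-Watson, correlated series:  {dw_correlated:.2f}")
```

The statistic only picks up correlation between neighbouring observations, so it complements rather than replaces a check of the study design for repeated measures or clustering.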
