Checking model assumptions with regression diagnostics
Graeme L. Hickey, University of Liverpool
@graemeleehickey, www.glhickey.com, graeme.hickey@liverpool.ac.uk
Conflicts of interest
• None
• Assistant Editor (Statistical Consultant) for EJCTS and ICVTS
Question: who routinely checks model assumptions when analyzing data? (Raise your hand if the answer is yes.)
Outline
• Illustrate with multiple linear regression
• Plethora of residuals and diagnostics for other model types
• The focus is not on “what to do if you detect a problem”, but on “how to diagnose (potential) problems”
My personal experience*
• Reviewer for EJCTS and ICVTS for 5 years
• Authors almost never report whether they assessed model assumptions
• Example: only one paper submitted where the authors considered sphericity in RM-ANOVA at first submission
• Usually one or more comments are sent to authors regarding model assumptions
* My views do not reflect those of the EJCTS, ICVTS, or of other statistical reviewers
Linear regression modelling
• Collect some data
  • $y_i$: the observed continuous outcome for subject $i$ (e.g. biomarker)
  • $x_{1i}, x_{2i}, \dots, x_{pi}$: $p$ covariates (e.g. age, male, …)
• Want to fit the model
  • $y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \cdots + \beta_p x_{pi} + \varepsilon_i$
• Estimate the regression coefficients $\hat{\beta}_0, \hat{\beta}_1, \hat{\beta}_2, \dots, \hat{\beta}_p$
• Report the coefficients and make inference, e.g. report 95% CIs
• But we do not stop there…
Residuals
• For a linear regression model, the residual for the $i$-th observation is $r_i = y_i - \hat{y}_i$
• where $\hat{y}_i$ is the predicted value given by $\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_{1i} + \hat{\beta}_2 x_{2i} + \cdots + \hat{\beta}_p x_{pi}$
• Lots of useful diagnostics are based on residuals
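The residual definition above can be sketched in a few lines of Python. This is a minimal illustration rather than the analysis behind the slides: the helper name `fit_ols`, the simulated data, and the coefficient values are all assumptions made for the example, and NumPy's least-squares solver stands in for whatever statistical software is actually used.

```python
import numpy as np

def fit_ols(X, y):
    """Least-squares fit of y on the columns of X plus an intercept.

    Returns (beta_hat, fitted, residuals). Hypothetical helper for illustration.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    if X.ndim == 1:
        X = X[:, None]
    design = np.column_stack([np.ones(len(y)), X])  # prepend intercept column
    beta_hat, *_ = np.linalg.lstsq(design, y, rcond=None)
    fitted = design @ beta_hat       # predicted values
    residuals = y - fitted           # observed minus predicted
    return beta_hat, fitted, residuals

# Simulated data (made up for this sketch): two covariates, known coefficients.
rng = np.random.default_rng(1)
n = 200
X = rng.normal(size=(n, 2))
y = 1.0 + 2.0 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=1.0, size=n)

beta_hat, fitted, residuals = fit_ols(X, y)
print("estimated coefficients:", np.round(beta_hat, 2))
```

With an intercept in the model, the least-squares residuals sum to zero and are uncorrelated with each covariate by construction, which is why the diagnostics look for patterns beyond those two properties.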
Linearity of functional form
• Assumption: the outcome is linear in each covariate, so a scatterplot of the residuals against a covariate, $(x_i, r_i)$, should not show any systematic trends
• Trends suggest that higher-order terms are required, e.g. quadratic, cubic, etc.
[Figure, four panels. A: scatterplot of Y against X with fitted model $Y = \beta_0 + \beta_1 X + \varepsilon$. B: residuals against X for that model. C: scatterplot of Y against X with fitted model $Y = \beta_0 + \beta_1 X + \beta_2 X^2 + \varepsilon$. D: residuals against X for that model.]
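The linearity check can be mimicked numerically. The sketch below assumes simulated data with a quadratic relationship (the values are invented, not taken from the figure) and uses a crude binned mean of the residuals as a stand-in for eyeballing the residual plot; `binned_means` is a hypothetical helper, not a standard diagnostic.

```python
import numpy as np

def residuals_from_fit(design, y):
    """Residuals from a least-squares fit of y on the given design matrix."""
    beta_hat, *_ = np.linalg.lstsq(design, y, rcond=None)
    return y - design @ beta_hat

def binned_means(x, r, n_bins=5):
    """Mean residual within equal-count bins of x (a crude smoother)."""
    order = np.argsort(x)
    return np.array([chunk.mean() for chunk in np.array_split(r[order], n_bins)])

# Simulated data (assumption): the true relationship is quadratic in x.
rng = np.random.default_rng(2)
x = rng.uniform(0, 20, size=300)
y = 5.0 + 0.2 * x**2 + rng.normal(scale=3.0, size=x.size)

ones = np.ones_like(x)
r_linear = residuals_from_fit(np.column_stack([ones, x]), y)
r_quadratic = residuals_from_fit(np.column_stack([ones, x, x**2]), y)

trend_linear = binned_means(x, r_linear)        # U-shaped: model is misspecified
trend_quadratic = binned_means(x, r_quadratic)  # roughly flat around zero
print("linear fit:   ", np.round(trend_linear, 2))
print("quadratic fit:", np.round(trend_quadratic, 2))
```

In practice one would plot `r_linear` and `r_quadratic` against `x` with a plotting library; the binned means only summarise the trend such a plot would show.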
Homogeneity
• We often assume that $\varepsilon_i \sim N(0, \sigma^2)$
• The assumption here is that the variance is constant, i.e. homogeneous
• Estimates and predictions are robust to violation, but inferences (e.g. F-tests, confidence intervals) are not
• We should not see any pattern in a scatterplot of the residuals against the fitted values, $(\hat{y}_i, r_i)$
• Residuals should be symmetric about 0
[Figure, two panels of residuals against fitted values. A: homoscedastic residuals. B: heteroscedastic residuals.]
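A rough numerical companion to the residual-versus-fitted plot is to compare the spread of the residuals across the range of fitted values. The sketch below rests on assumptions: the data are simulated, the error standard deviations are invented, and `spread_by_fitted` is a hypothetical helper rather than a formal test of constant variance.

```python
import numpy as np

def spread_by_fitted(fitted, residuals, n_bins=4):
    """Standard deviation of residuals within equal-count bins of fitted values."""
    order = np.argsort(fitted)
    chunks = np.array_split(residuals[order], n_bins)
    return np.array([chunk.std(ddof=1) for chunk in chunks])

def simulate(rng, heteroscedastic):
    """Simulate a simple regression and return (fitted, residuals)."""
    x = rng.uniform(0, 25, size=400)
    # Error SD either grows with x or stays constant (values are made up).
    sd = 0.3 * x + 0.5 if heteroscedastic else np.full_like(x, 3.0)
    y = 1.0 + 1.0 * x + rng.normal(size=x.size) * sd
    design = np.column_stack([np.ones_like(x), x])
    beta_hat, *_ = np.linalg.lstsq(design, y, rcond=None)
    fitted = design @ beta_hat
    return fitted, y - fitted

rng = np.random.default_rng(3)
spread_homo = spread_by_fitted(*simulate(rng, heteroscedastic=False))
spread_hetero = spread_by_fitted(*simulate(rng, heteroscedastic=True))
print("constant variance:", np.round(spread_homo, 2))    # similar across bins
print("growing variance: ", np.round(spread_hetero, 2))  # increases with fitted value
```

A fan shape in the plot corresponds to the binned standard deviations rising (or falling) steadily from the first bin to the last.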
Normality
• If we want to make inferences, we generally assume $\varepsilon_i \sim N(0, \sigma^2)$
• Not always a critical assumption, e.g.:
  • Want to estimate the ‘best fit’ line
  • Want to make predictions
  • The sample size is quite large and the other assumptions are met
• We can assess graphically using a Q-Q plot or histogram of the residuals
• Note: the assumption is about the errors, not the outcomes $y_i$
[Figure: normal Q-Q plots (sample quantiles against theoretical quantiles) and histograms (frequency against residuals), shown for normal residuals and for skewed residuals.]
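The coordinates of a normal Q-Q plot are straightforward to compute by hand, which may help make clear what the plot shows. The sketch below is illustrative only: the plotting positions $(k - 0.5)/n$ are one common convention among several, the two residual vectors are simulated stand-ins, and the correlation summary is an informal aid rather than a formal normality test.

```python
import numpy as np
from statistics import NormalDist

def qq_points(residuals):
    """Coordinates of a normal Q-Q plot: (theoretical, sample) quantiles."""
    r = np.sort(np.asarray(residuals, dtype=float))
    n = r.size
    probs = (np.arange(1, n + 1) - 0.5) / n          # plotting positions (one convention)
    theoretical = np.array([NormalDist().inv_cdf(p) for p in probs])
    sample = (r - r.mean()) / r.std(ddof=1)          # standardised sorted residuals
    return theoretical, sample

def qq_correlation(residuals):
    """Correlation between the Q-Q coordinates; near 1 means a near-straight line."""
    theoretical, sample = qq_points(residuals)
    return np.corrcoef(theoretical, sample)[0, 1]

# Simulated stand-ins for model residuals (assumptions for this sketch).
rng = np.random.default_rng(4)
normal_res = rng.normal(size=200)
skewed_res = rng.exponential(scale=3.0, size=200) - 3.0  # right-skewed, mean near 0

print("normal residuals:", round(qq_correlation(normal_res), 3))
print("skewed residuals:", round(qq_correlation(skewed_res), 3))
```

Plotting `sample` against `theoretical` gives the Q-Q plot itself: points close to a straight line are consistent with normal errors, while a curved pattern points to skewness or heavy tails.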
Independence
• We assume the errors are independent
• We can usually judge whether this assumption holds from the study design and analysis plan
• E.g. with repeated measures, we should not treat each measurement as independent
• If independence holds, plotting the residuals against time (or the order of the observations) should show no pattern
[Figure, two panels of residuals against X. A: independent. B: non-independent.]
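When the observations have a natural order, serial dependence in the residuals can also be summarised numerically. The sketch below computes the lag-1 autocorrelation and the Durbin-Watson statistic (values near 2 suggest no first-order serial correlation; values well below 2 suggest positive correlation). The two series are simulated stand-ins for time-ordered residuals, and the autoregressive coefficient of 0.8 is an arbitrary choice for illustration.

```python
import numpy as np

def lag1_autocorrelation(r):
    """Correlation between each residual and the one before it."""
    r = np.asarray(r, dtype=float)
    r = r - r.mean()
    return np.sum(r[1:] * r[:-1]) / np.sum(r**2)

def durbin_watson(r):
    """Durbin-Watson statistic: sum of squared successive differences over sum of squares."""
    r = np.asarray(r, dtype=float)
    return np.sum(np.diff(r) ** 2) / np.sum(r**2)

# Simulated stand-ins for residuals in observation order (assumptions).
rng = np.random.default_rng(5)
n = 300
independent = rng.normal(size=n)

# AR(1) series: each value carries over 0.8 of the previous one.
correlated = np.empty(n)
correlated[0] = rng.normal()
for t in range(1, n):
    correlated[t] = 0.8 * correlated[t - 1] + rng.normal()

for name, r in [("independent", independent), ("correlated", correlated)]:
    print(name, round(lag1_autocorrelation(r), 2), round(durbin_watson(r), 2))
```

These summaries only detect dependence between neighbouring observations; clustering from repeated measures on the same subject still has to be identified from the study design.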