STAT 215 Regression Inference
Colin Reimer Dawson
Oberlin College
October 12, 2017
Outline
• Regression Inference
• Simulation Approaches
• Partitioning Variability
Sample vs Population “Best-Fit” Line
• For a sample: choose the intercept and slope that minimize the sum of squared errors.
• But this does not yield the “correct” (or even “best”) model for the population, due to sampling error.
[Figure: log10(Price ($K)) vs. Area (sq. ft.): the population best-fit line alongside best-fit lines from four different samples (Sample 1–4).]
Reminder: Sampling Distributions
Sampling Distribution: The sampling distribution of a sample statistic (e.g., β̂₁ for β₁, or Ȳ for μ_Y) is the distribution that statistic has across all possible samples from the population.
Predicting Home Prices in Ames, Iowa
[Figure: histogram of sample slopes (count vs. sample slope, roughly −0.0004 to 0.0008), approximating the sampling distribution of β̂₁.]
Two Methods for Estimating the Sampling Distribution
1. t-distribution: assumes Normal residuals (along with the other regression conditions)
2. Bootstrap distribution: no Normality assumption needed
Normal Residuals
If ε ∼ N(0, σ_ε), then (β̂ᵢ − βᵢ) / SE_β̂ᵢ ∼ t_{n−2}.
[Figure: density of (β̂₁ − β₁) / SE_β̂₁ across simulated samples, closely following a t curve.]
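The claim on this slide can be checked by simulation. The sketch below (Python rather than the course's R, with made-up values for n, the true coefficients, and the error SD) repeatedly draws samples with Normal errors, fits the least-squares line, and standardizes the slope; the empirical quantiles of the resulting statistics should match the t distribution with n − 2 degrees of freedom.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, beta0, beta1, sigma = 10, 2.0, 3.0, 1.0   # hypothetical true model
x = np.linspace(0, 10, n)

t_stats = []
for _ in range(5000):
    y = beta0 + beta1 * x + rng.normal(0, sigma, n)
    # Least-squares fit (np.polyfit returns [slope, intercept] for degree 1)
    b1, b0 = np.polyfit(x, y, 1)
    resid = y - (b0 + b1 * x)
    # Usual standard error of the slope: s / sqrt(Sxx), with s^2 = SSE/(n-2)
    se_b1 = np.sqrt(resid @ resid / (n - 2)) / np.sqrt(((x - x.mean()) ** 2).sum())
    t_stats.append((b1 - beta1) / se_b1)

t_stats = np.array(t_stats)
# The empirical 97.5% quantile should be close to the t(n-2) quantile
print(np.quantile(t_stats, 0.975), stats.t.ppf(0.975, df=n - 2))
```

With n = 10 the reference quantile is t*₈(0.975) ≈ 2.306, noticeably wider than the Normal 1.96 — which is why small-sample regression intervals use t rather than z.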
t-based Confidence Interval
CI_{1−α}: β̂ᵢ ± t*_{n−2}(1 − α/2) · SE_β̂ᵢ
where t*_{n−2}(1 − α/2) is the 1 − α/2 quantile of the t_{n−2} distribution.
sample10 <- sample(Ames, 10) ## This would just be our dataset
sample.model <- lm(Price ~ Area, data = sample10)
summary(sample.model)$coefficients %>% round(digits = 3)

              Estimate Std. Error t value Pr(>|t|)
(Intercept) 154146.420  34633.509   4.451    0.002
Area            16.231     15.503   1.047    0.326

MoE.95 <- qt(0.975, df = 10 - 2) * 15.503
CI.95 <- c(16.231 - MoE.95, 16.231 + MoE.95)
CI.95

[1] -19.51898  51.98098
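The margin-of-error arithmetic above is language-neutral; here is the same computation in Python, plugging in the slope estimate (16.231), its standard error (15.503), and n = 10 from the R output, with `stats.t.ppf` playing the role of R's `qt`.

```python
from scipy import stats

# Values taken from the R summary above
est, se, n = 16.231, 15.503, 10

# Margin of error: t quantile with n - 2 = 8 df, same as qt(0.975, df = 8) in R
moe = stats.t.ppf(0.975, df=n - 2) * se
ci = (est - moe, est + moe)
print(ci)  # approximately (-19.519, 51.981), matching the R result
```

The interval covers 0, consistent with the slope's large two-sided P-value (0.326) on this tiny sample.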
confint(sample.model, level = 0.95)

                  2.5 %       97.5 %
(Intercept) 74281.40618 234011.43423
Area          -19.51942     51.98123
Correlation Test and Interval
We can also estimate the sampling distribution of the correlation r using t_{n−2}, where

SE_r = sqrt((1 − r²) / (n − 2))                  (1)
CI_{1−α}: r ± t*_{n−2}(1 − α/2) · SE_r          (2)
t_obs = (r − 0) / SE_r                          (3)
Bootstrap Distribution
Bootstrap Distribution
[Figure: Our actual sample]
[Figure: Our simulated population]
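The bootstrap idea — treat the sample as a stand-in for the population and resample it with replacement — can be sketched in a few lines. This Python version uses a small synthetic area/price sample (hypothetical numbers, not the actual Ames data) and forms a percentile-based 95% interval for the slope.

```python
import numpy as np

rng = np.random.default_rng(2)
# A small synthetic "sample" standing in for the Ames data (illustration only)
n = 30
area = rng.uniform(800, 3000, n)
price = 150_000 + 20 * area + rng.normal(0, 30_000, n)

boot_slopes = []
for _ in range(2000):
    idx = rng.integers(0, n, n)              # resample rows with replacement
    b1, b0 = np.polyfit(area[idx], price[idx], 1)
    boot_slopes.append(b1)

# Percentile bootstrap 95% CI for the slope: middle 95% of bootstrap slopes
lo, hi = np.quantile(boot_slopes, [0.025, 0.975])
print(lo, hi)
```

No Normality assumption is used anywhere: the spread of `boot_slopes` itself estimates the sampling variability of β̂₁.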
Illustrated Simulation
http://lock5stat.com/statkey
Permutation Test: Slope
To test H₀: β₁ = 0, we want the probability that a random β̂₁ is as large as or larger than the observed β̂₁, assuming H₀ is true (β₁ = 0).
1. Simulate H₀ by randomly pairing X and Y values, and compute β̂₁ for each pseudodataset.
2. Repeat many times.
3. Calculate the proportion of random β̂₁ that exceed the actual β̂₁. This is the P-value of the test.
4. If P < α for a predetermined α, reject H₀.
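The four steps above can be sketched directly. This Python version (synthetic data in which H₀ really is true) shuffles Y against X to break any association, recomputes the slope each time, and reports a two-sided P-value (using |β̂₁|, a common variant of step 3's one-sided proportion).

```python
import numpy as np

rng = np.random.default_rng(3)
n = 40
x = rng.normal(size=n)
y = rng.normal(size=n)               # H0 is true here: no relationship

def slope(x, y):
    # Least-squares slope of y on x
    return np.polyfit(x, y, 1)[0]

b1_obs = slope(x, y)

# Step 1-2: randomly re-pair X and Y many times, recording each slope
perm_slopes = np.array([slope(x, rng.permutation(y)) for _ in range(2000)])

# Step 3 (two-sided): proportion of permuted slopes at least as extreme
p_value = np.mean(np.abs(perm_slopes) >= abs(b1_obs))
print(p_value)
```

Because H₀ holds in this synthetic example, the P-value should usually not be small; rerunning with a real x–y relationship drives it toward 0.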
Permutation Test: Correlation
To test H₀: ρ = 0, we want the probability that a random r is as large as or larger than the observed r, assuming H₀ is true (ρ = 0).
1. Simulate H₀ by randomly pairing X and Y values, and compute r for each pseudodataset.
2. Repeat many times.
3. Calculate the proportion of random r that exceed the actual r. This is the P-value of the test.
4. If P < α for a predetermined α, reject H₀.
Illustrated Simulation
http://lock5stat.com/statkey
ANOVA for Regression
Y = f(X) + ε
DATA = PATTERN + IDIOSYNCRASIES
Total Variation = Explained Variation + Unexplained Variation
Y − Ȳ = (Ŷ − Ȳ) + (Y − Ŷ)
Σᵢ (Yᵢ − Ȳ)² = Σᵢ (Ŷᵢ − Ȳ)² + 0 + Σᵢ (Yᵢ − Ŷᵢ)²   (the cross term is zero for least squares)
SSTotal = SSModel + SSError
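The sums-of-squares identity holds exactly for any least-squares fit, which is easy to check numerically. A Python sketch on synthetic data (hypothetical x and y, chosen only to demonstrate the decomposition):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 50
x = rng.uniform(0, 10, n)
y = 2 + 3 * x + rng.normal(0, 2, n)   # synthetic data

b1, b0 = np.polyfit(x, y, 1)          # least-squares fit
y_hat = b0 + b1 * x

ss_total = ((y - y.mean()) ** 2).sum()   # SSTotal
ss_model = ((y_hat - y.mean()) ** 2).sum()  # SSModel (explained)
ss_error = ((y - y_hat) ** 2).sum()      # SSError (unexplained)

# The cross term vanishes for least-squares fits, so the identity is exact
print(ss_total, ss_model + ss_error)
```

The cross term 2·Σ(Ŷᵢ − Ȳ)(Yᵢ − Ŷᵢ) is zero because least-squares residuals are orthogonal to the fitted values, which is exactly why the decomposition is additive.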
“Omnibus” F-test for a Regression Model
F = (SSModel / df_Model) / (SSError / df_Error) = MSModel / MSError
This statistic has an F distribution with the corresponding df if the null model is correct (i.e., Y = β₀ + ε).

BrainBodyWeight <- read.file("http://colindawson.net/data/BrainBodyWeight.csv")
brain.model <- lm(log(brain.weight.grams) ~ log(body.weight.kilograms),
                  data = BrainBodyWeight)
anova(brain.model)

Analysis of Variance Table
Response: log(brain.weight.grams)
                           Df Sum Sq Mean Sq F value    Pr(>F)
log(body.weight.kilograms)  1 336.19  336.19  697.42 < 2.2e-16 ***
Residuals                  60  28.92    0.48
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
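The F computation itself is straightforward once the sums of squares are in hand. A Python sketch on synthetic data (not the brain-weight data, though it borrows the same 1 and 60 degrees of freedom) builds F from SSModel and SSError and gets its P-value from the upper tail of the F distribution:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
n = 62                                # gives df = 1 and 60, as in the ANOVA table
x = rng.uniform(0, 10, n)
y = 2 + 3 * x + rng.normal(0, 2, n)   # synthetic data, not the brain-weight data

b1, b0 = np.polyfit(x, y, 1)
y_hat = b0 + b1 * x
ss_model = ((y_hat - y.mean()) ** 2).sum()
ss_error = ((y - y_hat) ** 2).sum()

df_model, df_error = 1, n - 2
F = (ss_model / df_model) / (ss_error / df_error)    # MSModel / MSError
p_value = stats.f.sf(F, df_model, df_error)           # upper-tail area of F(1, 60)
print(F, p_value)
```

With a single predictor, this omnibus F equals the square of the slope's t statistic, so the two tests give identical P-values.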
Proportion of Variability Explained
The Coefficient of Determination (R²)
The coefficient of determination, or R² value, associated with a linear model is the proportional reduction in prediction uncertainty achieved by the regression model compared to the null model. I.e., it is the proportion of the variation (variance) in Y that is “explained”:
R² = SSModel / SSTotal                          (4)
This turns out to be just the square of the correlation! (Show this algebraically.)
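The claim that R² equals the squared correlation (in simple linear regression) is also easy to confirm numerically before proving it algebraically. A Python sketch on synthetic data:

```python
import numpy as np

rng = np.random.default_rng(6)
n = 40
x = rng.uniform(0, 10, n)
y = 1 + 0.8 * x + rng.normal(0, 1.5, n)   # synthetic data

b1, b0 = np.polyfit(x, y, 1)
y_hat = b0 + b1 * x

# R^2 via the sums of squares, formula (4)
r_squared = ((y_hat - y.mean()) ** 2).sum() / ((y - y.mean()) ** 2).sum()

# Squared sample correlation between x and y
r = np.corrcoef(x, y)[0, 1]
print(r_squared, r ** 2)   # the two agree for simple linear regression
```

The agreement is exact (up to floating-point error) whenever the model has a single predictor fit by least squares; with multiple predictors, R² instead equals the squared correlation between Y and Ŷ.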
Example: Restaurant Tips
library("Lock5Data"); library("mosaic")
data("RestaurantTips")
null.tip.model <- lm(Tip ~ 1, data = RestaurantTips)
tip.model.using.bill <- lm(Tip ~ Bill, data = RestaurantTips)

[Figure: scatterplot of Tip ($) vs. Total Bill ($), showing the flat null.tip.model line and the sloped tip.model.using.bill line.]
Example: Restaurant Tips
[Figure: histograms of residual tips. Null model: σ̂_ε² = 5.861. Bill model: σ̂_ε² = 0.953.]
Regression Summary
summary(brain.model)

Call:
lm(formula = log(brain.weight.grams) ~ log(body.weight.kilograms),
   data = BrainBodyWeight)

Residuals:
     Min       1Q   Median       3Q      Max
-1.71550 -0.49228 -0.06162  0.43597  1.94829

Coefficients:
                           Estimate Std. Error t value Pr(>|t|)
(Intercept)                 2.13479    0.09604   22.23   <2e-16 ***
log(body.weight.kilograms)  0.75169    0.02846   26.41   <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.6943 on 60 degrees of freedom
Multiple R-squared: 0.9208, Adjusted R-squared: 0.9195
F-statistic: 697.4 on 1 and 60 DF, p-value: < 2.2e-16