Logistic Regression for Survey Data
Professor Ron Fricker
Naval Postgraduate School, Monterey, California
Goals for this Lecture
• Introduction to logistic regression
  – Discuss when and why it is useful
  – Interpret output
    • Odds and odds ratios
  – Illustrate use with examples
• Show how to run in JMP
• Discuss other software for fitting linear and logistic regression models to complex survey data
Logistic Regression
• Logistic regression
  – Response (Y) is binary, representing event or not
  – Model, where $p_i = \Pr(Y_i = 1)$:
    $$\ln\left(\frac{p_i}{1 - p_i}\right) = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \cdots + \beta_k X_{ki}$$
• In surveys, useful for modeling:
  – Probability respondent says “yes” (or “no”)
    • Can also dichotomize other questions
  – Probability respondent in a (binary) class
Why Logistic Regression?
• Some reasons:
  – Resulting “S” curve fits many observed phenomena
  – Model follows the same general principles as linear regression
• Can estimate probability p of binary outcome
    $$\hat{p} = \frac{\exp(\hat\beta_0 + \hat\beta_1 x_1 + \hat\beta_2 x_2 + \cdots + \hat\beta_k x_k)}{1 + \exp(\hat\beta_0 + \hat\beta_1 x_1 + \hat\beta_2 x_2 + \cdots + \hat\beta_k x_k)}$$
  – Estimates of p bounded between 0 and 1
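The step from the log-odds model to the probability formula is worth writing out once. The short derivation below uses $\eta$ as shorthand for the linear predictor; that symbol is an assumption introduced here for brevity and does not appear in the slides.

```latex
\begin{align*}
\ln\!\left(\frac{p}{1-p}\right) &= \eta,
  \qquad \eta = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k \\
\frac{p}{1-p} &= e^{\eta} \\
p &= e^{\eta}(1 - p) \\
p\,(1 + e^{\eta}) &= e^{\eta} \\
p &= \frac{e^{\eta}}{1 + e^{\eta}}
\end{align*}
```

Because $e^{\eta} > 0$ for every real $\eta$, the final ratio is always strictly between 0 and 1, which is why the estimates of p stay bounded.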
Linear Regression with Binary Ys
• Example: modeling presence or absence of coronary heart disease (CHD) as a function of age
• Data looks like this:
    ID   Age   CHD
    1    20    0
    2    23    0
    3    24    0
    4    25    0
    5    25    1
    6    26    0
    7    26    0
    8    28    0
    .    .     .
  – 100 obs
  – min age = 20
  – max age = 69
  – 43 w/ CHD
Modeling CHD Existence
• Imagine each subject flips a coin: heads = CHD, tails = no CHD
• Each coin has a different probability of heads related to subject’s age
• Only observe existence of CHD
  – y = 1, has CHD; y = 0, does not
• We want to model the chance of getting CHD as a function of age
Proportion with CHD by Age
    Age Group    n     CHD Absent   CHD Present   Proportion
    20-29        10    9            1             0.10
    30-34        15    13           2             0.13
    35-39        12    9            3             0.25
    40-44        15    10           5             0.33
    45-49        13    7            6             0.46
    50-54        8     3            5             0.63
    55-59        17    4            13            0.76
    60-69        10    2            8             0.80
    Total        100   57           43            0.43
Plotting the Proportions
[Figure: proportion w/ CHD (0 to 1) plotted against mean group age (20 to 70)]
Interpreting Model Results
[Figure: fitted curve of p(CHD), from 0 to 1, against Age, from 10 to 90]
If age is 50 years, then the probability of CHD is about 0.56
Logistic Regression: The Picture
[Figure: probability of CHD against Age (0 to 100), showing the data and the fitted curve p(age)]
Where Logistic Regression Fits
                                      Independent or Predictor Variable
    Dependent or Response Variable    Continuous             Categorical
    Continuous                        Linear regression      Linear regression w/ dummy variables
    Categorical                       Logistic regression    Logistic regression w/ dummy variables
Logistic Regression in JMP
• Fit much like multiple regression: Analyze > Fit Model
  – Fill in Y with nominal binary dependent variable
  – Put Xs in model by highlighting and then clicking “Add”
    • Use “Remove” to take out Xs
  – Click “Run Model” when done
• Takes care of missing values and non-numeric data automatically
Estimating the Parameters
• JMP estimates βs via maximum likelihood
• Given estimated βs, probabilities estimated as
    $$\hat{p} = \frac{\exp(\hat\beta_0 + \hat\beta_1 x_1 + \hat\beta_2 x_2 + \cdots + \hat\beta_k x_k)}{1 + \exp(\hat\beta_0 + \hat\beta_1 x_1 + \hat\beta_2 x_2 + \cdots + \hat\beta_k x_k)}$$
• Calculating probabilities in JMP is easy
  – After Fit Model, red triangle > Save Probability Formula
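To make "estimates βs via maximum likelihood" concrete, here is a minimal Python sketch of Newton-Raphson fitting for a one-predictor logistic model. It is not JMP's implementation. It uses the grouped counts from the "Proportion with CHD by Age" slide and assumes each group can be represented by the midpoint of its age range, since the 100 individual records are not reproduced in the slides. The estimates will therefore differ somewhat from a fit to the individual data; the function name `fit_logistic` is illustrative.

```python
import math

# (assumed midpoint age, n, number with CHD) for each age group on the slide.
# ASSUMPTION: group midpoints stand in for the individual ages.
groups = [
    (24.5, 10, 1), (32.0, 15, 2), (37.0, 12, 3), (42.0, 15, 5),
    (47.0, 13, 6), (52.0, 8, 5), (57.0, 17, 13), (64.5, 10, 8),
]

def fit_logistic(groups, iterations=25):
    """Newton-Raphson maximum likelihood for ln(p/(1-p)) = b0 + b1*age."""
    b0, b1 = 0.0, 0.0
    for _ in range(iterations):
        # Score vector (g) and 2x2 information matrix (h) of the
        # binomial log-likelihood at the current estimates.
        g0 = g1 = h00 = h01 = h11 = 0.0
        for age, n, events in groups:
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * age)))
            resid = events - n * p
            w = n * p * (1.0 - p)
            g0 += resid
            g1 += resid * age
            h00 += w
            h01 += w * age
            h11 += w * age * age
        # Newton step: solve h * delta = g by Cramer's rule.
        det = h00 * h11 - h01 * h01
        d0 = (h11 * g0 - h01 * g1) / det
        d1 = (h00 * g1 - h01 * g0) / det
        b0 += d0
        b1 += d1
        if abs(d0) + abs(d1) < 1e-10:
            break
    return b0, b1

b0, b1 = fit_logistic(groups)
print("intercept:", round(b0, 3), "slope:", round(b1, 3))
```

A useful check on any such fit with an intercept is that the fitted probabilities sum to the observed number of events, here 43.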
Probability, Odds, and Log Odds
• Probability (p)
  – Number between 0 and 1
  – Example: Pr(Red Sox win next World Series) = 5/8 = 0.625
• Odds: p/(1 - p)
  – Any number > 0
  – Example: Odds Red Sox win World Series are 5/3 = 1.667
• Log odds: ln(p/(1 - p))
  – Any number from -∞ to +∞
  – Log odds is sometimes called the “logit”
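The three scales are easy to move between in code. The sketch below uses the Red Sox example from the slide; the function names are illustrative, not from any particular library.

```python
import math

def odds_from_prob(p):
    """Odds p / (1 - p) for a probability strictly between 0 and 1."""
    return p / (1.0 - p)

def log_odds_from_prob(p):
    """Log odds (logit) of a probability."""
    return math.log(odds_from_prob(p))

def prob_from_log_odds(logit):
    """Inverse of the logit: maps any real number back into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-logit))

p = 5 / 8                        # probability from the slide
odds = odds_from_prob(p)         # equals 5/3
logit = log_odds_from_prob(p)    # natural log of 5/3
print(p, odds, logit, prob_from_log_odds(logit))
```

Converting the logit back with `prob_from_log_odds` recovers the original probability, which is the same inversion the logistic regression formula performs.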
Interpreting the βs
[Figure: fitted-model output with callouts marking the “slope,” its p-value, and the response as log odds of having CHD]
• Slope is positive and significant
  – Increasing age means higher probability of coronary heart disease
  – Increase Age by 1 year and log odds of CHD increases by 0.11
  – No t-test, chi-square test instead
    • p-value still means the same thing
Final Model and Results
    $$\hat{p}(\text{CHD}) = \frac{\exp(-5.31 + 0.111\,x_{\text{age}})}{1 + \exp(-5.31 + 0.111\,x_{\text{age}})}$$
Age can be any (positive) number and answer still makes sense
[Figure: fitted curve of p(CHD), from 0 to 1, against Age, from 10 to 90]
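The fitted model can be evaluated directly. The sketch below plugs the slide's coefficients (-5.31 and 0.111) into the formula and also exponentiates the slope to express it as a per-year multiplier on the odds; the names `p_chd` and `odds_multiplier` are illustrative.

```python
import math

B0, B1 = -5.31, 0.111   # intercept and age slope from the fitted model

def p_chd(age):
    """Estimated probability of CHD at a given age under the fitted model."""
    eta = B0 + B1 * age
    return math.exp(eta) / (1.0 + math.exp(eta))

for age in (30, 50, 70):
    print(age, round(p_chd(age), 2))

# Each additional year of age multiplies the odds of CHD by exp(B1).
odds_multiplier = math.exp(B1)
print("odds multiplier per year:", round(odds_multiplier, 3))
```

At age 50 this reproduces the value of about 0.56 quoted earlier, and the output stays between 0 and 1 for any age supplied.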
Odds Ratios – An Example
• An odds ratio is, literally, the ratio of two odds
  – Example from some recent (non-survey) work:
    • Odds IAer retained = 2.01
    • Odds non-IAer retained = 1.55
    • Odds ratio = 2.01 / 1.55 = 1.30
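Interpreting the Slope of an Indicator Variable
• Let $x_1$ be an indicator variable
  – Say, $x_1 = 1$ means male and $x_1 = 0$ means female
• Consider the ratio of two logistic regression models, one for males and one for females:
    $$\frac{\ln\left(\dfrac{p_i \mid \text{male}}{1 - p_i \mid \text{male}}\right)}{\ln\left(\dfrac{p_i \mid \text{female}}{1 - p_i \mid \text{female}}\right)} = \frac{\beta_0 + \beta_1 + \beta_2 X_{2i} + \cdots + \beta_k X_{ki}}{\beta_0 + \beta_2 X_{2i} + \cdots + \beta_k X_{ki}}$$
• Exponentiate numerator and denominator separately, which turns each log odds into odds, so the ratio becomes the odds ratio (O.R.):
    $$\frac{\exp(\beta_0)\exp(\beta_1)\exp(\beta_2 X_{2i})\cdots\exp(\beta_k X_{ki})}{\exp(\beta_0)\exp(\beta_2 X_{2i})\cdots\exp(\beta_k X_{ki})} = \exp(\beta_1) = \text{O.R.}$$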
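A quick numerical check shows that the male-to-female odds ratio equals exp(β1) no matter what value the other predictors take. The coefficients below are hypothetical, chosen only for illustration; they are not estimates from the lecture.

```python
import math

# HYPOTHETICAL coefficients (not from the lecture): b1 multiplies the
# male indicator x1, b2 multiplies one other predictor x2.
b0, b1, b2 = -1.0, 0.4, 0.05

def odds(x1, x2):
    """Odds p / (1 - p) implied by the logistic model at (x1, x2)."""
    return math.exp(b0 + b1 * x1 + b2 * x2)

for x2 in (0.0, 10.0, 25.0):
    ratio = odds(1, x2) / odds(0, x2)   # male odds over female odds
    print(x2, ratio, math.exp(b1))
```

The ratio is the same in every row because the terms that do not involve the indicator cancel, exactly as in the exponentiated expression.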
Example: Using Logistic Regression in NPS New Student Survey
• Dichotomize Q1 into “satisfied” (4 or 5) and “not satisfied” (1, 2, or 3)
• Model satisfied on Gender and Type Student
Compare the Output to Raw Data
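Regression in Complex Surveys
• Parameters are fit to minimize the sum of squared errors over the population:
    $$SSE = \sum_{i=1}^{N} \left[ y_i - (B_0 + B_1 x_i) \right]^2$$
• Resulting estimators, with sampling weights $w_i$ and sample $S$:
    $$\hat{B}_1 = \frac{\sum_{i \in S} w_i x_i y_i - \dfrac{\left(\sum_{i \in S} w_i x_i\right)\left(\sum_{i \in S} w_i y_i\right)}{\sum_{i \in S} w_i}}{\sum_{i \in S} w_i x_i^2 - \dfrac{\left(\sum_{i \in S} w_i x_i\right)^2}{\sum_{i \in S} w_i}}
    \quad\text{and}\quad
    \hat{B}_0 = \frac{\sum_{i \in S} w_i y_i - \hat{B}_1 \sum_{i \in S} w_i x_i}{\sum_{i \in S} w_i}$$
• Still need to estimate standard errors…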
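The weighted point estimators are simple to compute by hand. The sketch below implements them in plain Python on a small hypothetical sample (the x, y, and weight values are made up for illustration). It produces point estimates only and does not address the standard errors, which need the survey design.

```python
def weighted_line(x, y, w):
    """Survey-weighted intercept and slope for a single predictor."""
    sw = sum(w)
    swx = sum(wi * xi for wi, xi in zip(w, x))
    swy = sum(wi * yi for wi, yi in zip(w, y))
    swxy = sum(wi * xi * yi for wi, xi, yi in zip(w, x, y))
    swxx = sum(wi * xi * xi for wi, xi in zip(w, x))
    b1 = (swxy - swx * swy / sw) / (swxx - swx ** 2 / sw)
    b0 = (swy - b1 * swx) / sw
    return b0, b1

# HYPOTHETICAL toy sample: y is exactly 1 + 2x, with unequal weights.
x = [1.0, 2.0, 3.0, 4.0]
y = [3.0, 5.0, 7.0, 9.0]
w = [10.0, 20.0, 5.0, 15.0]

b0, b1 = weighted_line(x, y, w)
print(b0, b1)
```

Because the toy points lie exactly on a line, any set of positive weights recovers that line; with real survey data the weights change the estimates, which is the reason for using them.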
Using SAS for Regression
• SAS procedures for regression assuming SRS:
  – PROC REG
  – PROC LOGISTIC
• In SAS v9.1 for complex surveys:
  – PROC SURVEYREG
  – PROC SURVEYLOGISTIC
• See http://support.sas.com/onlinedoc/913/docMainpage.jsp
Using Stata for Regression
• Stata 9: SVY procedures for regression include
  – svy:regress
  – svy:logistic
  – svy:logit
• See www.stata.com/stata9/svy.html for more detail
Using R / S+ for Regression
• ‘survey’ package by Thomas Lumley
  – Must install as library for S+ or R
  – Copy up on Blackboard
• Has svyglm for generalized linear models
• If it works like the usual glm in S+, it can do linear and logistic modeling
  – But I need to look more closely at it…
• See http://faculty.washington.edu/tlumley/survey/
What We Have Just Learned
• Introduced logistic regression
  – Discussed when and why it is useful
  – Interpreted output
    • Odds and odds ratios
  – Illustrated use with examples
• Showed how to run in JMP
• Discussed other software for fitting linear and logistic regression models to complex survey data