Logistic Regression for Survey Data (Professor Ron Fricker, Naval Postgraduate School)


  1. Logistic Regression for Survey Data. Professor Ron Fricker, Naval Postgraduate School, Monterey, California

  2. Goals for this Lecture
     • Introduction to logistic regression
       – Discuss when and why it is useful
       – Interpret output
         • Odds and odds ratios
       – Illustrate use with examples
     • Show how to run in JMP
     • Discuss other software for fitting linear and logistic regression models to complex survey data

  3. Logistic Regression
     • Logistic regression
       – Response (Y) is binary, representing event or not
       – Model, where p_i = Pr(Y_i = 1):
         $$\ln\left(\frac{p_i}{1 - p_i}\right) = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \cdots + \beta_k X_{ki}$$
     • In surveys, useful for modeling:
       – Probability respondent says “yes” (or “no”)
         • Can also dichotomize other questions
       – Probability respondent in a (binary) class
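The left-hand side of the model above is the log odds of the event, and the right-hand side is an ordinary linear predictor. A minimal Python sketch of the two pieces follows; the coefficient and covariate values are hypothetical, chosen only to make the arithmetic easy to follow.

```python
import math

def logit(p):
    """Log odds ln(p / (1 - p)) for a probability strictly between 0 and 1."""
    return math.log(p / (1.0 - p))

def linear_predictor(betas, xs):
    """beta_0 + beta_1*x_1 + ... + beta_k*x_k; betas has one more entry than xs."""
    return betas[0] + sum(b * x for b, x in zip(betas[1:], xs))

# Hypothetical coefficients and covariate values, for illustration only.
betas = [-1.0, 0.5, 0.25]
xs = [2.0, 4.0]

# The model states that logit(p_i) equals this linear predictor.
eta = linear_predictor(betas, xs)
print(eta)
print(logit(0.5))  # a probability of one half corresponds to log odds of zero
```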

  4. Why Logistic Regression?
     • Some reasons:
       – Resulting “S” curve fits many observed phenomena
       – Model follows the same general principles as linear regression
     • Can estimate probability p of binary outcome:
       $$\hat{p} = \frac{\exp(\hat\beta_0 + \hat\beta_1 x_1 + \hat\beta_2 x_2 + \cdots + \hat\beta_k x_k)}{1 + \exp(\hat\beta_0 + \hat\beta_1 x_1 + \hat\beta_2 x_2 + \cdots + \hat\beta_k x_k)}$$
       – Estimates of p bounded between 0 and 1
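The estimated probability above can be computed directly from a set of coefficient estimates. The sketch below is one way to do it in Python; the function name and the example coefficients are assumptions made for illustration. It uses an algebraically equivalent form of exp(η)/(1 + exp(η)) so that very large or very small linear predictors do not overflow.

```python
import math

def predicted_probability(betas, xs):
    """Evaluate exp(eta) / (1 + exp(eta)) for eta = b0 + b1*x1 + ... + bk*xk."""
    eta = betas[0] + sum(b * x for b, x in zip(betas[1:], xs))
    if eta >= 0:
        # Same quantity, written as 1 / (1 + exp(-eta)) to avoid overflow.
        return 1.0 / (1.0 + math.exp(-eta))
    z = math.exp(eta)
    return z / (1.0 + z)

# Hypothetical one-predictor model evaluated over a wide range of x values.
for x in (-1000.0, -2.0, 0.0, 2.0, 1000.0):
    print(x, predicted_probability([0.0, 1.0], [x]))
```

Whatever the value of the linear predictor, the result stays between 0 and 1, which is the property the slide highlights.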

  5. Linear Regression with Binary Ys
     • Example: modeling presence or absence of coronary heart disease (CHD) as a function of age
     • Data look like this:

         ID   Age   CHD
          1    20     0
          2    23     0
          3    24     0
          4    25     0
          5    25     1
          6    26     0
          7    26     0
          8    28     0
          .     .     .

       – 100 obs
       – min age = 20
       – max age = 69
       – 43 w/ CHD

  6. Modeling CHD Existence
     • Imagine each subject flips a coin: Heads = CHD, Tails = no CHD
     • Each coin has a different probability of heads, related to the subject’s age
     • Only observe existence of CHD
       – y = 1, has CHD; y = 0, does not
     • We want to model the chance of getting CHD as a function of age

  7. Proportion with CHD by Age

       Age Group     n   CHD Absent   CHD Present   Proportion
       20-29        10        9            1           0.10
       30-34        15       13            2           0.13
       35-39        12        9            3           0.25
       40-44        15       10            5           0.33
       45-49        13        7            6           0.46
       50-54         8        3            5           0.63
       55-59        17        4           13           0.76
       60-69        10        2            8           0.80
       Total       100       57           43           0.43
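The proportion column is simply the number with CHD divided by the group size. A short Python sketch that recomputes the column and the totals from the absent/present counts in the table (the variable names are illustrative):

```python
# (age group, CHD absent, CHD present), copied from the table above.
groups = [
    ("20-29", 9, 1), ("30-34", 13, 2), ("35-39", 9, 3), ("40-44", 10, 5),
    ("45-49", 7, 6), ("50-54", 3, 5), ("55-59", 4, 13), ("60-69", 2, 8),
]

rows = []
for label, absent, present in groups:
    n = absent + present
    rows.append((label, n, absent, present, present / n))

total_n = sum(r[1] for r in rows)
total_present = sum(r[3] for r in rows)
overall = total_present / total_n

# Print three decimals so the unrounded proportions are visible.
for label, n, absent, present, prop in rows:
    print(f"{label:<6} {n:>4} {absent:>4} {present:>4} {prop:.3f}")
print(f"{'Total':<6} {total_n:>4} {total_n - total_present:>4} {total_present:>4} {overall:.3f}")
```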

  8. Plotting the Proportions
     [Figure: proportion w/ CHD (0 to 1) plotted against mean group age (20 to 70)]

  9. Interpreting Model Results
     [Figure: fitted curve of p(CHD) (0 to 1) against age (10 to 90)]
     • If age is 50 years, then the probability of CHD is about 0.56

  10. Logistic Regression: The Picture
      [Figure: observed data and fitted curve p(age); y-axis: probability of CHD; x-axis: age (0 to 100)]

  11. Where Logistic Regression Fits

                                 Independent or Predictor Variable
       Dependent or Response     Continuous             Categorical
       Continuous                Linear regression      Linear regression w/ dummy variables
       Categorical               Logistic regression    Logistic regression w/ dummy variables

  12. Logistic Regression in JMP
      • Fit much like multiple regression: Analyze > Fit Model
        – Fill in Y with nominal binary dependent variable
        – Put Xs in model by highlighting and then clicking “Add”
          • Use “Remove” to take out Xs
        – Click “Run Model” when done
      • Takes care of missing values and non-numeric data automatically

  13. Estimating the Parameters
      • JMP estimates βs via maximum likelihood
      • Given estimated βs, probabilities estimated as
        $$\hat{p} = \frac{\exp(\hat\beta_0 + \hat\beta_1 x_1 + \hat\beta_2 x_2 + \cdots + \hat\beta_k x_k)}{1 + \exp(\hat\beta_0 + \hat\beta_1 x_1 + \hat\beta_2 x_2 + \cdots + \hat\beta_k x_k)}$$
      • Calculating probabilities in JMP is easy
        – After Fit Model, red triangle > Save Probability Formula
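To give a feel for what maximum likelihood estimation involves, here is a rough Python sketch that fits a one-predictor logistic model by Newton-Raphson. It is not JMP's algorithm, only a common textbook approach. It uses the grouped CHD counts from the age-group table, and because individual ages are not listed, each group is represented by the midpoint of its age interval. That midpoint choice is an assumption of this sketch, so the estimates will not exactly match a fit to the individual-level data.

```python
import math

# (assumed representative age = interval midpoint, group size n, number with CHD)
data = [(24.5, 10, 1), (32.0, 15, 2), (37.0, 12, 3), (42.0, 15, 5),
        (47.0, 13, 6), (52.0, 8, 5), (57.0, 17, 13), (64.5, 10, 8)]

def fit_logistic(data, iterations=25, tol=1e-10):
    """Newton-Raphson for the binomial log-likelihood with intercept b0 and slope b1."""
    b0, b1 = 0.0, 0.0
    for _ in range(iterations):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, n, y in data:
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
            resid = y - n * p            # observed minus expected count
            w = n * p * (1.0 - p)        # binomial variance weight
            g0 += resid                  # score (gradient) components
            g1 += resid * x
            h00 += w                     # information matrix entries
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        d0 = (h11 * g0 - h01 * g1) / det  # solve the 2x2 Newton system
        d1 = (h00 * g1 - h01 * g0) / det
        b0 += d0
        b1 += d1
        if abs(d0) + abs(d1) < tol:
            break
    return b0, b1

b0_hat, b1_hat = fit_logistic(data)
print(b0_hat, b1_hat)
```

The slope estimate should come out positive, in line with the rising proportions in the table.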

  14. Probability, Odds, and Log Odds
      • Probability (p)
        – Number between 0 and 1
        – Example: Pr(Red Sox win next World Series) = 5/8 = 0.625
      • Odds: p/(1 − p)
        – Any number > 0
        – Example: Odds Red Sox win World Series are 5/3 = 1.667
      • Log odds: ln(p/(1 − p))
        – Any number from −∞ to +∞
        – Log odds is sometimes called the “logit”
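The three scales are interchangeable, and the conversions are one-liners. A small Python sketch using the Red Sox example (the function names are illustrative):

```python
import math

def odds_from_probability(p):
    """Odds p / (1 - p) for a probability strictly less than 1."""
    return p / (1.0 - p)

def probability_from_odds(odds):
    """Invert the odds back to a probability."""
    return odds / (1.0 + odds)

def log_odds(p):
    """The logit: natural log of the odds."""
    return math.log(odds_from_probability(p))

p_win = 5 / 8                       # probability from the example
odds_win = odds_from_probability(p_win)   # 5/3, about 1.667
logit_win = log_odds(p_win)

print(p_win, odds_win, logit_win)
print(probability_from_odds(odds_win))    # round trip back to the probability
```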

  15. Interpreting the βs
      [Figure: JMP parameter estimates output, annotated with “slope”, p-value, and “log odds of having CHD”]
      • Slope is positive and significant
        – Increasing age means higher probability of coronary heart disease
        – Increase Age by 1 year and log odds of CHD increases by 0.11
        – No t-test, chi-square test instead
          • p-value still means the same thing

  16. Final Model and Results
      $$\hat{p}(\text{CHD}) = \frac{\exp(-5.31 + 0.111\,x_{\text{age}})}{1 + \exp(-5.31 + 0.111\,x_{\text{age}})}$$
      • Age can be any (positive) number and answer still makes sense
      [Figure: fitted curve of p(CHD) (0 to 1) against age (10 to 90)]
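Plugging an age into the fitted model is a quick check on the interpretation. The Python sketch below uses the coefficients from the slide (−5.31 and 0.111); the function and variable names are illustrative. It also computes two derived quantities that follow from the model form: the factor by which the odds are multiplied per additional year, exp(0.111), and the age at which the fitted probability crosses one half.

```python
import math

B0, B1 = -5.31, 0.111   # intercept and age slope from the final model

def p_chd(age):
    """Fitted probability of CHD at a given age."""
    eta = B0 + B1 * age
    return math.exp(eta) / (1.0 + math.exp(eta))

p50 = p_chd(50)                          # about 0.56
odds_multiplier_per_year = math.exp(B1)  # odds ratio for a one-year increase in age
age_at_half = -B0 / B1                   # age where the log odds equal zero

print(round(p50, 2))
print(odds_multiplier_per_year)
print(age_at_half)
```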

  17. Odds Ratios – An Example
      • An odds ratio is, literally, the ratio of two odds
        – Example from some recent (non-survey) work:
          • Odds IAer retained = 2.01
          • Odds non-IAer retained = 1.55
          • Odds ratio = 2.01/1.55 = 1.30
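The arithmetic behind the example can be sketched in a few lines of Python. The odds are the values quoted on the slide; converting them back to retention probabilities is an extra step added here for intuition, using p = odds/(1 + odds).

```python
odds_ia = 2.01       # odds an IAer is retained (from the slide)
odds_non_ia = 1.55   # odds a non-IAer is retained (from the slide)

odds_ratio = odds_ia / odds_non_ia

# Implied retention probabilities, p = odds / (1 + odds).
p_ia = odds_ia / (1.0 + odds_ia)
p_non_ia = odds_non_ia / (1.0 + odds_non_ia)

print(round(odds_ratio, 2))
print(p_ia, p_non_ia)
```

Note that the ratio of the two probabilities is not the same as the odds ratio; the odds ratio compares odds, not probabilities.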

  18. Interpreting the Slope of an Indicator Variable
      • Let x_1 be an indicator variable
        – Say, x_1 = 1 means male and x_1 = 0 means female
      • Consider the ratio of two logistic regression models, one for males and one for females:
        $$\frac{\ln\left(\dfrac{p_i|\text{male}}{1 - p_i|\text{male}}\right)}{\ln\left(\dfrac{p_i|\text{female}}{1 - p_i|\text{female}}\right)} = \frac{\beta_0 + \beta_1 + \beta_2 X_{2i} + \cdots + \beta_k X_{ki}}{\beta_0 + \beta_2 X_{2i} + \cdots + \beta_k X_{ki}}$$
      • Exponentiate numerator and denominator:
        $$\text{O.R.} = \frac{\exp(\beta_0)\exp(\beta_1)\exp(\beta_2 X_{2i}) \cdots \exp(\beta_k X_{ki})}{\exp(\beta_0)\exp(\beta_2 X_{2i}) \cdots \exp(\beta_k X_{ki})} = \exp(\beta_1)$$
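The cancellation above can be checked numerically: with all other covariates held fixed, the odds for x_1 = 1 divided by the odds for x_1 = 0 equal exp(β_1), whatever the other coefficients are. A small Python sketch follows; every coefficient and covariate value in it is hypothetical.

```python
import math

def odds(betas, xs):
    """Odds p/(1-p) = exp(b0 + b1*x1 + ... + bk*xk) under the logistic model."""
    eta = betas[0] + sum(b * x for b, x in zip(betas[1:], xs))
    return math.exp(eta)

# Hypothetical coefficients: intercept, indicator (x1), and two other covariates.
betas = [-0.4, 0.7, 0.2, -0.1]
other_covariates = [3.0, 1.5]   # held fixed for both groups

odds_male = odds(betas, [1.0] + other_covariates)    # x1 = 1
odds_female = odds(betas, [0.0] + other_covariates)  # x1 = 0

ratio = odds_male / odds_female
print(ratio, math.exp(betas[1]))   # the two values agree up to rounding error
```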

  19. Example: Using Logistic Regression in NPS New Student Survey
      • Dichotomize Q1 into “satisfied” (4 or 5) and “not satisfied” (1, 2, or 3)
      • Model “satisfied” as a function of Gender and Type Student

  20. Compare the Output to Raw Data

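  21. Regression in Complex Surveys
      • Parameters are fit to minimize the sum of squared errors over the population:
        $$SSE = \sum_{i=1}^{N} \left[ y_i - (B_0 + B_1 x_i) \right]^2$$
      • Resulting estimators, with sampling weights w_i and sample S:
        $$\hat{B}_1 = \frac{\sum_{i \in S} w_i x_i y_i - \dfrac{\left(\sum_{i \in S} w_i x_i\right)\left(\sum_{i \in S} w_i y_i\right)}{\sum_{i \in S} w_i}}{\sum_{i \in S} w_i x_i^2 - \dfrac{\left(\sum_{i \in S} w_i x_i\right)^2}{\sum_{i \in S} w_i}} \quad \text{and} \quad \hat{B}_0 = \frac{\sum_{i \in S} w_i y_i - \hat{B}_1 \sum_{i \in S} w_i x_i}{\sum_{i \in S} w_i}$$
      • Still need to estimate standard errors…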
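The weighted estimators above translate directly into code. The Python sketch below computes the point estimates only; it does not address the standard errors, which depend on the survey design. The data and weights are made up for illustration, with y chosen to lie exactly on the line 2 + 3x so that the result is easy to verify.

```python
def weighted_line(x, y, w):
    """Survey-weighted estimates (B0_hat, B1_hat) of the population regression line."""
    sw = sum(w)
    swx = sum(wi * xi for wi, xi in zip(w, x))
    swy = sum(wi * yi for wi, yi in zip(w, y))
    swxy = sum(wi * xi * yi for wi, xi, yi in zip(w, x, y))
    swxx = sum(wi * xi * xi for wi, xi in zip(w, x))
    b1 = (swxy - swx * swy / sw) / (swxx - swx ** 2 / sw)
    b0 = (swy - b1 * swx) / sw
    return b0, b1

# Hypothetical sample: y = 2 + 3x exactly, with unequal sampling weights.
x = [1.0, 2.0, 3.0, 4.0]
y = [5.0, 8.0, 11.0, 14.0]
w = [10.0, 20.0, 5.0, 40.0]

b0_hat, b1_hat = weighted_line(x, y, w)
print(b0_hat, b1_hat)
```

When every weight is equal, these formulas reduce to the ordinary least squares estimates.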

  22. Using SAS for Regression
      • SAS procedures for regression assuming SRS:
        – PROC REG
        – PROC LOGISTIC
      • In SAS v9.1 for complex surveys:
        – PROC SURVEYREG
        – PROC SURVEYLOGISTIC
      • See http://support.sas.com/onlinedoc/913/docMainpage.jsp

  23. Using Stata for Regression
      • Stata 9: SVY procedures for regression include
        – svy:regress
        – svy:logistic
        – svy:logit
      • See www.stata.com/stata9/svy.html for more detail

  24. Using R / S+ for Regression
      • ‘survey’ package by Thomas Lumley
        – Must install as library for S+ or R
        – Copy up on Blackboard
      • Has svyglm for generalized linear models
      • If it works like the usual glm in S+, it can do both linear and logistic modeling
        – But I need to look more closely at it…
      • See http://faculty.washington.edu/tlumley/survey/

  25. What We Have Just Learned
      • Introduced logistic regression
        – Discussed when and why it is useful
        – Interpreted output
          • Odds and odds ratios
        – Illustrated use with examples
      • Showed how to run in JMP
      • Discussed other software for fitting linear and logistic regression models to complex survey data
