ST 430/514 Introduction to Regression Analysis / Statistics for Management and the Social Sciences II

Special Topics

Some complex model-building problems can be handled using the linear regression approach covered up to this point: for example, piecewise regression, including piecewise linear regression and spline regression. Others require more general nonlinear approaches: for example, logistic and probit regression for binary responses.
Logistic Regression

Linear regression methods are used to evaluate the impact of various factors on a response, but when the response Y is binary (0 or 1), linear methods run into problems. Because E(Y | x) = P(Y = 1 | x), the linear regression model E(Y | x) = β_0 + β_1 x will often predict probabilities that are negative or greater than 1.
The most common alternative is based on modeling the log odds:

π(x) = P(Y = 1 | x)

π(x) / (1 − π(x)) = the odds

log[π(x) / (1 − π(x))] = the log odds, or logit.

In the logistic regression model, we assume

log[π(x) / (1 − π(x))] = β_0 + β_1 x_1 + · · · + β_k x_k.
Solving for π(x), we find

P(Y = 1 | x) = π(x) = exp(β_0 + β_1 x_1 + · · · + β_k x_k) / [1 + exp(β_0 + β_1 x_1 + · · · + β_k x_k)].

Consequently

P(Y = 0 | x) = 1 − π(x) = 1 / [1 + exp(β_0 + β_1 x_1 + · · · + β_k x_k)].

As a function of any x_j, π(x) changes smoothly between 0 and 1: increasing if β_j > 0, and decreasing if β_j < 0.
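As a quick numerical sketch of these formulas (the coefficients below are hypothetical, chosen only to illustrate the transform; R's built-in plogis() computes the same function):

```r
# Hypothetical coefficients, not from any fitted model
b0 <- -2
b1 <- 0.8

eta <- b0 + b1 * 3                  # linear predictor (log odds) at x = 3
pi_x <- exp(eta) / (1 + exp(eta))   # P(Y = 1 | x)

pi_x                                # about 0.6
all.equal(pi_x, plogis(eta))        # plogis() is the logistic cdf
all.equal(1 - pi_x, 1 / (1 + exp(eta)))  # P(Y = 0 | x)
```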
The function

F(x) = exp(x) / (1 + exp(x))

is the cdf of the logistic distribution:

curve(exp(x) / (1 + exp(x)), from = -5, to = 5)

It is similar to the cdf of the normal distribution with the matching variance (π²/3):

curve(pnorm(x, 0, sqrt(pi^2/3)), add = TRUE, col = "red")
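The similarity can be quantified: over a fine grid, the two cdfs differ by at most a couple of hundredths:

```r
x <- seq(-5, 5, by = 0.01)

# logistic cdf vs. normal cdf with the same variance, pi^2 / 3
d <- max(abs(plogis(x) - pnorm(x, 0, sqrt(pi^2 / 3))))
d   # roughly 0.02
```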
Interpreting the parameters

The coefficient β_j measures the change in the log odds associated with a change of +1 in x_j, so e^{β_j} is the multiplicative change in the odds associated with the same change.

When x_j is an indicator variable, e^{β_j} is the odds ratio comparing the group where x_j = 1 with the group where x_j = 0; it is often interpreted as a relative risk that Y = 1, which is a good approximation when the event Y = 1 is rare.
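A small check that e^{β_j} multiplies the odds by the same factor whatever the baseline (the coefficient and baseline log odds here are hypothetical):

```r
bj <- 0.5                        # hypothetical coefficient beta_j
odds <- function(eta) exp(eta)   # odds = pi/(1 - pi) = exp(log odds)

eta0 <- -1                       # some baseline log odds
# the ratio of odds after a +1 change in x_j is exp(bj),
# no matter what the baseline eta0 is
odds(eta0 + bj) / odds(eta0)
exp(bj)                          # same value
```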
Example: fraud detection

Data are credit card transactions. The response is Y, where Y = 1 if the transaction is fraudulent and Y = 0 otherwise.

The predictors are information about the card holder (credit limit, etc.) and about the transaction (amount, etc.). The fitted π̂(x) can be used to predict the probability that a new transaction will prove to be fraudulent.
Estimation

The usual approach to estimating β_0, β_1, . . . , β_k is maximum likelihood. It is implemented in proc logistic and proc genmod in SAS, and in the glm() function in R.

The names "genmod" and "glm" are abbreviations of generalized linear model, of which logistic regression is a particular case.
Example: collusive bidding in Florida road construction.

bids <- read.table("Text/Exercises&Examples/ROADBIDS.txt", header = TRUE)
pairs(bids)

Using glm() is very similar to using lm():

g <- glm(STATUS ~ NUMBIDS + DOTEST, bids, family = binomial)
summary(g)

The argument family = binomial specifies that the response, STATUS, has the binomial (strictly, the Bernoulli) distribution.
The output is also similar to that of lm(). Note that instead of a column of t-values, there is a column of z-values.

Like a t-value, a z-value is the ratio of a parameter estimate to its standard error. The label indicates that you test the significance of the parameter using the normal distribution, not the t-distribution.
Because this is not a least squares fit, there are no sums of squares; deviance plays a similar role. For example, to test the utility of the model, use the statistic

Null deviance − Residual deviance = 21.756

which, under H_0: β_1 = β_2 = 0, is χ²-distributed with 30 − 28 = 2 degrees of freedom. P(χ²_2 ≥ 21.756) = 1.9 × 10⁻⁵, so we reject H_0.
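The quoted p-value can be reproduced directly with pchisq():

```r
stat <- 21.756   # null deviance minus residual deviance, from summary(g)

# upper-tail probability of the chi-squared distribution with 2 df
pchisq(stat, df = 2, lower.tail = FALSE)   # about 1.9e-05
```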
You also use deviance to compare nested models, such as the first-order model

log[π(x) / (1 − π(x))] = β_0 + β_1 x_1 + β_2 x_2

against the complete second-order model

log[π(x) / (1 − π(x))] = β_0 + β_1 x_1 + β_2 x_2 + β_3 x_1 x_2 + β_4 x_1² + β_5 x_2².

summary(glm(STATUS ~ NUMBIDS * DOTEST + I(NUMBIDS^2) + I(DOTEST^2), bids, family = binomial))
To test H_0: β_3 = β_4 = β_5 = 0, the test statistic is

(Residual deviance for reduced model) − (Residual deviance for complete model).

Under H_0, this statistic has the χ²-distribution with 28 − 25 = 3 degrees of freedom.

Here we have 22.843 − 13.820 = 9.023, which we compare with the χ²_3-distribution. We find P(χ²_3 ≥ 9.023) = .029, so we would reject H_0 at α = .05 but not at α = .01. That is, there is some evidence that we need the second-order terms.
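The same calculation in R:

```r
# reduced-model minus complete-model residual deviance
stat <- 22.843 - 13.820   # = 9.023

pchisq(stat, df = 3, lower.tail = FALSE)   # about 0.029
```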
Prediction

Suppose that a new auction has 4 bidders, and the difference between the winning bid and the engineer's estimate is 30%. What is the probability that the auction was collusive?

predict(g, data.frame(NUMBIDS = 4, DOTEST = 30), type = "response", se.fit = TRUE)

The predicted probability is .85, but the standard error of .13 shows that it is not very precisely determined. If you do not specify type = "response", the prediction is on the scale of the log odds, not the probability itself.
Do not use the standard error of the predicted probability to construct a confidence interval! Instead, construct a confidence interval for the log odds and transform it into a corresponding confidence interval for the probability:

p <- predict(g, data.frame(NUMBIDS = 4, DOTEST = 30), se.fit = TRUE)
logOdds <- p$fit + qnorm(c(.025, .5, .975)) * p$se.fit
exp(logOdds) / (1 + exp(logOdds))
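A sketch of the transformation with made-up numbers (fit and se below are hypothetical values standing in for p$fit and p$se.fit):

```r
fit <- 1.7   # hypothetical predicted log odds
se  <- 1.0   # hypothetical standard error on the log-odds scale

# 95% interval endpoints and point estimate on the log-odds scale
logOdds <- fit + qnorm(c(.025, .5, .975)) * se

# back-transform through the logistic cdf; the result always lies in (0, 1)
plogis(logOdds)
```

Unlike an interval built as probability ± 1.96 × se, an interval constructed this way can never extend below 0 or above 1.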