Advanced Section #5: Generalized Linear Models: Logistic Regression and Beyond
Nick Stern
CS109A Introduction to Data Science
Pavlos Protopapas, Kevin Rader, and Chris Tanner
Outline
1. Motivation
• Limitations of linear regression
2. Anatomy
• Exponential Dispersion Family (EDF)
• Link function
3. Maximum Likelihood Estimation for GLMs
• Fisher Scoring
Motivation
Motivation
Linear regression framework: $y_i = x_i^\top \beta + \epsilon_i$
Assumptions:
1. Linearity: Linear relationship between expected value and predictors
2. Normality: Residuals are normally distributed about the expected value
3. Homoskedasticity: Residuals have constant variance $\sigma^2$
4. Independence: Observations are independent of one another
Motivation
Expressed mathematically …
• Linearity: $\mathbb{E}[y_i] = x_i^\top \beta$
• Normality: $y_i \sim \mathcal{N}(x_i^\top \beta, \sigma^2)$
• Homoskedasticity: $\sigma^2$ (instead of $\sigma_i^2$)
• Independence: $p(y_j \mid y_k) = p(y_j)$ for $j \neq k$
Motivation
What happens when our assumptions break down?
Motivation
We have options within the framework of linear regression:
• Nonlinearity: transform X or Y (polynomial regression)
• Heteroskedasticity: weight observations (WLS regression)
Motivation
But assuming Normality can be pretty limiting … Consider modeling the following random variables:
• Whether a coin flip is heads or tails (Bernoulli)
• Counts of species in a given area (Poisson)
• Time between stochastic events that occur w/ constant rate (gamma)
• Vote counts for multiple candidates in a poll (multinomial)
Motivation
We can extend the framework of linear regression. Enter: Generalized Linear Models
Relaxes:
• Normality assumption
• Homoskedasticity assumption
Anatomy
Anatomy
Two adjustments must be made to turn a linear model (LM) into a GLM:
1. Assume the response variable comes from a family of distributions called the exponential dispersion family (EDF).
2. The relationship between the expected value and the predictors is expressed through a link function.
Anatomy – EDF Family
The EDF family contains: Normal, Poisson, gamma, and more! The probability density function looks like this:
$f(y_i \mid \theta_i) = \exp\left( \frac{y_i \theta_i - b(\theta_i)}{\phi_i} + c(y_i, \phi_i) \right)$
where
$\theta$: "canonical parameter"
$\phi$: "dispersion parameter"
$b(\theta)$: "cumulant function"
$c(y, \phi)$: "normalization factor"
Anatomy – EDF Family
Example: representing the Bernoulli distribution in EDF form.
PDF of a Bernoulli random variable:
$f(y_i \mid p_i) = p_i^{y_i} (1 - p_i)^{1 - y_i}$
Taking the log and then exponentiating (to cancel each other out) gives:
$f(y_i \mid p_i) = \exp\left( y_i \log p_i + (1 - y_i) \log(1 - p_i) \right)$
Rearranging terms …
$f(y_i \mid p_i) = \exp\left( y_i \log \frac{p_i}{1 - p_i} + \log(1 - p_i) \right)$
Anatomy – EDF Family
Comparing:
$f(y_i \mid \theta_i) = \exp\left( \frac{y_i \theta_i - b(\theta_i)}{\phi_i} + c(y_i, \phi_i) \right)$ vs. $f(y_i \mid p_i) = \exp\left( y_i \log \frac{p_i}{1 - p_i} + \log(1 - p_i) \right)$
Choosing:
$\theta_i = \log \frac{p_i}{1 - p_i}$,  $b(\theta_i) = \log(1 + e^{\theta_i})$,  $\phi_i = 1$,  $c(y_i, \phi_i) = 0$
and we recover the EDF form of the Bernoulli distribution.
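As a quick sanity check, here is a minimal Python sketch (not part of the original slides; the function names are illustrative) confirming that the EDF form with these choices reproduces the standard Bernoulli PMF:

import numpy as np

def bernoulli_pmf(y, p):
    # Standard Bernoulli PMF: p^y * (1 - p)^(1 - y)
    return p**y * (1 - p)**(1 - y)

def bernoulli_edf(y, theta):
    # EDF form with b(theta) = log(1 + e^theta), phi = 1, c(y, phi) = 0
    return np.exp(y * theta - np.log(1 + np.exp(theta)))

p = 0.3
theta = np.log(p / (1 - p))  # canonical parameter corresponding to p
for y in (0, 1):
    assert np.isclose(bernoulli_pmf(y, p), bernoulli_edf(y, theta))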
Anatomy – EDF Family
The EDF family has some useful properties. Namely:
1. $\mathbb{E}[y_i] \equiv \mu_i = b'(\theta_i)$
2. $\mathrm{Var}(y_i) = \phi_i \, b''(\theta_i)$
(the proofs for these identities are in the notes)
Plugging in the values we obtained for Bernoulli, we get back:
$\mathbb{E}[y_i] = p_i$,  $\mathrm{Var}(y_i) = p_i (1 - p_i)$
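These identities can also be checked numerically for the Bernoulli case. A small sketch, assuming the cumulant $b(\theta) = \log(1 + e^{\theta})$ derived above, with finite differences standing in for the analytic derivatives:

import numpy as np

def b(theta):
    # Bernoulli cumulant function: b(theta) = log(1 + e^theta)
    return np.log(1 + np.exp(theta))

p = 0.3
theta = np.log(p / (1 - p))
h = 1e-5
b_prime = (b(theta + h) - b(theta - h)) / (2 * h)                     # approximates E[y] = p
b_double_prime = (b(theta + h) - 2 * b(theta) + b(theta - h)) / h**2  # approximates Var(y) = p(1 - p)

assert np.isclose(b_prime, p, atol=1e-6)
assert np.isclose(b_double_prime, p * (1 - p), atol=1e-4)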
Anatomy – Link Function
Time to talk about the link function
Anatomy – Link Function
Recall from linear regression that:
$\mu_i = x_i^\top \beta$
Does this work for the Bernoulli distribution?
$\mu_i = p_i = x_i^\top \beta$
Not quite: the right-hand side can be any real number, while $p_i$ must lie in $[0, 1]$.
Solution: wrap the expectation in a function called the link function:
$g(\mu_i) = x_i^\top \beta \equiv \eta_i$
*For the Bernoulli distribution, the link function is the "logit" function (hence "logistic" regression)
Anatomy – Link Function
Link functions are a choice, not a property. A good choice is:
1. Differentiable (implies "smoothness")
2. Monotonic (guarantees invertibility), typically increasing so that $\mu$ increases with $\eta$
3. Expands the range of $\mu$ to the entire real line
Example: logit function for Bernoulli
$g(\mu_i) = g(p_i) = \log \frac{p_i}{1 - p_i}$
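To make the logit concrete, here is a small sketch (illustrative code, not from the slides) of the link and its inverse, showing that it stretches means in (0, 1) across the real line and can be inverted:

import numpy as np

def logit(mu):
    # Link function: maps a mean in (0, 1) onto the whole real line
    return np.log(mu / (1 - mu))

def inv_logit(eta):
    # Inverse link (sigmoid): maps a linear predictor back into (0, 1)
    return 1 / (1 + np.exp(-eta))

mu = np.array([0.01, 0.25, 0.5, 0.75, 0.99])
eta = logit(mu)                          # spans roughly -4.6 to 4.6
assert np.allclose(inv_logit(eta), mu)   # monotonic, hence invertible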
Anatomy – Link Function
Logit function for Bernoulli looks familiar …
$g(p_i) = \log \frac{p_i}{1 - p_i} = \theta_i$
Choosing the link function by setting $\theta_i = \eta_i$ gives us what is known as the "canonical link function."
Note: $\mu_i = b'(\theta_i) \rightarrow \theta_i = b'^{-1}(\mu_i)$ (the derivative of the cumulant function must be invertible)
This choice of link, while not always effective, has some nice properties. Take STAT 149 to find out more!
Anatomy – Link Function
Here are some more examples (fun exercises at home)
Distribution $f(y_i \mid \theta_i)$ | Mean function $\mu_i = b'(\theta_i)$ | Canonical link $\theta_i = g(\mu_i)$
Normal | $\theta_i$ | $\mu_i$
Bernoulli/Binomial | $\frac{e^{\theta_i}}{1 + e^{\theta_i}}$ | $\log \frac{\mu_i}{1 - \mu_i}$
Poisson | $e^{\theta_i}$ | $\log(\mu_i)$
Gamma | $-\theta_i^{-1}$ | $-\mu_i^{-1}$
Inverse Gaussian | $(-2\theta_i)^{-1/2}$ | $-\frac{1}{2\mu_i^2}$
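One of those at-home exercises can be checked mechanically: each canonical link should invert its mean function. A minimal sketch (the pairs below are my own transcription of the table above):

import numpy as np

# (mean function, canonical link) pairs transcribed from the table above
pairs = {
    "Normal":           (lambda t: t,                           lambda m: m),
    "Bernoulli":        (lambda t: np.exp(t) / (1 + np.exp(t)), lambda m: np.log(m / (1 - m))),
    "Poisson":          (lambda t: np.exp(t),                   lambda m: np.log(m)),
    "Gamma":            (lambda t: -1 / t,                      lambda m: -1 / m),
    "Inverse Gaussian": (lambda t: (-2 * t) ** -0.5,            lambda m: -1 / (2 * m**2)),
}

theta = -0.7  # sample canonical parameter (negative so the Gamma and Inverse Gaussian means are valid)
for name, (mean_fn, link) in pairs.items():
    assert np.isclose(link(mean_fn(theta)), theta), name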
Maximum Likelihood Estimation
Maximum Likelihood Estimation
Recall from linear regression: we can estimate our parameters, $\theta$, by choosing those that maximize the likelihood, $L(y \mid \theta)$, of the data, where:
$L(y \mid \theta) = \prod_i p(y_i \mid \theta_i)$
In words: the likelihood is the probability of observing a set of N independent datapoints, given our assumptions about the generative process.
Maximum Likelihood Estimation
For GLMs we can plug in the PDF of the EDF family:
$L(y \mid \theta) = \prod_{i=1}^{N} \exp\left( \frac{y_i \theta_i - b(\theta_i)}{\phi_i} + c(y_i, \phi_i) \right)$
How do we maximize this? Differentiate w.r.t. $\theta$ and set equal to 0. Taking the log first simplifies our life:
$\ell(y \mid \theta) = \sum_{i=1}^{N} \frac{y_i \theta_i - b(\theta_i)}{\phi_i} + \sum_{i=1}^{N} c(y_i, \phi_i)$
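For the logistic-regression case this log-likelihood is easy to write down in code. A minimal sketch (assumed data layout: rows of X are observations, y holds 0/1 labels; the names are illustrative):

import numpy as np

def log_likelihood(beta, X, y):
    # Bernoulli log-likelihood in EDF form: sum_i [ y_i * theta_i - b(theta_i) ], with phi_i = 1
    theta = X @ beta   # canonical link: theta_i = x_i^T beta
    return np.sum(y * theta - np.log(1 + np.exp(theta)))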
Maximum Likelihood Estimation
Through lots of calculus & algebra (see notes), we can obtain the following form for the derivative of the log-likelihood:
$\ell'(y \mid \theta) = \sum_{i=1}^{N} \frac{1}{\mathrm{Var}(y_i)} \frac{\partial \mu_i}{\partial \beta} (y_i - \mu_i)$
Setting this sum equal to 0 gives us the generalized estimating equations:
$\sum_{i=1}^{N} \frac{1}{\mathrm{Var}(y_i)} \frac{\partial \mu_i}{\partial \beta} (y_i - \mu_i) = 0$
Maximum Likelihood Estimation
When we use the canonical link, this simplifies to the normal equations:
$\sum_{i=1}^{N} \frac{x_i^\top (y_i - \mu_i)}{\phi_i} = 0$
Let's attempt to solve the normal equations for the Bernoulli distribution. Plugging in $\mu_i$ and $\phi_i$ we get:
$\sum_{i=1}^{N} \left( y_i - \frac{e^{x_i^\top \beta}}{1 + e^{x_i^\top \beta}} \right) x_i^\top = 0$
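In code, the left-hand side of these Bernoulli normal equations is just a score vector, and the MLE is whatever $\beta$ drives it to zero. A small sketch (illustrative names, not from the slides):

import numpy as np

def score(beta, X, y):
    # Left-hand side of the Bernoulli normal equations: X^T (y - mu)
    mu = 1 / (1 + np.exp(-X @ beta))   # mu_i = e^{x_i^T beta} / (1 + e^{x_i^T beta})
    return X.T @ (y - mu)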
Maximum Likelihood Estimation
Sad news: we can't isolate $\beta$ analytically.
Maximum Likelihood Estimation
Good news: we can approximate it numerically. One choice of algorithm is the Fisher Scoring algorithm. In order to find the $\beta$ that maximizes the log-likelihood, $\ell(y \mid \beta)$:
1. Pick a starting value for our parameter, $\beta_0$.
2. Iteratively update this value as follows:
$\beta_{t+1} = \beta_t - \frac{\ell'(\beta_t)}{\mathbb{E}[\ell''(\beta_t)]}$
In words: perform gradient ascent with a learning rate inversely proportional to the expected curvature of the function at that point.
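Here is a minimal sketch of what Fisher scoring could look like for simple logistic regression (my own illustrative implementation on simulated data, not the course demo; with the canonical link the expected information is $X^\top W X$ with $W = \mathrm{diag}(\mu_i(1 - \mu_i))$):

import numpy as np

def fisher_scoring_logistic(X, y, n_iter=25, tol=1e-10):
    # Fisher scoring for logistic regression with the canonical (logit) link
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        mu = 1 / (1 + np.exp(-X @ beta))   # current fitted probabilities
        score = X.T @ (y - mu)             # gradient of the log-likelihood
        W = mu * (1 - mu)                  # Var(y_i) under the model
        info = X.T @ (W[:, None] * X)      # expected (Fisher) information
        step = np.linalg.solve(info, score)
        beta = beta + step                 # Newton-style ascent step
        if np.max(np.abs(step)) < tol:
            break
    return beta

# Simulated example (assumed data, for illustration only)
rng = np.random.default_rng(0)
x = rng.normal(size=200)
X = np.column_stack([np.ones_like(x), x])   # intercept + one predictor
true_beta = np.array([-0.5, 2.0])
y = rng.binomial(1, 1 / (1 + np.exp(-X @ true_beta)))
print(fisher_scoring_logistic(X, y))        # estimate should land near [-0.5, 2.0]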
Maximum Likelihood Estimation
Here are the results of implementing the Fisher Scoring algorithm for simple logistic regression in Python: DEMO
Questions?