Logistic Regression CSCI 447/547 MACHINE LEARNING
Outline
- Math Behind Logistic Regression
- Visualizing Logistic Regression
- Loss Function / Minimizing the Log-Likelihood Function
- Batch/Full Gradient Descent
- Gradient Descent for OLS
- Gradient Descent for Logistic Regression
- Comparing OLS and Logistic Regression
- Multi-Class Logistic Regression Using Softmax
OLS Recap
- Linear regression predicts continuous and potentially unbounded labels from given features: $\hat{Y} = XB$, where $B$ is the vector of coefficients and $X$ is the data matrix.
- The issue: unbounded output means we cannot use it directly for discrete classification.
Logistic Regression
- Classification: predicts discrete labels from given features.
- The setup: $y = 0$ or $1$, with probabilities $1 - p$ and $p$.
- Predict a probability instead of a value: estimate $P(y = 1 \mid X)$.
Definitions
- Link function: relates the mean of the response distribution to the output of the linear model, converting unbounded, continuous output into a bounded prediction with a discrete interpretation. In a generalized linear model the response distribution is typically a member of the exponential family, and the link function comes from that choice.
- Logistic function: input ranges over $(-\infty, \infty)$ and output over $(0, 1)$, i.e. $f : (-\infty, \infty) \to (0, 1)$. Typically the sigmoid function, with $f(0) = 0.5$, $f(-\infty) = 0$, $f(\infty) = 1$. The output value is interpreted as a probability.
Note on Choice of Link Function
Properties:
- Bounded range $[0, 1]$
- Domain $(-\infty, \infty)$
- Differentiable everywhere (used in optimization)
- Increasing function: we keep the interpretation of coefficient sign (positive or negative effect) in moving from linear to logistic regression
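To make these properties concrete, here is a minimal sketch of the sigmoid link in NumPy (the function name and the spot-check values are illustrative choices, not from the slides):

```python
import numpy as np

def sigmoid(z):
    """Logistic (sigmoid) link: maps (-inf, inf) into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Spot-check the properties listed above.
print(sigmoid(0.0))                    # 0.5
print(sigmoid(-10.0), sigmoid(10.0))   # ~0.000045 and ~0.999955
```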
Visualizing Logistic Regression
Logistic Regression Loss
- Our Goal
- The Full Log-Likelihood
- A Note on Minimization
Our Goal
- Find $P(y = 1 \mid X)$, the probability to be estimated.
- Logistic function: $\phi(x) = \frac{1}{1 + e^{-x}}$
- The OLS model $Y = XB$ becomes $Y = \Phi(XB)$, where $\Phi = (\phi_1, \ldots, \phi_N)^T$ applies the logistic function row by row, so $\hat{y}_i = \phi(x_i \cdot B)$.
- MLE (Maximum Likelihood Estimation): $L = \prod_{i=1}^{N} P(Y = y_i \mid x_i)$
- The observations are independent, so the joint probability is the product over the individual observations.
Our Goal
- $L = \prod_{i=1}^{N} P(Y = y_i \mid x_i) = \prod_{i=1}^{N} p_i^{y_i} (1 - p_i)^{1 - y_i}$
- Log-likelihood: $\log L = \sum_{i=1}^{N} \left[ y_i \log p_i + (1 - y_i) \log(1 - p_i) \right] = \sum_{i=1}^{N} \left[ y_i \log \phi(x_i \cdot B) + (1 - y_i) \log(1 - \phi(x_i \cdot B)) \right]$
- Since we want to minimize, take the negative log-likelihood $-\log L$; this is the quantity we will minimize.
The Full Log-Likelihood
- Logistic regression minimizes this negative log-likelihood.
- So far this parallels OLS, but unlike OLS there is no explicit (closed-form) solution, so we must minimize it numerically.
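As a sketch of the quantity being minimized, the negative log-likelihood can be written in NumPy as follows (the function names and the clipping constant are our choices; clipping just guards against taking log of 0):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def negative_log_likelihood(B, X, y, eps=1e-12):
    """-log L for logistic regression.

    X: (N, d) data matrix, y: (N,) labels in {0, 1}, B: (d,) coefficients.
    """
    p = np.clip(sigmoid(X @ B), eps, 1 - eps)   # P(y=1 | x_i) for each row
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
```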
Logistic Regression Loss
Gradient Descent
Batch/Full Gradient Descent
The gradient algorithm (see the NumPy sketch below):
1. Choose $x$ randomly.
2. Compute the gradient of $f$ at $x$: $\nabla f(x)$.
3. Step in the direction of the negative gradient: $x \leftarrow x - \eta \nabla f(x)$, where $\eta$ is the step size (too large and we can overshoot; too small and convergence takes too long).
4. Repeat steps 2 and 3 until convergence: the change between iterations stops decreasing, or a set number of iterations is reached.
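The loop above, written generically in NumPy (the stopping rule on the step norm and the default settings are our assumptions):

```python
import numpy as np

def gradient_descent(grad_f, x0, eta=0.1, tol=1e-6, max_iters=10_000):
    """Batch gradient descent: step against the gradient until convergence."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iters):
        step = eta * grad_f(x)          # eta is the step size
        x = x - step                    # move in the negative gradient direction
        if np.linalg.norm(step) < tol:  # stop once the update is negligibly small
            break
    return x

# Example: minimize f(x) = (x - 3)^2, whose gradient is 2(x - 3).
print(gradient_descent(lambda x: 2 * (x - 3), x0=[0.0]))   # ~[3.]
```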
Gradient Descent for OLS
- Numerical minimization of the OLS mean loss (the negative mean log-likelihood, up to constants): $L(B) = \frac{1}{N} \| y - XB \|_2^2$
- The algorithm for OLS follows the gradient-descent template above, using the update $B \leftarrow B + \frac{2\eta}{N} X^T (y - XB)$ (see the comparison of update steps later).
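A minimal sketch of this OLS update in NumPy, assuming the mean-squared-error loss and update step above (the initialization scale and iteration count are arbitrary choices):

```python
import numpy as np

def ols_gradient_descent(X, y, eta=0.01, n_iters=5000):
    """Fit OLS coefficients by batch gradient descent on L(B) = (1/N)*||y - XB||^2."""
    N, d = X.shape
    B = np.random.randn(d) * 0.01                  # random initialization
    for _ in range(n_iters):
        residual = y - X @ B                       # y - XB
        B = B + (2 * eta / N) * (X.T @ residual)   # step against the gradient
    return B
```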
Gradient Descent for Logistic Regression
- Numerical minimization of the logistic loss: $L(B) = -\sum_i \left[ y_i \log \phi(x_i \cdot B) + (1 - y_i) \log(1 - \phi(x_i \cdot B)) \right]$
- The gradient has two components, one from each term.
- Recall $\phi(x) = \frac{1}{1 + e^{-x}}$. Because the sigmoid is symmetric about $1/2$: $\phi(x) + \phi(-x) = 1$.
Gradient Descent for Logistic Regression
First term, differentiated with respect to coefficient $B_k$:
- $\frac{\partial}{\partial B_k} \log \phi(x_i \cdot B) = \frac{1}{\phi(x_i \cdot B)} \frac{\partial}{\partial B_k} \phi(x_i \cdot B)$, with $\phi(x_i \cdot B) = \frac{1}{1 + e^{-x_i \cdot B}}$
- $\frac{\partial}{\partial B_k} \frac{1}{1 + e^{-x_i \cdot B}} = \frac{1}{(1 + e^{-x_i \cdot B})^2}\, e^{-x_i \cdot B}\, x_{ik}$
- Combining: $\frac{\partial}{\partial B_k} \log \phi(x_i \cdot B) = \frac{e^{-x_i \cdot B}}{1 + e^{-x_i \cdot B}}\, x_{ik} = \frac{1}{1 + e^{x_i \cdot B}}\, x_{ik} = \phi(-x_i \cdot B)\, x_{ik} = (1 - \phi(x_i \cdot B))\, x_{ik}$
- In vector form: $\frac{\partial}{\partial B} \log \phi(x_i \cdot B) = x_i^T \left(1 - \phi(x_i \cdot B)\right)$
Gradient Descent for Logistic Regression
Second term, using $1 - \phi(x) = \phi(-x)$:
- $\frac{\partial}{\partial B} \log\left(1 - \phi(x_i \cdot B)\right) = \frac{\partial}{\partial B} \log \phi(-x_i \cdot B) = -x_i^T \left(1 - \phi(-x_i \cdot B)\right) = -x_i^T \phi(x_i \cdot B)$
Gradient Descent for Logistic Regression
All terms together:
$\frac{\partial L(B)}{\partial B} = -\sum_i \left[ y_i x_i^T \left(1 - \phi(x_i \cdot B)\right) + (1 - y_i)\left(-x_i^T\right) \phi(x_i \cdot B) \right]$
$= -\sum_i \left[ y_i x_i^T - y_i x_i^T \phi(x_i \cdot B) + y_i x_i^T \phi(x_i \cdot B) - x_i^T \phi(x_i \cdot B) \right]$
$= -\sum_i x_i^T \left( y_i - \phi(x_i \cdot B) \right)$
$= -X^T \left( y - \Phi(XB) \right)$, where $\Phi = (\phi_1, \ldots, \phi_N)^T$
Gradient Descent for Logistic Regression
- $\frac{\partial L}{\partial B} = -X^T \left( y - \Phi(XB) \right)$, where $\Phi$ applies the logistic function to each component.
- Normalize by the number of observations: divide by $N$ to get the mean loss.
Gradient Descent for Logistic Regression
Algorithm (see the sketch below):
1. Pick a learning rate $\eta$.
2. Initialize $B$ randomly.
3. Iterate: $B \leftarrow B - \eta \frac{\partial L}{\partial B} = B + \frac{\eta}{N} X^T \left[ y - \Phi(XB) \right]$
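A sketch of this algorithm in NumPy, following the update step above (the toy data at the end is purely illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_gradient_descent(X, y, eta=0.1, n_iters=5000):
    """Fit logistic regression by batch gradient descent.

    Update: B <- B + (eta / N) * X^T (y - sigmoid(XB))
    """
    N, d = X.shape
    B = np.random.randn(d) * 0.01              # random initialization
    for _ in range(n_iters):
        residual = y - sigmoid(X @ B)          # y - Phi(XB)
        B = B + (eta / N) * (X.T @ residual)
    return B

# Toy usage: intercept column plus one feature, true coefficients [-0.5, 2.0].
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(500), rng.normal(size=500)])
y = (rng.random(500) < sigmoid(X @ np.array([-0.5, 2.0]))).astype(float)
print(logistic_gradient_descent(X, y))        # estimates near [-0.5, 2.0], up to sampling noise
```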
Comparing OLS and Logistic Regression
The two update steps:
- OLS: $B \leftarrow B + \frac{2\eta}{N} X^T \left( y - XB \right)$  (a constant times the residual error)
- Logistic: $B \leftarrow B + \frac{\eta}{N} X^T \left( y - \Phi(XB) \right)$  (a constant times the error)
Comparing OLS and Logistic Regression
Regularization:
- OLS loss function under regularization: $L = \| y - XB \|_2^2 + \frac{\lambda}{2} \| B \|_2^2$
- Logistic: $L = \| y - \Phi(XB) \|_2^2 + \frac{\lambda}{2} \| B \|_2^2$
- L2 norm penalty (ridge regression): uniform shrinkage of the coefficients.
- L1 norm penalty (LASSO regression): drives coefficients to zero, giving sparsity and dimensionality reduction.
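For illustration, here is how an L2 (ridge) penalty changes the gradient step. Note this sketch applies the penalty to the negative log-likelihood loss rather than to the squared-error form written above; that is a deliberate substitution, and the function and parameter names are our own:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def ridge_logistic_gradient_descent(X, y, lam=0.1, eta=0.1, n_iters=5000):
    """Logistic regression with an L2 penalty (lam/2)*||B||^2.

    The penalty contributes lam * B to the gradient, shrinking all coefficients uniformly.
    """
    N, d = X.shape
    B = np.zeros(d)
    for _ in range(n_iters):
        grad = -(X.T @ (y - sigmoid(X @ B))) / N + lam * B
        B = B - eta * grad
    return B
```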
Multi-Class Logistic Regression Using Softmax
Softmax:
- $P(y = c_k \mid x_i) = \frac{e^{B_k \cdot x_i}}{\sum_{l=1}^{m} e^{B_l \cdot x_i}}$, for $y \in \{1, \ldots, m\}$
- $\sum_{k=1}^{m} P(y = c_k \mid x_i) = 1$, so it is a probability distribution over the classes.
Multi-Class Logistic Regression Using Softmax
Comparison with logistic regression in the case of two classes:
- Two classes: $P(y = 1 \mid x_i) = \frac{1}{1 + e^{-x_i \cdot B}}$
- Softmax with two classes: $P(y = 1 \mid x_i) = \frac{e^{x_i \cdot B_1}}{e^{x_i \cdot B_1} + e^{x_i \cdot B_2}}$; multiplying numerator and denominator by $e^{-x_i \cdot B_1}$ gives $\frac{1}{1 + e^{-x_i \cdot (B_1 - B_2)}}$
- So with two classes, softmax reduces to the logistic function applied to the coefficient difference $B_1 - B_2$.
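A quick numerical check of this reduction (the sample vectors are arbitrary): with two classes, the first softmax probability matches the sigmoid of $x_i \cdot (B_1 - B_2)$.

```python
import numpy as np

def softmax(scores):
    """Softmax over class scores; subtracting the max is for numerical stability."""
    z = scores - np.max(scores)
    e = np.exp(z)
    return e / np.sum(e)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([1.0, 2.0])
B1, B2 = np.array([0.5, -0.3]), np.array([-0.2, 0.4])
print(softmax(np.array([x @ B1, x @ B2]))[0])   # ~0.3318
print(sigmoid(x @ (B1 - B2)))                   # same value
```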
Softmax Optimization
Classification probabilities:
- $\Pi_{i,k} = \frac{e^{x_i \cdot B_k}}{\sum_l e^{x_i \cdot B_l}}$, and $\pi_k = (\Pi_{1,k}, \ldots, \Pi_{N,k})^T$, the probability of class $k$ for observation $i$.
The gradients:
- $\frac{\partial L}{\partial B_k} = -X^T \left( y_k - \pi_k \right)$
- Compare logistic: $-X^T \left( y - \Phi(XB) \right)$; the probability vector replaces the sigmoid term.
- The labels are one-hot encoded: each $y_i$ is a vector with all entries 0 except a 1 in the position of its class, and $y_k$ collects the $k$-th entries across observations.
Softmax Gradient Descent
The algorithm is largely unchanged (see the sketch below):
1. Initialize the learning rate $\eta$.
2. Randomly choose $B_1, \ldots, B_m$.
3. Iterate, for each class $k$: $B_k \leftarrow B_k + \frac{\eta}{N} X^T \left( y_k - \pi_k \right)$
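A sketch of the multi-class version in NumPy, assuming integer class labels that get one-hot encoded internally (the names and defaults are illustrative):

```python
import numpy as np

def softmax(scores):
    z = scores - np.max(scores, axis=1, keepdims=True)
    e = np.exp(z)
    return e / np.sum(e, axis=1, keepdims=True)

def softmax_gradient_descent(X, y, n_classes, eta=0.1, n_iters=5000):
    """Multi-class logistic regression by batch gradient descent.

    X: (N, d) data matrix, y: (N,) integer labels in {0, ..., n_classes - 1}.
    Returns B of shape (d, n_classes), one coefficient column per class.
    """
    N, d = X.shape
    B = np.random.randn(d, n_classes) * 0.01
    Y = np.eye(n_classes)[y]                  # one-hot label matrix, shape (N, n_classes)
    for _ in range(n_iters):
        Pi = softmax(X @ B)                   # class probabilities, shape (N, n_classes)
        B = B + (eta / N) * (X.T @ (Y - Pi))  # updates every B_k in one step
    return B
```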
Last Notes
- Learning rate $\eta$: small values take a long time to converge; with large values, convergence may not happen at all.
- Important to monitor the loss function on each iteration.
- Also make sure you normalize the features so values do not get too large.
- The gradient descent algorithms across all of these models are very similar.
Summary
- Math Behind Logistic Regression
- Visualizing Logistic Regression
- Loss Function / Minimizing the Log-Likelihood Function
- Batch/Full Gradient Descent
- Gradient Descent for OLS
- Gradient Descent for Logistic Regression
- Comparing OLS and Logistic Regression
- Multi-Class Logistic Regression Using Softmax