CSE 802 Spring 2017
Logistic Regression
Inci M. Baytas
Computer Science, Michigan State University
March 29, 2017
Introduction
◮ Consider a two-class classification problem. The posterior probability of class C_1 can be written as:
    p(C_1 | Φ) = y(Φ) = σ(w^T Φ)   (1)
◮ σ(·) is the logistic sigmoid function.
◮ p(C_2 | Φ) = 1 − p(C_1 | Φ)
◮ Φ is a feature vector obtained by a non-linear transformation of the original observation x.
◮ The model in Eq. 1 is called logistic regression in the terminology of statistics.
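◮ As a concrete illustration (a sketch added to these notes, not from the original slides), evaluating the posterior in Eq. 1 for a single feature vector; the weight and feature values below are hypothetical:

import numpy as np

def sigmoid(a):
    # Logistic sigmoid: sigma(a) = 1 / (1 + exp(-a))
    return 1.0 / (1.0 + np.exp(-a))

# Hypothetical weight and feature vectors (illustrative values only)
w = np.array([0.5, -1.2, 0.3])
phi = np.array([1.0, 0.4, 2.0])

p_c1 = sigmoid(w @ phi)   # p(C1 | phi), Eq. 1
p_c2 = 1.0 - p_c1         # p(C2 | phi) = 1 - p(C1 | phi)
print(p_c1, p_c2)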
Logistic Regression I
◮ A classification model rather than a regression model
◮ A probabilistic discriminative model
◮ We estimate the parameter vector w directly.
◮ Comparison of logistic regression and a generative model in an M-dimensional feature space Φ:
◮ Logistic regression: M adjustable parameters.
◮ Generative model: assume we fit Gaussian class-conditional densities with a shared covariance using maximum likelihood; this requires
    Means: 2M
    Shared covariance: M(M + 1)/2
    Prior p(C_1): 1
    Total: 2M + M(M + 1)/2 + 1 = M(M + 5)/2 + 1 parameters
◮ Maximum likelihood is used to determine the parameters of the logistic regression model.
Logistic Regression II
◮ Definition and properties of the logistic sigmoid function:
    σ(a) = 1 / (1 + exp(−a))   (2)
    σ(−a) = 1 − σ(a)
    dσ/da = σ(1 − σ)
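◮ A small numerical check of these two properties (a sketch added for these notes), comparing the analytic derivative σ(1 − σ) with a finite-difference approximation:

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

a = np.linspace(-5.0, 5.0, 11)

# Property 1: sigma(-a) = 1 - sigma(a)
assert np.allclose(sigmoid(-a), 1.0 - sigmoid(a))

# Property 2: d sigma / da = sigma * (1 - sigma), checked by central finite differences
eps = 1e-6
numeric = (sigmoid(a + eps) - sigmoid(a - eps)) / (2 * eps)
analytic = sigmoid(a) * (1.0 - sigmoid(a))
assert np.allclose(numeric, analytic, atol=1e-6)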
Logistic Regression III - How to Estimate w
◮ For a training data set {Φ_n, t_n}, where t_n ∈ {0, 1} and Φ_n = Φ(x_n), with n = 1, ..., N, the likelihood can be written as:
    p(t | w) = ∏_{n=1}^{N} y_n^{t_n} (1 − y_n)^{1 − t_n}   (3)
    where t = (t_1, ..., t_N)^T and y_n = p(C_1 | Φ_n).
◮ The error function is the negative logarithm of the likelihood, known as the cross-entropy error function:
    E(w) = − ln p(t | w) = − Σ_{n=1}^{N} { t_n ln y_n + (1 − t_n) ln(1 − y_n) }   (4)
    where y_n = σ(a_n) and a_n = w^T Φ_n.
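◮ A minimal sketch (added here, with hypothetical toy data) of evaluating the cross-entropy error in Eq. 4 for a design matrix Phi of shape (N, M) and binary targets t:

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def cross_entropy(w, Phi, t, eps=1e-12):
    # E(w) = -sum_n [ t_n ln y_n + (1 - t_n) ln(1 - y_n) ],  y_n = sigma(w^T phi_n)
    y = sigmoid(Phi @ w)
    y = np.clip(y, eps, 1.0 - eps)   # avoid log(0)
    return -np.sum(t * np.log(y) + (1.0 - t) * np.log(1.0 - y))

# Hypothetical toy data: N = 4 examples, M = 3 features
Phi = np.array([[1.0,  0.2, -0.5],
                [1.0,  1.5,  0.3],
                [1.0, -0.7,  0.8],
                [1.0,  0.9, -1.1]])
t = np.array([0, 1, 0, 1])
w = np.zeros(3)
print(cross_entropy(w, Phi, t))   # equals N * ln 2 when w = 0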
Logistic Regression IV - How to Estimate w
◮ There is no analytical (closed-form) solution.
◮ The cross-entropy loss is a convex function.
◮ There is a global minimum.
◮ Can use an iterative approach.
◮ Calculate the gradient with respect to w:
    ∇E(w) = Σ_{n=1}^{N} (y_n − t_n) Φ_n   (5)
◮ Use gradient descent (batch or online):
    w^{τ+1} = w^τ − η ∇E(w^τ)   (6)
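◮ A sketch of batch gradient descent using Eqs. 5 and 6 (added for illustration; the step size eta and iteration count are arbitrary choices, not from the slides):

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def fit_logistic_gd(Phi, t, eta=0.1, n_iters=1000):
    # Batch gradient descent on the cross-entropy error.
    N, M = Phi.shape
    w = np.zeros(M)
    for _ in range(n_iters):
        y = sigmoid(Phi @ w)        # y_n = sigma(w^T phi_n)
        grad = Phi.T @ (y - t)      # Eq. 5: sum_n (y_n - t_n) phi_n
        w = w - eta * grad          # Eq. 6: gradient descent step
    return w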
Logistic Regression V - How to Estimate w
◮ Newton-Raphson Algorithm
    w^{(new)} = w^{(old)} − H^{−1} ∇E(w)   (7)
    where H is the Hessian matrix of E(w), i.e., the matrix of second derivatives with respect to w.
◮ It uses a local quadratic approximation to the cross-entropy error function to update w iteratively.
◮ For logistic regression, the Newton-Raphson algorithm is also known as iterative reweighted least squares (IRLS).
◮ Convexity: H is positive definite (its eigenvalues are positive), so the error function is convex and has a unique minimum.
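◮ A sketch of the Newton-Raphson (IRLS) update (added here; it uses the standard form of the Hessian, H = Φ^T R Φ with R = diag(y_n(1 − y_n)), following Bishop's treatment rather than anything written on the slide):

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def fit_logistic_irls(Phi, t, n_iters=10):
    # Newton-Raphson / iterative reweighted least squares for logistic regression.
    N, M = Phi.shape
    w = np.zeros(M)
    for _ in range(n_iters):
        y = sigmoid(Phi @ w)
        R = np.diag(y * (1.0 - y))                        # diagonal weighting matrix
        grad = Phi.T @ (y - t)                            # gradient, Eq. 5
        H = Phi.T @ R @ Phi + 1e-8 * np.eye(M)            # Hessian; small jitter added only for numerical safety
        w = w - np.linalg.solve(H, grad)                  # Eq. 7: w_new = w_old - H^{-1} grad
    return w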
Multi-class Logistic Regression
◮ Cross-entropy for the multi-class classification problem:
    E(w_1, ..., w_K) = − Σ_{n=1}^{N} Σ_{k=1}^{K} t_nk ln y_nk   (8)
    where y_k(Φ) = p(C_k | Φ) = exp(w_k^T Φ) / Σ_j exp(w_j^T Φ), which is called the softmax function.
◮ Use maximum likelihood to estimate the parameters.
◮ Use an iterative approach such as Newton-Raphson.
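◮ A sketch of the softmax posterior and the multi-class cross-entropy of Eq. 8 (added for illustration; the shapes and sample values below are hypothetical):

import numpy as np

def softmax(A):
    # Row-wise softmax: y_nk = exp(a_nk) / sum_j exp(a_nj)
    A = A - A.max(axis=1, keepdims=True)      # subtract row max for numerical stability
    expA = np.exp(A)
    return expA / expA.sum(axis=1, keepdims=True)

def multiclass_cross_entropy(W, Phi, T, eps=1e-12):
    # W: (M, K) weight matrix, Phi: (N, M) features, T: (N, K) one-hot targets
    Y = softmax(Phi @ W)                                  # y_nk = p(C_k | phi_n)
    return -np.sum(T * np.log(np.clip(Y, eps, 1.0)))     # Eq. 8

# Hypothetical toy example: N = 2, M = 3, K = 3
Phi = np.array([[1.0,  0.5, -0.2],
                [1.0, -1.0,  0.7]])
T = np.array([[1, 0, 0],
              [0, 0, 1]])
W = np.zeros((3, 3))
print(multiclass_cross_entropy(W, Phi, T))   # equals N * ln K when W = 0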
Over-fitting in Logistic Regression
◮ Maximum likelihood can suffer from severe over-fitting, especially when the training data are linearly separable.
◮ This can be overcome by finding a MAP solution for w (a Bayesian treatment).
◮ Another alternative is to use regularization:
◮ Add a regularizer to the loss function, giving a regularized (negative) log-likelihood.
◮ ℓ2 norm (ridge)
◮ ℓ1 norm (Lasso)
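◮ A sketch of adding an ℓ2 penalty to the cross-entropy error and its gradient (added here; the regularization strength lam is a hypothetical choice):

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def regularized_loss_and_grad(w, Phi, t, lam=0.1, eps=1e-12):
    # E_reg(w) = cross-entropy + (lam / 2) * ||w||^2
    y = np.clip(sigmoid(Phi @ w), eps, 1.0 - eps)
    loss = -np.sum(t * np.log(y) + (1.0 - t) * np.log(1.0 - y)) + 0.5 * lam * (w @ w)
    grad = Phi.T @ (y - t) + lam * w      # gradient of Eq. 5 plus the penalty term
    return loss, grad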
References
◮ Classification lecture of Dr. Jiayu Zhou.
◮ Christopher M. Bishop, Pattern Recognition and Machine Learning, Information Science and Statistics, Springer-Verlag New York, 2006.