Lecture 3: Logistic Regression & Softmax Regression
Rui Xia, Text Mining Group, Nanjing University of Science & Technology
rxia@njust.edu.cn
Supervised Learning • Regression • Classification Machine Learning, by Rui Xia @ NJUST 2
Logistic Regression
Introduction
• Logistic regression is a classification model, although it is called "regression".
• Logistic regression is a binary classification model.
• Logistic regression is a linear classification model: its decision boundary is linear (a hyperplane), but it uses a nonlinear activation function (the sigmoid function) to model the posterior probability.
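Model Hypothesis
• Sigmoid Function
$$\delta(z) = \frac{1}{1 + e^{-z}}, \qquad \frac{d\delta(z)}{dz} = \delta(z)\,(1 - \delta(z))$$
• Hypothesis
$$p(y = 1 \mid x; \theta) = h_\theta(x) = \delta(\theta^T x) = \frac{1}{1 + e^{-\theta^T x}}$$
$$p(y = 0 \mid x; \theta) = 1 - h_\theta(x)$$
• Hypothesis (Compact Form)
$$p(y \mid x; \theta) = (h_\theta(x))^{y}\,(1 - h_\theta(x))^{1-y} = \left(\frac{1}{1 + e^{-\theta^T x}}\right)^{y} \left(1 - \frac{1}{1 + e^{-\theta^T x}}\right)^{1-y}$$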
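The hypothesis above can be sketched in a few lines of plain Python. This is only an illustrative sketch: the function names (`sigmoid`, `sigmoid_grad`, `hypothesis`, `prob`) are hypothetical, and the branch on the sign of `z` is an added numerical-stability detail that is not part of the slide.

```python
import math

def sigmoid(z):
    """Logistic function 1 / (1 + e^{-z}), evaluated without overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)  # for negative z, e^{z} is small and safe to compute
    return ez / (1.0 + ez)

def sigmoid_grad(z):
    """Derivative of the sigmoid: s(z) * (1 - s(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)

def hypothesis(theta, x):
    """h_theta(x) = sigmoid(theta^T x) for equal-length lists theta and x."""
    return sigmoid(sum(t * xd for t, xd in zip(theta, x)))

def prob(y, theta, x):
    """Compact form p(y | x; theta) = h^y * (1 - h)^(1 - y), y in {0, 1}."""
    h = hypothesis(theta, x)
    return h ** y * (1.0 - h) ** (1 - y)
```

For any `theta` and `x`, `prob(1, theta, x) + prob(0, theta, x)` should equal 1, which is a quick sanity check on the compact form.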
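Learning Algorithm
• (Conditional) Likelihood Function
$$L(\theta) = \prod_{i=1}^{N} p(y^{(i)} \mid x^{(i)}; \theta)$$
$$= \prod_{i=1}^{N} \left(h_\theta(x^{(i)})\right)^{y^{(i)}} \left(1 - h_\theta(x^{(i)})\right)^{1 - y^{(i)}}$$
$$= \prod_{i=1}^{N} \left(\frac{1}{1 + e^{-\theta^T x^{(i)}}}\right)^{y^{(i)}} \left(1 - \frac{1}{1 + e^{-\theta^T x^{(i)}}}\right)^{1 - y^{(i)}}$$
• Maximum Likelihood Estimation
$$\max_\theta L(\theta) \;\Leftrightarrow\; \max_\theta \sum_{i=1}^{N} \left[ y^{(i)} \log h_\theta(x^{(i)}) + (1 - y^{(i)}) \log\left(1 - h_\theta(x^{(i)})\right) \right]$$
The negative log-likelihood function is also known as the cross-entropy cost function.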
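As a rough sketch, the log-likelihood and the cross-entropy cost could be computed as follows. The toy data `X`, `y` and the function names are invented for illustration; each row of `X` is assumed to start with a constant 1 for the bias term, and the clipping constant `eps` is an added safeguard against `log(0)`.

```python
import math

def sigmoid(z):
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def log_likelihood(theta, X, y, eps=1e-12):
    """Sum over samples of y*log(h) + (1 - y)*log(1 - h)."""
    total = 0.0
    for xi, yi in zip(X, y):
        h = sigmoid(sum(t * xd for t, xd in zip(theta, xi)))
        h = min(max(h, eps), 1.0 - eps)  # keep log() finite
        total += yi * math.log(h) + (1 - yi) * math.log(1.0 - h)
    return total

def cross_entropy(theta, X, y):
    """Negative log-likelihood averaged over the N samples."""
    return -log_likelihood(theta, X, y) / len(y)

# Hypothetical toy data: [bias, feature] rows with binary labels.
X = [[1.0, 0.5], [1.0, 1.5], [1.0, 3.0], [1.0, 4.0]]
y = [0, 0, 1, 1]
```

With all-zero parameters every prediction is 0.5, so the cross-entropy equals log 2 regardless of the labels; a parameter vector that separates the two groups gives a lower cost.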
Unconstrained Optimization
• Unconstrained Optimization Problem
$$\max_\theta \sum_{i=1}^{N} \left[ y^{(i)} \log h_\theta(x^{(i)}) + (1 - y^{(i)}) \log\left(1 - h_\theta(x^{(i)})\right) \right]$$
• Optimization Methods
  – Gradient Descent
  – Stochastic Gradient Descent
  – Newton Method
  – Quasi-Newton Method
  – Conjugate Gradient
  – …
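Gradient Descent/Ascent
• Gradient Computation (for the log-likelihood $l(\theta) = \log L(\theta)$)
$$\frac{d\,l(\theta)}{d\theta} = \sum_{i=1}^{N} \left[ y^{(i)} \frac{1}{h_\theta(x^{(i)})} - (1 - y^{(i)}) \frac{1}{1 - h_\theta(x^{(i)})} \right] \frac{\partial}{\partial\theta} h_\theta(x^{(i)})$$
$$= \sum_{i=1}^{N} \left[ y^{(i)} \frac{1}{h_\theta(x^{(i)})} - (1 - y^{(i)}) \frac{1}{1 - h_\theta(x^{(i)})} \right] h_\theta(x^{(i)}) \left(1 - h_\theta(x^{(i)})\right) \frac{\partial}{\partial\theta} \theta^T x^{(i)}$$
$$= \sum_{i=1}^{N} \left[ y^{(i)} \left(1 - h_\theta(x^{(i)})\right) - (1 - y^{(i)})\, h_\theta(x^{(i)}) \right] x^{(i)}$$
$$= \sum_{i=1}^{N} \left( y^{(i)} - h_\theta(x^{(i)}) \right) x^{(i)} \qquad \text{(Error × Feature)}$$
• Gradient Ascent Optimization
$$\theta := \theta + \alpha \sum_{i=1}^{N} \left( y^{(i)} - h_\theta(x^{(i)}) \right) x^{(i)}$$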
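A minimal sketch of batch gradient ascent using the "Error × Feature" gradient is given below. The toy data, the step size `alpha=0.1`, and the iteration count are assumptions chosen for illustration (the labels overlap on purpose so that the maximum-likelihood estimate is finite); none of them come from the slides.

```python
import math

def sigmoid(z):
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def hypothesis(theta, x):
    return sigmoid(sum(t * xd for t, xd in zip(theta, x)))

def gradient(theta, X, y):
    """Gradient of the log-likelihood: sum_i (y_i - h(x_i)) * x_i."""
    grad = [0.0] * len(theta)
    for xi, yi in zip(X, y):
        err = yi - hypothesis(theta, xi)      # error term
        for d, xd in enumerate(xi):
            grad[d] += err * xd               # error x feature
    return grad

def gradient_ascent(X, y, alpha=0.1, n_iters=5000):
    """Batch update theta := theta + alpha * gradient, starting from 0."""
    theta = [0.0] * len(X[0])
    for _ in range(n_iters):
        g = gradient(theta, X, y)
        theta = [t + alpha * gd for t, gd in zip(theta, g)]
    return theta

# Hypothetical, non-separable toy data: [bias, feature] rows.
X = [[1.0, 0.5], [1.0, 1.0], [1.0, 1.5], [1.0, 2.0], [1.0, 2.5], [1.0, 3.0]]
y = [0, 0, 1, 0, 1, 1]
```

Calling `gradient_ascent(X, y)` returns a parameter vector at which the gradient is close to zero; a step size that is too large for the data would make the iteration diverge, so `alpha` usually needs tuning.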
Stochastic Gradient Descent
• Randomly choose a training sample $(x, y)$
• Compute the gradient $(y - h_\theta(x))\,x$
• Update the weights: $\theta := \theta + \alpha\,(y - h_\theta(x))\,x$
• Repeat…
Gradient descent: batch updating
Stochastic gradient descent: online updating
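The four steps above might be sketched as follows. The seeded `random.Random` generator, the constant step size, the step count, and the toy data are all illustrative assumptions; with a constant step size the parameters keep fluctuating around the optimum rather than settling exactly.

```python
import math
import random

def sigmoid(z):
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def hypothesis(theta, x):
    return sigmoid(sum(t * xd for t, xd in zip(theta, x)))

def sgd(X, y, alpha=0.05, n_steps=20000, seed=0):
    """Online updates: one randomly chosen sample per step."""
    rng = random.Random(seed)               # seeded for reproducibility
    theta = [0.0] * len(X[0])
    for _ in range(n_steps):
        i = rng.randrange(len(X))           # randomly choose a sample
        err = y[i] - hypothesis(theta, X[i])
        # theta := theta + alpha * (y - h(x)) * x
        theta = [t + alpha * err * xd for t, xd in zip(theta, X[i])]
    return theta

# Hypothetical, non-separable toy data: [bias, feature] rows.
X = [[1.0, 0.5], [1.0, 1.0], [1.0, 1.5], [1.0, 2.0], [1.0, 2.5], [1.0, 3.0]]
y = [0, 0, 1, 0, 1, 1]
```

Each step costs a single sample instead of a full pass over the data, which is the practical difference between online and batch updating.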
GD vs. SGD
[Figure: two panels, Gradient Descent (GD) and Stochastic Gradient Descent (SGD)]
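Illustration of Newton's Method
[Figure: plot of $f'(\theta)$ against $\theta$, with tangent lines producing the iterates $\theta^{(0)}, \theta^{(1)}, \theta^{(2)}, \theta^{(3)}, \theta^{(4)}, \cdots, \theta^{*}$]
Tangent line to $f'$ at $\theta^{(0)}$: $f'(\theta^{(0)}) + f''(\theta^{(0)})\,(\theta - \theta^{(0)})$
$$\theta^{(1)} = \theta^{(0)} - \frac{f'(\theta^{(0)})}{f''(\theta^{(0)})}$$
$$\theta^{(2)} = \theta^{(1)} - \frac{f'(\theta^{(1)})}{f''(\theta^{(1)})}$$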
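The one-dimensional iteration can be tried out with a short sketch. The objective used here, f(θ) = θ⁴/4 − θ with f'(θ) = θ³ − 1 and f''(θ) = 3θ², is an invented example (its minimizer is θ = 1), and `newton_1d` is a hypothetical helper name.

```python
def newton_1d(f_prime, f_double_prime, theta0, tol=1e-12, max_iter=50):
    """Find a root of f' by following tangent lines of f'."""
    theta = theta0
    history = [theta]
    for _ in range(max_iter):
        step = f_prime(theta) / f_double_prime(theta)
        theta -= step                 # theta_new = theta - f'(theta)/f''(theta)
        history.append(theta)
        if abs(step) < tol:           # stop once the update is negligible
            break
    return theta, history

# Invented example: f(theta) = theta**4 / 4 - theta, minimized at theta = 1.
theta_star, history = newton_1d(lambda t: t ** 3 - 1.0,
                                lambda t: 3.0 * t ** 2,
                                theta0=2.0)
```

The `history` list records every iterate, so printing it shows how quickly the steps shrink once the iterate is near the solution.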
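Newton's Method
• Problem
$$\arg\min_\theta f(\theta) \;\Leftrightarrow\; \text{solve: } \nabla f(\theta) = 0$$
• Second-order Taylor expansion
$$\phi(\theta) = f(\theta^{(k)}) + \nabla f(\theta^{(k)})^T (\theta - \theta^{(k)}) + \frac{1}{2} (\theta - \theta^{(k)})^T\, \nabla^2 f(\theta^{(k)})\, (\theta - \theta^{(k)}) \approx f(\theta)$$
$$\nabla \phi(\theta) = 0 \;\Rightarrow\; \theta = \theta^{(k)} - \left[\nabla^2 f(\theta^{(k)})\right]^{-1} \nabla f(\theta^{(k)})$$
• Newton's method (also called the Newton-Raphson method)
$$\theta^{(k+1)} = \theta^{(k)} - \left[\nabla^2 f(\theta^{(k)})\right]^{-1} \nabla f(\theta^{(k)})$$
where $\nabla^2 f(\theta^{(k)})$ is the Hessian matrix.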
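As a small worked example (not from the slides), consider a quadratic objective with a symmetric positive definite matrix $A$ and a vector $b$, both introduced here purely for illustration. The second-order Taylor expansion is then exact, so a single Newton step reaches the minimizer from any starting point:

```latex
\begin{aligned}
f(\theta) &= \tfrac{1}{2}\,\theta^T A\,\theta - b^T \theta,
\qquad \nabla f(\theta) = A\theta - b,
\qquad \nabla^2 f(\theta) = A, \\
\theta^{(k+1)} &= \theta^{(k)} - A^{-1}\left(A\theta^{(k)} - b\right) = A^{-1} b .
\end{aligned}
```

For non-quadratic objectives such as the logistic regression cost, the expansion is only approximate, so several iterations are needed.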
Gradient Descent vs. Newton's Method
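Newton's Method for Logistic Regression
• Optimization Problem
$$\arg\min_\theta J(\theta), \qquad J(\theta) = \frac{1}{N} \sum_{i=1}^{N} \left[ -y^{(i)} \log h_\theta(x^{(i)}) - (1 - y^{(i)}) \log\left(1 - h_\theta(x^{(i)})\right) \right]$$
• Gradient and Hessian Matrix
$$\nabla J(\theta) = \frac{1}{N} \sum_{i=1}^{N} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x^{(i)}$$
$$H = \frac{1}{N} \sum_{i=1}^{N} h_\theta(x^{(i)}) \left(1 - h_\theta(x^{(i)})\right) x^{(i)} (x^{(i)})^T$$
• Weight updating using Newton's method
$$\theta^{(t+1)} = \theta^{(t)} - H^{-1} \nabla J(\theta^{(t)})$$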
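The update above might be sketched as follows for the special case of two parameters (a bias and one feature), where the 2×2 Hessian can be inverted in closed form. The restriction to two parameters, the function names, the iteration count, and the toy data are assumptions made to keep the example dependency-free; a general implementation would solve the linear system with a numerical library instead.

```python
import math

def sigmoid(z):
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def grad_and_hessian(theta, X, y):
    """Gradient and Hessian of the averaged cross-entropy J (2 parameters)."""
    n = len(X)
    g = [0.0, 0.0]
    H = [[0.0, 0.0], [0.0, 0.0]]
    for xi, yi in zip(X, y):
        h = sigmoid(theta[0] * xi[0] + theta[1] * xi[1])
        w = h * (1.0 - h)
        for a in range(2):
            g[a] += (h - yi) * xi[a] / n
            for b in range(2):
                H[a][b] += w * xi[a] * xi[b] / n
    return g, H

def newton_logistic(X, y, n_iters=10):
    """theta := theta - H^{-1} grad J, starting from theta = 0."""
    theta = [0.0, 0.0]
    for _ in range(n_iters):
        g, H = grad_and_hessian(theta, X, y)
        det = H[0][0] * H[1][1] - H[0][1] * H[1][0]
        # Closed-form H^{-1} g for a 2x2 matrix.
        step0 = (H[1][1] * g[0] - H[0][1] * g[1]) / det
        step1 = (H[0][0] * g[1] - H[1][0] * g[0]) / det
        theta = [theta[0] - step0, theta[1] - step1]
    return theta

# Hypothetical, non-separable toy data: [bias, feature] rows.
X = [[1.0, 0.5], [1.0, 1.0], [1.0, 1.5], [1.0, 2.0], [1.0, 2.5], [1.0, 3.0]]
y = [0, 0, 1, 0, 1, 1]
```

On this toy data a handful of Newton iterations already drives the gradient to nearly zero, whereas gradient ascent with a fixed step size needs far more iterations. If the data were linearly separable, the Hessian would become nearly singular as the weights grow, so this plain version would need regularization or damping.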
Practice: Logistic Regression
• Given the following training data: http://openclassroom.stanford.edu/MainFolder/DocumentPage.php?course=DeepLearning&doc=exercises/ex4/ex4.html
• Implement 1) GD; 2) SGD; 3) Newton's Method for logistic regression, starting with the initial parameter $\theta = 0$.
• Determine how many iterations to use, calculate the cost $J(\theta)$ for each iteration, and plot your results.
Softmax Regression
Softmax Regression
• Softmax regression is a multi-class classification model, also called multi-class logistic regression.
• It is also known as the Maximum Entropy Model (in NLP).
• It is one of the most widely used classification algorithms.
Model Description
• Model Hypothesis
$$p(y = j \mid x; \theta) = h_j(x) = \frac{e^{\theta_j^T x}}{1 + \sum_{j'=1}^{C-1} e^{\theta_{j'}^T x}}, \qquad j = 1, \ldots, C-1$$
$$p(y = C \mid x; \theta) = h_C(x) = \frac{1}{1 + \sum_{j'=1}^{C-1} \exp\{\theta_{j'}^T x\}}$$
• Model Hypothesis (Compact Form)
$$p(y = j \mid x; \theta) = h_j(x) = \frac{e^{\theta_j^T x}}{\sum_{j'=1}^{C} e^{\theta_{j'}^T x}}, \qquad j = 1, 2, \ldots, C, \quad \text{where } \theta_C = 0$$
• Parameters: $\theta \in \mathbb{R}^{C \times M}$
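The compact form of the hypothesis can be sketched like this. The name `softmax_probs` is hypothetical, classes are indexed from 0 in the code rather than from 1, and subtracting the maximum score before exponentiating is an added numerical-stability trick that leaves the probabilities unchanged.

```python
import math

def softmax_probs(Theta, x):
    """Return [h_1(x), ..., h_C(x)] for a C x M parameter list Theta."""
    scores = [sum(t * xd for t, xd in zip(theta_j, x)) for theta_j in Theta]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]   # shift avoids overflow
    total = sum(exps)
    return [e / total for e in exps]

# With C = 2 and the last parameter vector fixed to 0, the first class
# probability is exactly the logistic regression hypothesis.
theta_1 = [0.5, -1.0]
x = [1.0, 2.0]
p = softmax_probs([theta_1, [0.0, 0.0]], x)
logistic = 1.0 / (1.0 + math.exp(-(theta_1[0] * x[0] + theta_1[1] * x[1])))
```

Here `p[0]` and `logistic` agree up to rounding, which illustrates why softmax regression is described as multi-class logistic regression.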
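Maximum Likelihood Estimation
• (Conditional) Log-likelihood
Softmax Regression:
$$l(\theta) = \sum_{i=1}^{N} \log p(y^{(i)} \mid x^{(i)}; \theta)$$
$$= \sum_{i=1}^{N} \log \prod_{j=1}^{C} \left( \frac{e^{\theta_j^T x^{(i)}}}{\sum_{j'=1}^{C} e^{\theta_{j'}^T x^{(i)}}} \right)^{1\{y^{(i)} = j\}}$$
$$= \sum_{i=1}^{N} \sum_{j=1}^{C} 1\{y^{(i)} = j\} \log \frac{e^{\theta_j^T x^{(i)}}}{\sum_{j'=1}^{C} e^{\theta_{j'}^T x^{(i)}}}$$
$$= \sum_{i=1}^{N} \sum_{j=1}^{C} 1\{y^{(i)} = j\} \log h_j(x^{(i)})$$
Logistic Regression:
$$l(\theta) = \sum_{i=1}^{N} \left[ y^{(i)} \log h_\theta(x^{(i)}) + (1 - y^{(i)}) \log\left(1 - h_\theta(x^{(i)})\right) \right]$$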
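Gradient Descent Optimization
• Gradient
$$\frac{\partial \log h_j(x)}{\partial \theta_k} = \begin{cases} (1 - h_k(x))\,x, & j = k \\ -h_k(x)\,x, & j \neq k \end{cases}$$
$$\frac{\partial \sum_{j=1}^{C} 1\{y = j\} \log h_j(x)}{\partial \theta_k} = \begin{cases} (1 - h_k(x))\,x, & y = k \\ -h_k(x)\,x, & y \neq k \end{cases} = \left( 1\{y = k\} - h_k(x) \right) x$$
$$\frac{\partial l(\theta)}{\partial \theta_k} = \sum_{i=1}^{N} \left( 1\{y^{(i)} = k\} - h_k(x^{(i)}) \right) x^{(i)} \qquad \text{(Error × Feature)}$$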
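One way to gain confidence in the gradient formula is to compare it with a finite-difference approximation for a single sample. The sketch below does that; the helper names, the parameter values, and the sample are made up for illustration, and classes are indexed from 0 in the code.

```python
import math

def softmax_probs(Theta, x):
    scores = [sum(t * xd for t, xd in zip(theta_k, x)) for theta_k in Theta]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def log_prob(Theta, x, y):
    """log h_y(x): the per-sample log-likelihood term."""
    return math.log(softmax_probs(Theta, x)[y])

def analytic_grad(Theta, x, y):
    """Row k is (1{y = k} - h_k(x)) * x."""
    h = softmax_probs(Theta, x)
    return [[((1.0 if y == k else 0.0) - h[k]) * xd for xd in x]
            for k in range(len(Theta))]

def numeric_grad(Theta, x, y, eps=1e-6):
    """Central finite differences of log_prob w.r.t. every parameter."""
    grad = []
    for k in range(len(Theta)):
        row = []
        for d in range(len(x)):
            plus = [list(r) for r in Theta]
            minus = [list(r) for r in Theta]
            plus[k][d] += eps
            minus[k][d] -= eps
            row.append((log_prob(plus, x, y) - log_prob(minus, x, y)) / (2 * eps))
        grad.append(row)
    return grad

# Made-up parameters (C = 3 classes, M = 3 features) and one sample.
Theta = [[0.2, -0.1, 0.4], [-0.3, 0.5, 0.1], [0.0, 0.0, 0.0]]
x, y = [1.0, 2.0, -1.0], 1
```

The two gradients should agree to several decimal places. Because the error terms sum to zero over the classes, the rows of the analytic gradient also sum to the zero vector.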
Gradient Descent Optimization
• Gradient Descent
$$\theta_k := \theta_k + \alpha \sum_{i=1}^{N} \left( 1\{y^{(i)} = k\} - h_k(x^{(i)}) \right) x^{(i)}$$
$$\text{where } h_k(x) = \frac{e^{\theta_k^T x}}{\sum_{k'=1}^{C} e^{\theta_{k'}^T x}}, \qquad k = 1, 2, \ldots, C$$
• Stochastic Gradient Descent
$$\theta_k := \theta_k + \alpha \left( 1\{y = k\} - h_k(x) \right) x$$
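Both update rules can be sketched in plain Python as below. The three-class toy data, the step size, the iteration counts, and all function names are illustrative assumptions; classes are indexed from 0, and all C parameter vectors are updated (none is pinned to zero), which still yields valid probabilities.

```python
import math
import random

def softmax_probs(Theta, x):
    scores = [sum(t * xd for t, xd in zip(theta_k, x)) for theta_k in Theta]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def log_likelihood(Theta, X, y):
    return sum(math.log(softmax_probs(Theta, xi)[yi]) for xi, yi in zip(X, y))

def softmax_gd(X, y, n_classes, alpha=0.1, n_iters=500):
    """Batch rule: theta_k += alpha * sum_i (1{y_i = k} - h_k(x_i)) * x_i."""
    Theta = [[0.0] * len(X[0]) for _ in range(n_classes)]
    for _ in range(n_iters):
        grad = [[0.0] * len(X[0]) for _ in range(n_classes)]
        for xi, yi in zip(X, y):
            h = softmax_probs(Theta, xi)
            for k in range(n_classes):
                err = (1.0 if yi == k else 0.0) - h[k]
                for d, xd in enumerate(xi):
                    grad[k][d] += err * xd
        for k in range(n_classes):
            for d in range(len(X[0])):
                Theta[k][d] += alpha * grad[k][d]
    return Theta

def softmax_sgd(X, y, n_classes, alpha=0.1, n_steps=3000, seed=0):
    """Online rule: the same update using one random sample per step."""
    rng = random.Random(seed)
    Theta = [[0.0] * len(X[0]) for _ in range(n_classes)]
    for _ in range(n_steps):
        i = rng.randrange(len(X))
        h = softmax_probs(Theta, X[i])
        for k in range(n_classes):
            err = (1.0 if y[i] == k else 0.0) - h[k]
            for d, xd in enumerate(X[i]):
                Theta[k][d] += alpha * err * xd
    return Theta

def predict(Theta, x):
    h = softmax_probs(Theta, x)
    return max(range(len(h)), key=lambda k: h[k])

# Hypothetical toy data: [bias, x1, x2] rows, three well-separated classes.
X = [[1.0, 1.0, 0.0], [1.0, 1.2, 0.2], [1.0, 0.0, 1.0],
     [1.0, 0.2, 1.2], [1.0, -1.0, -1.0], [1.0, -1.2, -0.8]]
y = [0, 0, 1, 1, 2, 2]
```

For example, `[predict(softmax_gd(X, y, 3), xi) for xi in X]` gives the predicted class of each training sample. Since the updates add the gradient of the log-likelihood, they are ascent steps on the likelihood, or equivalently descent steps on the negative log-likelihood.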
Other Optimization Methods
• Newton Method
• Quasi-Newton Method (BFGS)
• Limited Memory BFGS (L-BFGS)
• Conjugate Gradient
• GIS (Generalized Iterative Scaling)
• IIS (Improved Iterative Scaling)
• …
Practice: Softmax Regression
• Given the following training data: http://openclassroom.stanford.edu/MainFolder/DocumentPage.php?course=DeepLearning&doc=exercises/ex4/ex4.html
• Implement logistic regression with 1) GD; 2) SGD.
• Implement softmax regression with 1) GD; 2) SGD.
• Compare logistic regression and softmax regression.
Questions?