
Logistic Regression & Softmax Regression - Rui Xia, Text Mining - PowerPoint PPT Presentation



  1. Lecture 3: Logistic Regression & Softmax Regression. Rui Xia, Text Mining Group, Nanjing University of Science & Technology. rxia@njust.edu.cn

  2. Supervised Learning
  • Regression
  • Classification

  3. Logistic Regression

  4. Introduction
  • Logistic regression is a classification model, although it is called "regression";
  • Logistic regression is a binary classification model;
  • Logistic regression is a linear classification model: it has a linear decision boundary (a hyperplane), but uses a nonlinear activation function (the sigmoid function) to model the posterior probability.

  5. Model Hypothesis
  • Sigmoid function: $\sigma(z) = \frac{1}{1 + e^{-z}}$, with derivative $\frac{d\sigma(z)}{dz} = \sigma(z)\,(1 - \sigma(z))$
  • Hypothesis: $p(y = 1 \mid x; \theta) = h_\theta(x) = \sigma(\theta^T x) = \frac{1}{1 + e^{-\theta^T x}}$, and $p(y = 0 \mid x; \theta) = 1 - h_\theta(x)$
  • Hypothesis (compact form): $p(y \mid x; \theta) = (h_\theta(x))^y \,(1 - h_\theta(x))^{1-y} = \left(\frac{1}{1 + e^{-\theta^T x}}\right)^y \left(1 - \frac{1}{1 + e^{-\theta^T x}}\right)^{1-y}$
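To make the hypothesis concrete, here is a minimal NumPy sketch of the sigmoid and the logistic regression hypothesis; the helper names sigmoid and hypothesis are illustrative, not from the slides.

```python
import numpy as np

def sigmoid(z):
    # sigma(z) = 1 / (1 + e^{-z}); clipping z avoids overflow in np.exp
    z = np.clip(z, -500, 500)
    return 1.0 / (1.0 + np.exp(-z))

def hypothesis(theta, X):
    # h_theta(x) = sigma(theta^T x) = p(y = 1 | x; theta)
    return sigmoid(X @ theta)
```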

  6. Learning Algorithm
  • (Conditional) likelihood function:
  $L(\theta) = \prod_{i=1}^{N} p(y^{(i)} \mid x^{(i)}; \theta) = \prod_{i=1}^{N} h_\theta(x^{(i)})^{y^{(i)}} \left(1 - h_\theta(x^{(i)})\right)^{1 - y^{(i)}} = \prod_{i=1}^{N} \left(\frac{1}{1 + e^{-\theta^T x^{(i)}}}\right)^{y^{(i)}} \left(1 - \frac{1}{1 + e^{-\theta^T x^{(i)}}}\right)^{1 - y^{(i)}}$
  • Maximum likelihood estimation:
  $\max_\theta L(\theta) \;\Leftrightarrow\; \max_\theta \sum_{i=1}^{N} \left[ y^{(i)} \log h_\theta(x^{(i)}) + (1 - y^{(i)}) \log\left(1 - h_\theta(x^{(i)})\right) \right]$
  The negative log-likelihood is also known as the cross-entropy cost function.
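As a sketch, the cross-entropy cost (negative log-likelihood) can be evaluated as follows; it reuses the illustrative hypothesis helper above, and the small eps inside the logs is a numerical-stability choice not specified in the slides.

```python
def cross_entropy_cost(theta, X, y, eps=1e-12):
    # -sum_i [ y^(i) log h(x^(i)) + (1 - y^(i)) log(1 - h(x^(i))) ]
    h = hypothesis(theta, X)
    return -np.sum(y * np.log(h + eps) + (1 - y) * np.log(1 - h + eps))
```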

  7. Unconstrained Optimization
  • Unconstrained optimization problem:
  $\max_\theta \sum_{i=1}^{N} \left[ y^{(i)} \log h_\theta(x^{(i)}) + (1 - y^{(i)}) \log\left(1 - h_\theta(x^{(i)})\right) \right]$
  • Optimization methods:
  – Gradient Descent
  – Stochastic Gradient Descent
  – Newton's Method
  – Quasi-Newton Methods
  – Conjugate Gradient
  – …

  8. Gradient Descent/Ascent
  • Gradient computation:
  $\frac{\partial l(\theta)}{\partial \theta} = \sum_{i=1}^{N} \left( y^{(i)} \frac{1}{h_\theta(x^{(i)})} - (1 - y^{(i)}) \frac{1}{1 - h_\theta(x^{(i)})} \right) \frac{\partial h_\theta(x^{(i)})}{\partial \theta}$
  $= \sum_{i=1}^{N} \left( y^{(i)} \frac{1}{h_\theta(x^{(i)})} - (1 - y^{(i)}) \frac{1}{1 - h_\theta(x^{(i)})} \right) h_\theta(x^{(i)}) \left(1 - h_\theta(x^{(i)})\right) \frac{\partial \theta^T x^{(i)}}{\partial \theta}$
  $= \sum_{i=1}^{N} \left( y^{(i)} \left(1 - h_\theta(x^{(i)})\right) - (1 - y^{(i)})\, h_\theta(x^{(i)}) \right) x^{(i)}$
  $= \sum_{i=1}^{N} \left( y^{(i)} - h_\theta(x^{(i)}) \right) x^{(i)}$   (Error × Feature)
  • Gradient ascent optimization:
  $\theta := \theta + \alpha \sum_{i=1}^{N} \left( y^{(i)} - h_\theta(x^{(i)}) \right) x^{(i)}$
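A minimal batch gradient-ascent loop implementing the "error × feature" update above; the learning rate alpha and iteration count are illustrative defaults, not values from the lecture.

```python
def gradient_ascent(X, y, alpha=0.01, n_iters=1000):
    # X: (N, D) design matrix; y: (N,) labels in {0, 1}
    theta = np.zeros(X.shape[1])
    for _ in range(n_iters):
        error = y - hypothesis(theta, X)   # y^(i) - h_theta(x^(i)) for all i
        theta += alpha * (X.T @ error)     # theta := theta + alpha * sum_i error_i * x^(i)
    return theta
```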

  9. Stochastic Gradient Descent
  • Randomly choose a training sample $(x, y)$
  • Compute the gradient: $(y - h_\theta(x))\, x$
  • Update the weights: $\theta := \theta + \alpha \left( y - h_\theta(x) \right) x$
  • Repeat…
  Gradient descent: batch updating. Stochastic gradient descent: online updating.
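The stochastic variant updates on one sample at a time, as sketched below; visiting the shuffled training set once per epoch is an implementation choice (the slide only says "randomly choose").

```python
def sgd(X, y, alpha=0.01, n_epochs=50, seed=0):
    rng = np.random.default_rng(seed)
    theta = np.zeros(X.shape[1])
    for _ in range(n_epochs):
        for i in rng.permutation(len(y)):          # visit samples in random order
            error = y[i] - hypothesis(theta, X[i])
            theta += alpha * error * X[i]          # online update: theta += alpha * (y - h(x)) * x
    return theta
```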

  10. GD vs. SGD
  [Figure: two panels, Gradient Descent (GD) and Stochastic Gradient Descent (SGD)]

  11. Illustration of Newton's Method
  Tangent line at $\theta^{(0)}$: $h = f'(\theta^{(0)}) + f''(\theta^{(0)})\,(\theta - \theta^{(0)})$
  Setting the tangent line to zero gives $\theta^{(1)} = \theta^{(0)} - \frac{f'(\theta^{(0)})}{f''(\theta^{(0)})}$, then $\theta^{(2)} = \theta^{(1)} - \frac{f'(\theta^{(1)})}{f''(\theta^{(1)})}$, and the iterates $\theta^{(3)}, \theta^{(4)}, \dots$ converge to $\theta^*$.
  [Figure: the curve $h = f'(\theta)$ with successive tangent-line approximations from $\theta^{(0)}$ through $\theta^{(1)}, \theta^{(2)}, \dots$ to the root $\theta^*$]
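A tiny numerical illustration of the 1-D iteration; the objective f(theta) = e^theta - 2*theta is invented for this example (its minimizer is theta* = ln 2, found by driving f'(theta) = e^theta - 2 to zero).

```python
import math

theta = 0.0
for k in range(5):
    # theta := theta - f'(theta) / f''(theta), with f''(theta) = e^theta
    theta = theta - (math.exp(theta) - 2.0) / math.exp(theta)
    print(k + 1, theta)   # converges rapidly toward math.log(2) ≈ 0.6931
```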

  12. Newton's Method
  • Problem: $\arg\min_\theta f(\theta) \;\Leftrightarrow\; \text{solve } \nabla f(\theta) = 0$
  • Second-order Taylor expansion:
  $f(\theta) \approx \varphi(\theta) = f(\theta^{(k)}) + \nabla f(\theta^{(k)})^T (\theta - \theta^{(k)}) + \frac{1}{2} (\theta - \theta^{(k)})^T \nabla^2 f(\theta^{(k)})\, (\theta - \theta^{(k)})$
  $\nabla \varphi(\theta) = 0 \;\Rightarrow\; \theta = \theta^{(k)} - \left[\nabla^2 f(\theta^{(k)})\right]^{-1} \nabla f(\theta^{(k)})$
  • Newton's method (also called the Newton-Raphson method):
  $\theta^{(k+1)} = \theta^{(k)} - \left[\nabla^2 f(\theta^{(k)})\right]^{-1} \nabla f(\theta^{(k)})$, where $\nabla^2 f$ is the Hessian matrix.
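In the multivariate case the update can be sketched as follows; grad_f and hess_f are hypothetical user-supplied callables returning the gradient vector and Hessian matrix, and solving the linear system is preferred over forming the explicit inverse.

```python
def newton_method(theta0, grad_f, hess_f, n_iters=10):
    # theta^{k+1} = theta^k - [Hessian(theta^k)]^{-1} grad(theta^k)
    theta = np.asarray(theta0, dtype=float)
    for _ in range(n_iters):
        step = np.linalg.solve(hess_f(theta), grad_f(theta))  # solve H step = grad
        theta = theta - step
    return theta
```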

  13. Gradient Descent vs. Newton's Method
  [Figure comparing gradient descent and Newton's method]

  14. Newton's Method for Logistic Regression
  • Optimization problem:
  $\arg\min_\theta J(\theta) = \frac{1}{N} \sum_{i=1}^{N} \left[ -y^{(i)} \log h_\theta(x^{(i)}) - (1 - y^{(i)}) \log\left(1 - h_\theta(x^{(i)})\right) \right]$
  • Gradient and Hessian matrix:
  $\nabla J(\theta) = \frac{1}{N} \sum_{i=1}^{N} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x^{(i)}$
  $H = \frac{1}{N} \sum_{i=1}^{N} h_\theta(x^{(i)}) \left(1 - h_\theta(x^{(i)})\right) x^{(i)} (x^{(i)})^T$
  • Weight update using Newton's method:
  $\theta^{(t+1)} = \theta^{(t)} - H^{-1} \nabla J(\theta^{(t)})$
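Specializing the generic update with the gradient and Hessian above gives this sketch; the 1/N factors cancel inside H^{-1} ∇J and are kept only to mirror the slide.

```python
def newton_logreg(X, y, n_iters=10):
    N, D = X.shape
    theta = np.zeros(D)
    for _ in range(n_iters):
        h = hypothesis(theta, X)            # (N,) predicted probabilities
        grad = X.T @ (h - y) / N            # nabla J(theta)
        H = (X.T * (h * (1 - h))) @ X / N   # sum_i h_i (1 - h_i) x^(i) (x^(i))^T / N
        theta -= np.linalg.solve(H, grad)   # theta := theta - H^{-1} grad
    return theta
```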

  15. Practice: Logistic Regression
  • Given the following training data: http://openclassroom.stanford.edu/MainFolder/DocumentPage.php?course=DeepLearning&doc=exercises/ex4/ex4.html
  • Implement 1) GD, 2) SGD, and 3) Newton's method for logistic regression, starting from the initial parameter $\theta = 0$.
  • Determine how many iterations to use, compute the cost at each iteration, and plot your results.

  16. Softmax Regression

  17. Softmax Regression
  • Softmax regression is a multi-class classification model, also called multi-class logistic regression;
  • It is also known as the Maximum Entropy model (in NLP);
  • It is one of the most widely used classification algorithms.

  18. Model Description
  • Model hypothesis:
  $p(y = k \mid x; \theta) = h_k(x) = \frac{e^{\theta_k^T x}}{1 + \sum_{k'=1}^{K-1} e^{\theta_{k'}^T x}}, \quad k = 1, \dots, K-1$
  $p(y = K \mid x; \theta) = h_K(x) = \frac{1}{1 + \sum_{k'=1}^{K-1} \exp\{\theta_{k'}^T x\}}$
  • Model hypothesis (compact form):
  $p(y = k \mid x; \theta) = h_k(x) = \frac{e^{\theta_k^T x}}{\sum_{k'=1}^{K} e^{\theta_{k'}^T x}}, \quad k = 1, 2, \dots, K, \quad \text{where } \theta_K = 0$
  • Parameters: $\theta_{K \times D}$ (one weight vector per class, $D$ features)
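A sketch of the compact-form hypothesis: Theta is the (K, D) parameter matrix and the function returns the (N, K) matrix of class probabilities. Subtracting the row-wise maximum before exponentiating is a standard stability trick, not something the slide specifies.

```python
def softmax_hypothesis(Theta, X):
    # scores[i, k] = theta_k^T x^(i); h_k(x) = exp(score_k) / sum_k' exp(score_k')
    scores = X @ Theta.T                          # (N, K)
    scores -= scores.max(axis=1, keepdims=True)   # numerical stability; probabilities unchanged
    e = np.exp(scores)
    return e / e.sum(axis=1, keepdims=True)       # each row sums to 1
```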

  19. Maximum Likelihood Estimation
  • (Conditional) log-likelihood, softmax regression:
  $l(\theta) = \sum_{i=1}^{N} \log p(y^{(i)} \mid x^{(i)}; \theta) = \sum_{i=1}^{N} \log \prod_{k=1}^{K} \left( \frac{e^{\theta_k^T x^{(i)}}}{\sum_{k'=1}^{K} e^{\theta_{k'}^T x^{(i)}}} \right)^{1\{y^{(i)} = k\}}$
  $= \sum_{i=1}^{N} \sum_{k=1}^{K} 1\{y^{(i)} = k\} \log \frac{e^{\theta_k^T x^{(i)}}}{\sum_{k'=1}^{K} e^{\theta_{k'}^T x^{(i)}}} = \sum_{i=1}^{N} \sum_{k=1}^{K} 1\{y^{(i)} = k\} \log h_k(x^{(i)})$
  • For comparison, logistic regression:
  $l(\theta) = \sum_{i=1}^{N} \left[ y^{(i)} \log h_\theta(x^{(i)}) + (1 - y^{(i)}) \log\left(1 - h_\theta(x^{(i)})\right) \right]$
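As a sketch, the indicator 1{y^(i) = k} can be encoded as a one-hot matrix, after which the log-likelihood is a single sum; this reuses the illustrative softmax_hypothesis helper, and labels y are assumed to be integers 0, ..., K-1.

```python
def softmax_log_likelihood(Theta, X, y, K, eps=1e-12):
    # l(theta) = sum_i sum_k 1{y^(i) = k} log h_k(x^(i))
    H = softmax_hypothesis(Theta, X)        # (N, K)
    return np.sum(np.eye(K)[y] * np.log(H + eps))
```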

  20. Gradient Descent Optimization
  • Gradient:
  $\frac{\partial \log h_k(x)}{\partial \theta_l} = \begin{cases} (1 - h_l(x))\, x, & k = l \\ -h_l(x)\, x, & k \neq l \end{cases}$
  $\frac{\partial}{\partial \theta_l} \sum_{k=1}^{K} 1\{y = k\} \log h_k(x) = \begin{cases} (1 - h_l(x))\, x, & y = l \\ -h_l(x)\, x, & y \neq l \end{cases} = \left( 1\{y = l\} - h_l(x) \right) x$
  $\frac{\partial l(\theta)}{\partial \theta_l} = \sum_{i=1}^{N} \left( 1\{y^{(i)} = l\} - h_l(x^{(i)}) \right) x^{(i)}$   (Error × Feature)

  21. Gradient Descent Optimization
  • Gradient descent:
  $\theta_l := \theta_l + \alpha \sum_{i=1}^{N} \left( 1\{y^{(i)} = l\} - h_l(x^{(i)}) \right) x^{(i)}$, where $h_l(x) = \frac{e^{\theta_l^T x}}{\sum_{l'=1}^{K} e^{\theta_{l'}^T x}}, \; l = 1, 2, \dots, K$
  • Stochastic gradient descent:
  $\theta_l := \theta_l + \alpha \left( 1\{y = l\} - h_l(x) \right) x$
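Putting the gradient and update rule together, a minimal batch gradient-ascent sketch for softmax regression; the one-hot matrix encodes the indicator 1{y^(i) = l}, and alpha and n_iters are illustrative defaults.

```python
def softmax_gd(X, y, K, alpha=0.01, n_iters=1000):
    # X: (N, D); y: (N,) integer labels in {0, ..., K-1}
    Theta = np.zeros((K, X.shape[1]))
    Y = np.eye(K)[y]                            # (N, K) one-hot indicators
    for _ in range(n_iters):
        H = softmax_hypothesis(Theta, X)        # (N, K) class probabilities
        Theta += alpha * (Y - H).T @ X          # theta_l += alpha * sum_i (1{y^(i)=l} - h_l) x^(i)
    return Theta
```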

  22. Other Optimization Methods
  • Newton's Method
  • Quasi-Newton Methods (BFGS)
  • Limited-Memory BFGS (L-BFGS)
  • Conjugate Gradient
  • GIS (Generalized Iterative Scaling)
  • IIS (Improved Iterative Scaling)
  • …

  23. Practice: Softmax Regression
  • Given the following training data: http://openclassroom.stanford.edu/MainFolder/DocumentPage.php?course=DeepLearning&doc=exercises/ex4/ex4.html
  • Implement logistic regression with 1) GD and 2) SGD.
  • Implement softmax regression with 1) GD and 2) SGD.
  • Compare logistic regression and softmax regression.

  24. Questions?
