ECE 6254 - Spring 2020 - Lecture 8
Logistic Regression, Gradient Descent, and Newton's Method
Matthieu R. Bloch
v1.0 - revised January 30, 2020

1 Maximum Likelihood Estimator (MLE) for logistic classification

We will start with a standard trick to simplify notation, which consists in defining $\tilde{x} \triangleq [1, x^\intercal]^\intercal$ and $\theta \triangleq [b\; w^\intercal]^\intercal$. This allows us to write the logistic model as

$$\eta(x) \triangleq \eta_1(x) = \frac{1}{1 + \exp(-\theta^\intercal \tilde{x})}. \tag{1}$$

To avoid carrying a tilde repeatedly in our notation, we will now simply write $x$ in place of $\tilde{x}$, but keep in mind that we operate under the assumption that the first component of $x$ is set to one.

Given our dataset $\{(x_i, y_i)\}_{i=1}^N$, the likelihood is $L(\theta) \triangleq \prod_{i=1}^N P_\theta(y_i \mid x_i)$, where we do not try to model the distribution of $x_i$, as mentioned in Example ??. For $K = 2$ and $\mathcal{Y} = \{0, 1\}$, we obtain

$$L(\theta) \triangleq \prod_{i=1}^N \eta(x_i)^{y_i} (1 - \eta(x_i))^{1 - y_i}. \tag{2}$$

In case you are not familiar with this way of writing the likelihood, note that

$$\eta(x_i)^{y_i} (1 - \eta(x_i))^{1 - y_i} = \begin{cases} \eta(x_i) = \eta_1(x_i) & \text{if } y_i = 1,\\ 1 - \eta(x_i) = \eta_0(x_i) & \text{if } y_i = 0. \end{cases} \tag{3}$$

The log likelihood can therefore be written as

$$\ell(\theta) \triangleq \log L(\theta) = \sum_{i=1}^N \left( y_i \log \eta(x_i) + (1 - y_i) \log (1 - \eta(x_i)) \right) \tag{4}$$
$$= \sum_{i=1}^N \left( y_i \log \frac{1}{1 + e^{-\theta^\intercal x_i}} + (1 - y_i) \log \frac{e^{-\theta^\intercal x_i}}{1 + e^{-\theta^\intercal x_i}} \right) \tag{5}$$
$$= \sum_{i=1}^N \left( y_i \theta^\intercal x_i - \log \left(1 + e^{\theta^\intercal x_i}\right) \right). \tag{6}$$

The MLE is the solution of $\operatorname{argmin}_\theta -\ell(\theta)$. To find the minimum with respect to (w.r.t.) $\theta$, a necessary condition for optimality is $\nabla_\theta \ell(\theta) = 0$. Here, this means that

$$\nabla_\theta \ell(\theta) = \sum_{i=1}^N \left( y_i x_i - x_i \frac{e^{\theta^\intercal x_i}}{1 + e^{\theta^\intercal x_i}} \right) = \sum_{i=1}^N \left( y_i - \frac{1}{1 + e^{-\theta^\intercal x_i}} \right) x_i = 0. \tag{7}$$

Solving this equation means solving a nonlinear system of $d + 1$ equations, for which there exists no clear methodology. Hence, we must resort to a numerical algorithm to find the solution of $\operatorname{argmin}_\theta -\ell(\theta)$.
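Not part of the original notes: a minimal NumPy sketch of $-\ell(\theta)$ from (6) and its gradient from (7). The function names are my own, `np.logaddexp` is used only as a numerically stable way to compute $\log(1 + e^z)$, and `X` is assumed to already carry the leading column of ones introduced above.

```python
import numpy as np

def neg_log_likelihood(theta, X, y):
    """-ell(theta) from (6): sum_i [ log(1 + exp(theta^T x_i)) - y_i theta^T x_i ]."""
    z = X @ theta                      # z_i = theta^T x_i
    return np.sum(np.logaddexp(0.0, z) - y * z)

def neg_log_likelihood_grad(theta, X, y):
    """Gradient of -ell(theta) from (7): sum_i (eta(x_i) - y_i) x_i."""
    z = X @ theta
    eta = 1.0 / (1.0 + np.exp(-z))     # eta(x_i), the logistic model probability
    return X.T @ (eta - y)
```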
You should check for yourself that $-\ell(\theta)$ is convex in $\theta$, and there exist algorithms with provable convergence guarantees. We will mention a few specific techniques, such as gradient descent and Newton's method, but there are many more that are especially useful in high dimension.

2 Conclusion regarding plug-in methods

Naive Bayes, Linear Discriminant Analysis (LDA), and logistic classification are all plugin methods that result in linear classifiers, i.e., classifiers for which decision boundaries are hyperplanes in $\mathbb{R}^d$. All have advantages and drawbacks:

• Naive Bayes is a plugin method based on a seldom valid assumption (independence of features given the class), but which scales well to high dimensions and naturally handles mixtures of discrete and continuous features;
• LDA tends to work well if the assumption regarding the Gaussian distribution of the feature vectors in a class is valid;
• Logistic classification models only the distribution $P_{y|x}$, not $P_{y,x}$, which is valid for a larger class of distributions and results in fewer parameters to estimate.

Plugin methods can be useful in practice, but ultimately they are very limited. There are always distributions for which the assumptions are violated, and if our assumptions are wrong, the output is totally unpredictable. It can be hard to verify whether our assumptions are right, and plugin methods often require solving a more difficult problem as an intermediate step; see for instance the detour made by LDA to obtain a linear model.

3 Gradient descent and Newton's method

Assume that we wish to solve the problem $\min_{x \in \mathbb{R}^d} f(x)$ where $f : \mathbb{R}^d \to \mathbb{R}$. Very often one cannot obtain a closed form expression for the solution and one resorts to numerical algorithms to obtain the solution. We could spend an entire semester studying these algorithms in depth (and why they work), and we will only briefly review here important concepts.

Gradient descent

The idea of gradient descent is to find the minimum of $f$ iteratively by following the opposite direction of the gradient $\nabla f$. Intuitively, gradient descent consists in "rolling down the hill" to find the minimum. A typical gradient descent algorithm would run as follows (see the short code sketch below).

• Start with a guess of the solution $x^{(0)}$.
• For $j \geqslant 0$, update the estimate as $x^{(j+1)} = x^{(j)} - \eta \nabla f(x^{(j)})$, where $\eta > 0$ is called the stepsize.

Without further assumptions, there is no guarantee of convergence. In addition, the choice of step size $\eta$ really matters: too small and convergence takes forever, too big and the algorithm might never converge.
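As a concrete illustration (again not in the original notes), here is a minimal gradient descent sketch implementing the two steps above; the fixed stepsize, iteration budget, and stopping tolerance are arbitrary choices for the example.

```python
import numpy as np

def gradient_descent(grad_f, x0, eta=0.1, max_iter=1000, tol=1e-8):
    """Iterate x^(j+1) = x^(j) - eta * grad_f(x^(j)) starting from the guess x0."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad_f(x)
        if np.linalg.norm(g) < tol:    # stop once the gradient is (nearly) zero
            break
        x = x - eta * g
    return x

# Example: minimize f(x) = ||x||^2 / 2, whose gradient is x; the minimizer is 0.
x_star = gradient_descent(lambda x: x, x0=np.array([3.0, -2.0]))
```

Combined with `neg_log_likelihood_grad` from the earlier sketch, a call such as `gradient_descent(lambda t: neg_log_likelihood_grad(t, X, y), np.zeros(X.shape[1]), eta=0.01)` would fit the logistic model numerically.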
Very often in machine learning, the function $f$ to optimize is a loss function $\ell(\theta)$ that takes the form

$$\ell(\theta) \triangleq \sum_{i=1}^N \ell_i(\theta), \tag{8}$$

where $\ell_i(\theta)$ is a function of $\theta$ and the data point $(x_i, y_i)$ but not of the other data points. When the number of data points $N$ is very large, or when the data points cannot all be accessed at the same time, a typical approach is to not compute the exact gradient $\nabla \ell$ to perform updates but to only evaluate $\nabla \ell_i$. In its simplest form, stochastic gradient descent consists of the following rule.

• Start with a guess of the solution $\theta^{(0)}$.
• For $j \geqslant 0$, update the estimate as $\theta^{(j+1)} = \theta^{(j)} - \eta \nabla \ell_j(\theta^{(j)})$, where $\eta > 0$ is the stepsize.¹

Note that the update only depends on the loss function evaluated at a single point $(x_j, y_j)$.

¹ The data index may be chosen to be different from the iteration step index.

Newton's method

One drawback of the basic gradient descent sketched earlier is the presence of a parameter $\eta$, which has to be set a priori. There exist many methods to choose $\eta$ adaptively, and Newton's method is one of them. Specifically, consider a quadratic approximation $\tilde{f}$ of the function $f$ at a point $x$:

$$\tilde{f}(x') \triangleq f(x) + \nabla f(x)(x' - x) + \frac{1}{2}(x' - x)^\intercal \nabla^2 f(x)(x' - x). \tag{9}$$

The matrix $\nabla^2 f(x) \in \mathbb{R}^{d \times d}$ is called the Hessian of $f$ and is defined as

$$\nabla^2 f(x) = \begin{bmatrix} \frac{\partial^2 f}{\partial x_1^2} & \frac{\partial^2 f}{\partial x_1 \partial x_2} & \cdots & \frac{\partial^2 f}{\partial x_1 \partial x_d} \\ \frac{\partial^2 f}{\partial x_2 \partial x_1} & \frac{\partial^2 f}{\partial x_2^2} & \cdots & \frac{\partial^2 f}{\partial x_2 \partial x_d} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial^2 f}{\partial x_d \partial x_1} & \frac{\partial^2 f}{\partial x_d \partial x_2} & \cdots & \frac{\partial^2 f}{\partial x_d^2} \end{bmatrix}. \tag{10}$$

Note that the gradient of $\tilde{f}$ is

$$\nabla \tilde{f}(x') = \nabla f(x) + \nabla^2 f(x) x' - \nabla^2 f(x) x. \tag{11}$$

Let us now assume that we decide to update our next point in the gradient descent update by choosing $x^{(j+1)}$ as the minimum of the quadratic approximation of $f$ at $x^{(j)}$. Because $\tilde{f}$ is quadratic, we can find the minimum by solving $\nabla \tilde{f}(x^{(j+1)}) = 0$:

$$\nabla \tilde{f}(x^{(j+1)}) = 0 \;\Leftrightarrow\; \nabla f(x^{(j)}) + \nabla^2 f(x^{(j)}) x^{(j+1)} - \nabla^2 f(x^{(j)}) x^{(j)} = 0 \;\Leftrightarrow\; x^{(j+1)} = x^{(j)} - \left[\nabla^2 f(x^{(j)})\right]^{-1} \nabla f(x^{(j)}). \tag{12}$$

This looks like the gradient descent update equation, except that we have chosen the stepsize to be the matrix $\left[\nabla^2 f(x^{(j)})\right]^{-1}$; this adjusts how much we move as a function of the local curvature.

Newton's method has much faster convergence than gradient descent, but requires the calculation of the inverse of the Hessian. This is feasible when the dimension $d$ is small but impractical when $d$ is large. In many machine learning problems, researchers therefore focus on gradient descent techniques that attempt to adapt the step size without having to compute a full Hessian.
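To tie (12) back to the logistic regression problem of Section 1, here is a sketch (not in the original notes) of Newton's method applied to $-\ell(\theta)$. The Hessian expression $\sum_i \eta(x_i)(1 - \eta(x_i))\, x_i x_i^\intercal$ follows from differentiating (7); the function names, the iteration budget, and the use of `np.linalg.solve` instead of an explicit matrix inverse are my own choices.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def newton_logistic(X, y, max_iter=25, tol=1e-10):
    """Newton iterations (12) for minimizing -ell(theta) of the logistic model.

    X is assumed to already include a leading column of ones (the tilde-x trick).
    """
    n, d = X.shape
    theta = np.zeros(d)
    for _ in range(max_iter):
        eta = sigmoid(X @ theta)                        # eta(x_i) for all i
        grad = X.T @ (eta - y)                          # gradient of -ell(theta), cf. (7)
        H = X.T @ (X * (eta * (1.0 - eta))[:, None])    # Hessian: sum_i eta_i (1 - eta_i) x_i x_i^T
        step = np.linalg.solve(H, grad)                 # [H]^{-1} grad without forming the inverse
        theta = theta - step
        if np.linalg.norm(step) < tol:
            break
    return theta
```

Solving the linear system with `np.linalg.solve` rather than explicitly inverting the Hessian is standard numerical practice; for large $d$ even this $d \times d$ solve becomes the bottleneck, which is exactly the limitation discussed above.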