ECE 6254 - Spring 2020 - Lecture 8
Logistic Regression, Gradient Descent, and Newton's Method
Matthieu R. Bloch
v1.0 - revised January 30, 2020

1 Maximum Likelihood Estimator (MLE) for logistic classification

We will start with a standard trick to simplify notation, which consists in defining $\tilde{x} \triangleq [1, x^\intercal]^\intercal$ and $\theta \triangleq [b\; w^\intercal]^\intercal$. This allows us to write the logistic model as

$$\eta(x) \triangleq \eta_1(x) = \frac{1}{1 + \exp(-\theta^\intercal \tilde{x})}. \tag{1}$$

To avoid carrying a tilde repeatedly in our notation, we will now simply write $x$ in place of $\tilde{x}$, but keep in mind that we operate under the assumption that the first component of $x$ is set to one.

Given our dataset $\{(x_i, y_i)\}_{i=1}^N$, the likelihood is $L(\theta) \triangleq \prod_{i=1}^N P_\theta(y_i \mid x_i)$, where we do not try to model the distribution of $x_i$, as mentioned in Example ??. For $K = 2$ and $\mathcal{Y} = \{0, 1\}$, we obtain

$$L(\theta) \triangleq \prod_{i=1}^N \eta(x_i)^{y_i} (1 - \eta(x_i))^{1 - y_i}. \tag{2}$$

In case you are not familiar with this way of writing the likelihood, note that

$$\eta(x_i)^{y_i} (1 - \eta(x_i))^{1 - y_i} = \begin{cases} \eta(x_i) = \eta_1(x_i) & \text{if } y_i = 1,\\ 1 - \eta(x_i) = \eta_0(x_i) & \text{if } y_i = 0. \end{cases} \tag{3}$$

The log likelihood can therefore be written as

$$\ell(\theta) \triangleq \log L(\theta) = \sum_{i=1}^N \left( y_i \log \eta(x_i) + (1 - y_i) \log (1 - \eta(x_i)) \right) \tag{4}$$
$$= \sum_{i=1}^N \left( y_i \log \frac{1}{1 + e^{-\theta^\intercal x_i}} + (1 - y_i) \log \frac{e^{-\theta^\intercal x_i}}{1 + e^{-\theta^\intercal x_i}} \right) \tag{5}$$
$$= \sum_{i=1}^N \left( y_i \theta^\intercal x_i - \log \left(1 + e^{\theta^\intercal x_i}\right) \right). \tag{6}$$

The MLE is the solution of $\operatorname{argmin}_\theta -\ell(\theta)$. To find the minimum with respect to (w.r.t.) $\theta$, a necessary condition for optimality is $\nabla_\theta \ell(\theta) = 0$. Here, this means that

$$\nabla_\theta \ell(\theta) = \sum_{i=1}^N \left( y_i x_i - x_i \frac{e^{\theta^\intercal x_i}}{1 + e^{\theta^\intercal x_i}} \right) = \sum_{i=1}^N \left( y_i - \frac{1}{1 + e^{-\theta^\intercal x_i}} \right) x_i = 0. \tag{7}$$

Solving this equation means solving a nonlinear system of $d + 1$ equations, for which there exists no clear methodology. Hence, we must resort to a numerical algorithm to find the solution of $\operatorname{argmin}_\theta -\ell(\theta)$.
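Not part of the original notes: a minimal NumPy sketch of $-\ell(\theta)$ from (6) and its gradient from (7). The function names are my own, `np.logaddexp` is used only as a numerically stable way to compute $\log(1 + e^z)$, and `X` is assumed to already carry the leading column of ones introduced above.

```python
import numpy as np

def neg_log_likelihood(theta, X, y):
    """-ell(theta) from (6): sum_i [ log(1 + exp(theta^T x_i)) - y_i theta^T x_i ]."""
    z = X @ theta                      # z_i = theta^T x_i
    return np.sum(np.logaddexp(0.0, z) - y * z)

def neg_log_likelihood_grad(theta, X, y):
    """Gradient of -ell(theta) from (7): sum_i (eta(x_i) - y_i) x_i."""
    z = X @ theta
    eta = 1.0 / (1.0 + np.exp(-z))     # eta(x_i), the logistic model probability
    return X.T @ (eta - y)
```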
You should check for yourself that $-\ell(\theta)$ is convex in $\theta$, and there exist algorithms with provable convergence guarantees. We will mention a few specific techniques, such as gradient descent and Newton's method, but there are many more that are especially useful in high dimension.

2 Conclusion regarding plug-in methods

Naive Bayes, Linear Discriminant Analysis (LDA), and logistic classification are all plugin methods that result in linear classifiers, i.e., classifiers for which decision boundaries are hyperplanes in $\mathbb{R}^d$. All have advantages and drawbacks:

• Naive Bayes is a plugin method based on a seldom valid assumption (independence of features given the class), but which scales well to high dimensions and naturally handles mixtures of discrete and continuous features;
• LDA tends to work well if the assumption regarding the Gaussian distribution of the feature vectors in a class is valid;
• Logistic classification models only the distribution $P_{y|x}$, not $P_{y,x}$, which is valid for a larger class of distributions and results in fewer parameters to estimate.

Plugin methods can be useful in practice, but ultimately they are very limited. There are always distributions for which the assumptions are violated, and if our assumptions are wrong, the output is totally unpredictable. It can be hard to verify whether our assumptions are right, and plugin methods often require solving a more difficult problem as an intermediate step; see for instance the detour made by LDA to obtain a linear model.

3 Gradient descent and Newton's method

Assume that we wish to solve the problem $\min_{x \in \mathbb{R}^d} f(x)$ where $f : \mathbb{R}^d \to \mathbb{R}$. Very often one cannot obtain a closed form expression for the solution and one resorts to numerical algorithms to obtain the solution. We could spend an entire semester studying these algorithms in depth (and why they work), and we will only briefly review here important concepts.

Gradient descent

The idea of gradient descent is to find the minimum of $f$ iteratively by following the opposite direction of the gradient $\nabla f$. Intuitively, gradient descent consists in "rolling down the hill" to find the minimum. A typical gradient descent algorithm would run as follows (see the short code sketch below).

• Start with a guess of the solution $x^{(0)}$.
• For $j \geqslant 0$, update the estimate as $x^{(j+1)} = x^{(j)} - \eta \nabla f(x^{(j)})$, where $\eta > 0$ is called the stepsize.

Without further assumptions, there is no guarantee of convergence. In addition, the choice of step size $\eta$ really matters: too small and convergence takes forever, too big and the algorithm might never converge.
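As a concrete illustration (again not in the original notes), here is a minimal gradient descent sketch implementing the two steps above; the fixed stepsize, iteration budget, and stopping tolerance are arbitrary choices for the example.

```python
import numpy as np

def gradient_descent(grad_f, x0, eta=0.1, max_iter=1000, tol=1e-8):
    """Iterate x^(j+1) = x^(j) - eta * grad_f(x^(j)) starting from the guess x0."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad_f(x)
        if np.linalg.norm(g) < tol:    # stop once the gradient is (nearly) zero
            break
        x = x - eta * g
    return x

# Example: minimize f(x) = ||x||^2 / 2, whose gradient is x; the minimizer is 0.
x_star = gradient_descent(lambda x: x, x0=np.array([3.0, -2.0]))
```

Combined with `neg_log_likelihood_grad` from the earlier sketch, a call such as `gradient_descent(lambda t: neg_log_likelihood_grad(t, X, y), np.zeros(X.shape[1]), eta=0.01)` would fit the logistic model numerically.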
Very often in machine learning, the function $f$ to optimize is a loss function $\ell(\theta)$ that takes the form

$$\ell(\theta) \triangleq \sum_{i=1}^N \ell_i(\theta), \tag{8}$$

where $\ell_i(\theta)$ is a function of $\theta$ and the data point $(x_i, y_i)$ but not of the other data points. When the number of data points $N$ is very large, or when the data points cannot all be accessed at the same time, a typical approach is to not compute the exact gradient $\nabla \ell$ to perform updates but to only evaluate $\nabla \ell_i$. In its simplest form, stochastic gradient descent consists of the following rule.

• Start with a guess of the solution $\theta^{(0)}$.
• For $j \geqslant 0$, update the estimate as $\theta^{(j+1)} = \theta^{(j)} - \eta \nabla \ell_j(\theta^{(j)})$, where $\eta > 0$ is the stepsize.¹

Note that the update only depends on the loss function evaluated at a single point $(x_j, y_j)$.

¹ The data index may be chosen to be different from the iteration step index.

Newton's method

One drawback of the basic gradient descent sketched earlier is the presence of a parameter $\eta$, which has to be set a priori. There exist many methods to choose $\eta$ adaptively, and Newton's method is one of them. Specifically, consider a quadratic approximation $\tilde{f}$ of the function $f$ at a point $x$:

$$\tilde{f}(x') \triangleq f(x) + \nabla f(x)(x' - x) + \frac{1}{2}(x' - x)^\intercal \nabla^2 f(x)(x' - x). \tag{9}$$

The matrix $\nabla^2 f(x) \in \mathbb{R}^{d \times d}$ is called the Hessian of $f$ and is defined as

$$\nabla^2 f(x) = \begin{bmatrix} \frac{\partial^2 f}{\partial x_1^2} & \frac{\partial^2 f}{\partial x_1 \partial x_2} & \cdots & \frac{\partial^2 f}{\partial x_1 \partial x_d} \\ \frac{\partial^2 f}{\partial x_2 \partial x_1} & \frac{\partial^2 f}{\partial x_2^2} & \cdots & \frac{\partial^2 f}{\partial x_2 \partial x_d} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial^2 f}{\partial x_d \partial x_1} & \frac{\partial^2 f}{\partial x_d \partial x_2} & \cdots & \frac{\partial^2 f}{\partial x_d^2} \end{bmatrix}. \tag{10}$$

Note that the gradient of $\tilde{f}$ is

$$\nabla \tilde{f}(x') = \nabla f(x) + \nabla^2 f(x) x' - \nabla^2 f(x) x. \tag{11}$$

Let us now assume that we decide to update our next point in the gradient descent update by choosing $x^{(j+1)}$ as the minimum of the quadratic approximation of $f$ at $x^{(j)}$. Because $\tilde{f}$ is quadratic, we can find the minimum by solving $\nabla \tilde{f}(x^{(j+1)}) = 0$:

$$\nabla \tilde{f}(x^{(j+1)}) = 0 \;\Leftrightarrow\; \nabla f(x^{(j)}) + \nabla^2 f(x^{(j)}) x^{(j+1)} - \nabla^2 f(x^{(j)}) x^{(j)} = 0 \;\Leftrightarrow\; x^{(j+1)} = x^{(j)} - \left[\nabla^2 f(x^{(j)})\right]^{-1} \nabla f(x^{(j)}). \tag{12}$$

This looks like the gradient descent update equation, except that we have chosen the stepsize to be the matrix $\left[\nabla^2 f(x^{(j)})\right]^{-1}$; this adjusts how much we move as a function of the local curvature.

Newton's method has much faster convergence than gradient descent, but requires the calculation of the inverse of the Hessian. This is feasible when the dimension $d$ is small but impractical when $d$ is large. In many machine learning problems, researchers therefore focus on gradient descent techniques that attempt to adapt the step size without having to compute a full Hessian.
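To tie (12) back to the logistic regression problem of Section 1, here is a sketch (not in the original notes) of Newton's method applied to $-\ell(\theta)$. The Hessian expression $\sum_i \eta(x_i)(1 - \eta(x_i))\, x_i x_i^\intercal$ follows from differentiating (7); the function names, the iteration budget, and the use of `np.linalg.solve` instead of an explicit matrix inverse are my own choices.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def newton_logistic(X, y, max_iter=25, tol=1e-10):
    """Newton iterations (12) for minimizing -ell(theta) of the logistic model.

    X is assumed to already include a leading column of ones (the tilde-x trick).
    """
    n, d = X.shape
    theta = np.zeros(d)
    for _ in range(max_iter):
        eta = sigmoid(X @ theta)                        # eta(x_i) for all i
        grad = X.T @ (eta - y)                          # gradient of -ell(theta), cf. (7)
        H = X.T @ (X * (eta * (1.0 - eta))[:, None])    # Hessian: sum_i eta_i (1 - eta_i) x_i x_i^T
        step = np.linalg.solve(H, grad)                 # [H]^{-1} grad without forming the inverse
        theta = theta - step
        if np.linalg.norm(step) < tol:
            break
    return theta
```

Solving the linear system with `np.linalg.solve` rather than explicitly inverting the Hessian is standard numerical practice; for large $d$ even this $d \times d$ solve becomes the bottleneck, which is exactly the limitation discussed above.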