Machine Learning: Logistic Regression
Hamid R. Rabiee
Spring 2015
http://ce.sharif.edu/courses/93-94/2/ce717-1/
Agenda
- Probabilistic classification
- Introduction to logistic regression
- Binary logistic regression
- Logistic regression: decision surface
- Logistic regression: ML estimation
- Logistic regression: gradient descent
- Logistic regression: multi-class
- Logistic regression: regularization
- Logistic regression vs. Naïve Bayes
Probabilistic Classification
Generative probabilistic classification (previous lecture):
- Motivation: assume a distribution for each class and try to find the parameters of those distributions.
- Cons: need to assume distributions; need to fit many parameters.
Discriminative approach: logistic regression (focus of today):
- Motivation: like least squares, but with a logistic model $y(\mathbf{x}) = \sigma(\mathbf{w}^T\mathbf{x})$; classify based on whether $y(\mathbf{x}) > 0.5$.
- Technique: gradient descent.
Introduction to Logistic Regression
Logistic regression represents the probability of category $i$ using a linear function of the input variables. The name comes from the logit transformation, which maps a probability to log-odds and is modeled as linear in $\mathbf{x}$:
$$\mathrm{logit}(p) = \ln\frac{p}{1-p}, \qquad \ln\frac{P(Y=1 \mid \mathbf{x})}{1 - P(Y=1 \mid \mathbf{x})} = w_0 + \sum_{i=1}^{n} w_i x_i$$
Binary Logistic Regression
Logistic regression assumes a parametric form for the distribution $P(Y \mid X)$ and then directly estimates its parameters from the training data. The parametric model assumed by logistic regression in the case where $Y$ is boolean is:
$$P(Y=1 \mid \mathbf{x}, \mathbf{w}) = \frac{\exp\!\big(w_0 + \sum_{i=1}^{n} w_i x_i\big)}{1 + \exp\!\big(w_0 + \sum_{i=1}^{n} w_i x_i\big)} \quad (1)$$
$$P(Y=0 \mid \mathbf{x}, \mathbf{w}) = \frac{1}{1 + \exp\!\big(w_0 + \sum_{i=1}^{n} w_i x_i\big)} \quad (2)$$
Notice that equation (2) follows directly from equation (1), because the sum of these two probabilities must equal 1.
Binary Logistic Regression
With two classes we only need one set of parameters, since $P(Y=0 \mid \mathbf{x}) = 1 - P(Y=1 \mid \mathbf{x})$. The model is built on the sigmoid (logistic) function:
$$\sigma(z) = \frac{1}{1 + e^{-z}}$$
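As a concrete illustration, here is a minimal NumPy sketch of the sigmoid and the resulting class-1 probability. It is not from the slides; the weights and inputs are hypothetical, chosen only to show the mapping into (0, 1).

```python
import numpy as np

def sigmoid(z):
    # Logistic (sigmoid) function: maps any real number into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(w, X):
    # P(Y=1 | x, w) for each row of X; w[0] plays the role of the bias w0
    return sigmoid(w[0] + X @ w[1:])

# Hypothetical weights and two 2-feature inputs, just for illustration
w = np.array([-1.0, 2.0, 0.5])                   # [w0, w1, w2]
X = np.array([[1.0, 0.0], [0.0, 1.0]])
print(predict_proba(w, X))                       # ~[0.731, 0.378]
print((predict_proba(w, X) > 0.5).astype(int))   # [1, 0]
```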
Logistic Regression vs. Linear Regression
[Comparison figure; adapted from slides of John Whitehead]
Logistic Regression: Decision Surface
Given logistic regression weights $\mathbf{w}$ and an input $\mathbf{x}$:
- Decision surface: $P(Y=1 \mid \mathbf{x}, \mathbf{w}) = \text{constant}$, i.e. $w_0 + \sum_i w_i x_i = \text{constant}$.
- Decision surfaces are therefore linear functions of $\mathbf{x}$.
- Decision making on $Y$: predict $Y=1$ when $P(Y=1 \mid \mathbf{x}, \mathbf{w}) > 0.5$, which holds exactly when $w_0 + \sum_i w_i x_i > 0$ (see the check below).
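A quick numerical check, with hypothetical weights not taken from the slides, that the 0.5-probability contour is exactly the hyperplane $w_0 + \mathbf{w}^T\mathbf{x} = 0$:

```python
import numpy as np

# Hypothetical weights; the 0.5-probability contour is w0 + w.x = 0
w0, w = -1.0, np.array([2.0, 0.5])
x_on_boundary = np.array([0.5, 0.0])   # chosen so that w0 + w @ x = -1 + 1 = 0
z = w0 + w @ x_on_boundary
print(1.0 / (1.0 + np.exp(-z)))        # exactly 0.5: the surface is linear in x
```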
Computing the Likelihood in Detail
We can re-express the log of the conditional likelihood as:
$$\ell(\mathbf{w}) = \sum_{l} y^l \ln P(y^l=1 \mid \mathbf{x}^l, \mathbf{w}) + (1-y^l)\ln P(y^l=0 \mid \mathbf{x}^l, \mathbf{w})$$
$$= \sum_{l} y^l \ln \frac{P(y^l=1 \mid \mathbf{x}^l, \mathbf{w})}{P(y^l=0 \mid \mathbf{x}^l, \mathbf{w})} + \ln P(y^l=0 \mid \mathbf{x}^l, \mathbf{w})$$
$$= \sum_{l} y^l \Big(w_0 + \sum_{i=1}^{n} w_i x_i^l\Big) - \ln\Big(1 + \exp\big(w_0 + \sum_{i=1}^{n} w_i x_i^l\big)\Big)$$
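A short NumPy sketch of this final expression (not part of the slides); `np.logaddexp(0, z)` evaluates $\ln(1 + e^z)$ without overflow for large $z$:

```python
import numpy as np

def log_likelihood(w, X, y):
    # l(w) = sum_l [ y^l (w0 + w.x^l) - ln(1 + exp(w0 + w.x^l)) ]
    z = w[0] + X @ w[1:]
    return np.sum(y * z - np.logaddexp(0.0, z))

# Tiny hypothetical example: w = [w0, w1], two 1-feature samples
print(log_likelihood(np.array([0.0, 1.0]),
                     np.array([[2.0], [-2.0]]),
                     np.array([1, 0])))   # ~ -0.254
```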
Logistic Regression: ML Estimation
$\ell(\mathbf{w})$ is concave in $\mathbf{w}$. (What are concave and convex functions?) There is no closed-form solution for the maximizer, so we optimize numerically.
Optimizing Concave/Convex Functions
The maximum of a concave function $f$ coincides with the minimum of the convex function $-f$. Accordingly, we use gradient ascent for concave objectives and gradient descent for convex ones.
Gradient Ascent / Gradient Descent
For a function $f(\mathbf{w})$:
- If $f$ is concave, use the gradient ascent rule: $\mathbf{w} \leftarrow \mathbf{w} + \eta \, \nabla f(\mathbf{w})$
- If $f$ is convex, use the gradient descent rule: $\mathbf{w} \leftarrow \mathbf{w} - \eta \, \nabla f(\mathbf{w})$
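A toy sketch of the two rules (not from the slides): maximizing the concave $f(w) = -(w-3)^2$ by gradient ascent, which takes the same steps as minimizing the convex $-f$ by gradient descent. The step size and iteration count are arbitrary illustrative choices.

```python
import numpy as np

def gradient_ascent(grad_f, w, eta=0.1, steps=200):
    # Ascent rule for a concave f: w <- w + eta * grad f(w)
    # (descent on the convex -f, w <- w - eta * grad(-f)(w), is identical)
    for _ in range(steps):
        w = w + eta * grad_f(w)
    return w

# Concave toy objective f(w) = -(w - 3)^2, gradient -2(w - 3); maximum at w = 3
print(gradient_ascent(lambda w: -2.0 * (w - 3.0), w=np.array([0.0])))  # ~[3.0]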
Logistic Regression: Gradient Descent
Iteratively updating the weights in this fashion increases the likelihood at each round, so we eventually reach the maximum. We are near the maximum when the changes in the weights become small; thus, we can stop when the sum of the absolute values of the weight differences falls below some small threshold.
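Putting the pieces together, here is a minimal batch trainer with the stopping rule just described. This is a sketch, not the lecture's implementation: the toy data, learning rate `eta`, and tolerance `tol` are hypothetical choices.

```python
import numpy as np

def fit_logistic(X, y, eta=0.1, tol=1e-6, max_iter=10000):
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])   # prepend 1s so w[0] acts as w0
    w = np.zeros(Xb.shape[1])
    for _ in range(max_iter):
        # gradient of the log likelihood: sum_l x^l (y^l - P(Y=1 | x^l, w))
        grad = Xb.T @ (y - 1.0 / (1.0 + np.exp(-(Xb @ w))))
        w_new = w + eta * grad                      # ascent step on the likelihood
        if np.sum(np.abs(w_new - w)) < tol:         # stopping rule from this slide
            return w_new
        w = w_new
    return w

# Tiny overlapping toy set (hypothetical data) so the stopping rule fires
X = np.array([[0.0], [1.0], [1.0], [2.0]])
y = np.array([0, 0, 1, 1])
print(fit_logistic(X, y))
```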
Logistic Regression: Multi-Class
In the two-class case we used the logistic sigmoid. For multiclass problems we work with the softmax function instead:
$$P(Y=k \mid \mathbf{x}) = \frac{\exp(\mathbf{w}_k^T \mathbf{x})}{\sum_{j} \exp(\mathbf{w}_j^T \mathbf{x})}$$
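A small, numerically stable softmax sketch (the per-class scores are hypothetical, not values from the lecture):

```python
import numpy as np

def softmax(scores):
    # P(Y=k | x) = exp(w_k.x) / sum_j exp(w_j.x); shifting by the max
    # leaves the ratios unchanged but avoids overflow
    e = np.exp(scores - scores.max())
    return e / e.sum()

# Hypothetical per-class scores w_k.x for a 3-class problem
print(softmax(np.array([2.0, 1.0, 0.1])))   # ~[0.659, 0.242, 0.099], sums to 1
```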
Logistic Regression: Regularization
Overfitting the training data is a problem that can arise in logistic regression, especially when the data is very high-dimensional and sparse. One approach to reducing overfitting is regularization, in which we create a modified "penalized log likelihood function" that penalizes large values of $\mathbf{w}$:
$$\mathbf{w} = \arg\max_{\mathbf{w}} \sum_{l} \ln P(y^l \mid \mathbf{x}^l, \mathbf{w}) - \frac{\lambda}{2}\|\mathbf{w}\|^2$$
The derivative of this penalized log likelihood is similar to our earlier derivative, with one additional penalty term:
$$\frac{\partial \ell(\mathbf{w})}{\partial w_i} = \sum_{l} x_i^l \big(y^l - \hat{P}(y^l=1 \mid \mathbf{x}^l, \mathbf{w})\big) - \lambda w_i$$
which gives us the modified gradient ascent rule:
$$w_i \leftarrow w_i + \eta \Big(\sum_{l} x_i^l \big(y^l - \hat{P}(y^l=1 \mid \mathbf{x}^l, \mathbf{w})\big) - \lambda w_i\Big)$$
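A sketch of this penalized update, under the same hypothetical settings as the earlier trainer; `lam` stands in for the penalty weight $\lambda$, and for simplicity the bias $w_0$ is penalized too, which practical implementations often avoid.

```python
import numpy as np

def fit_logistic_l2(X, y, lam=0.1, eta=0.1, tol=1e-6, max_iter=10000):
    # Gradient ascent on sum_l ln P(y^l | x^l, w) - (lam/2) ||w||^2
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])
    w = np.zeros(Xb.shape[1])
    for _ in range(max_iter):
        # the extra -lam * w term shrinks large weights toward zero
        grad = Xb.T @ (y - 1.0 / (1.0 + np.exp(-(Xb @ w)))) - lam * w
        w_new = w + eta * grad
        if np.sum(np.abs(w_new - w)) < tol:
            return w_new
        w = w_new
    return w

X = np.array([[0.0], [1.0], [1.0], [2.0]])   # same hypothetical toy data
y = np.array([0, 0, 1, 1])
print(fit_logistic_l2(X, y))                 # smaller weights than the unpenalized fit
```

With the penalty term the objective has a finite maximizer even on linearly separable data, where the unpenalized likelihood would push the weights toward infinity.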
Logistic Regression vs. Naïve Bayes
In general, NB and LR make different assumptions:
- NB: features independent given the class → an assumption on P(X|Y)
- LR: a functional form for P(Y|X), no assumption on P(X|Y)
LR is a linear classifier: its decision rule is a hyperplane.
LR is optimized by conditional likelihood:
- no closed-form solution
- concave → global optimum via gradient ascent
Logistic Regression vs. Naïve Bayes
Consider $Y$ and the $X_i$ boolean, $X = \langle X_1, \dots, X_n \rangle$.
Number of parameters:
- NB: $2n + 1$
- LR: $n + 1$
Estimation method:
- NB parameter estimates are uncoupled
- LR parameter estimates are coupled
Logistic Regression vs. Gaussian Naïve Bayes
- When the GNB modeling assumptions do not hold, logistic regression and GNB typically learn different classifier functions.
- Although logistic regression is consistent with the Naïve Bayes assumption that the input features $X_i$ are conditionally independent given $Y$, it is not rigidly tied to this assumption as Naïve Bayes is.
- GNB parameter estimates converge toward their asymptotic values in order $\log n$ examples, where $n$ is the dimension of $X$. Logistic regression parameter estimates converge more slowly, requiring order $n$ examples.
Summary
- Logistic regression learns the conditional probability distribution P(y|x).
- Local search: it begins with an initial weight vector and modifies it iteratively to maximize an objective function.
- The objective function is the conditional log likelihood of the data, so the algorithm seeks the probability distribution P(y|x) that is most likely given the data.
Any Questions?
End of Lecture 9. Thank you!
Spring 2015
http://ce.sharif.edu/courses/93-94/2/ce717-1/