Lecture 9: Logistic Regression (10-17-2008)
Review
• Training a Naïve Bayes classifier
  – estimate p(y) and p(x_i | y) for i = 1, ..., m
• Predicting with a Naïve Bayes classifier
  – p(y | X) = p(X | y) p(y) / p(X)
  – predict the y that maximizes p(y | X)
• Zero probabilities cause headaches for Bayes classifiers
  – Laplace smoothing
• Generative vs. discriminative approaches
  – Naïve Bayes is a generative approach
Logistic Regression
• Assume that the log odds of y = 1 is a linear function of x:
  log [ P(y=1 | x) / P(y=0 | x) ] = w_0 + w_1 x_1 + ... + w_m x_m
• Or equivalently:
  P(y=1 | x) = 1 / (1 + e^-(w_0 + w_1 x_1 + ... + w_m x_m))   (the sigmoid function)
• Side note: the odds in favor of an event are the quantity p / (1 - p), where p is the probability of the event. If I toss a fair die, what are the odds that I will get a six?
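A minimal Python sketch (NumPy assumed; the weights and the example x are made-up values, not from the lecture) showing that the sigmoid turns the linear score into P(y=1 | x), and that the log odds recover exactly that linear score:

import numpy as np

def sigmoid(v):
    # logistic (sigmoid) function: maps any real v into (0, 1)
    return 1.0 / (1.0 + np.exp(-v))

w0 = -1.0                          # intercept w_0 (hypothetical)
w = np.array([2.0, -0.5])          # weights w_1 ... w_m (hypothetical)
x = np.array([1.5, 3.0])           # one example (hypothetical)

p = sigmoid(w0 + w @ x)            # P(y=1 | x) under the logistic model
log_odds = np.log(p / (1 - p))     # equals w0 + w @ x, the linear score
print(p, log_odds)                 # ~0.622 and 0.5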
Learning w for logistic regression
• The sigmoid: t = 1 / (1 + e^-v)
• Given a set of training data points, we would like to find a weight vector w such that
  P(y=1 | X) = 1 / (1 + e^-(w_0 + w_1 x_1 + ... + w_m x_m))
  is large (e.g. close to 1) for positive training examples, and small (e.g. close to 0) otherwise.
• In other words, a good weight vector w should satisfy the following: if we plot the points (v = w · x^i, t = y^i), i = 1, ..., n, they should lie close to t = 0 (for y^i = 0) and close to t = 1 (for y^i = 1).
Learning w for logistic regression
• This can be captured in the following objective function:
  L(w) = Σ_i log P(y^i | x^i, w)
       = Σ_i [ y^i log P(y^i = 1 | x^i, w) + (1 - y^i) log(1 - P(y^i = 1 | x^i, w)) ]
• Note that the superscript i indexes the examples in the training set.
• This is called the likelihood function of w; by maximizing this objective function we perform what we call "maximum likelihood estimation" of the parameter w.
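A minimal sketch of this objective in Python (NumPy assumed; the toy X and y are made up, and X carries a leading column of 1s for the intercept w_0):

import numpy as np

def log_likelihood(w, X, y):
    # L(w) = sum_i [ y_i * log p_i + (1 - y_i) * log(1 - p_i) ]
    # where p_i = P(y_i = 1 | x_i, w) = sigmoid(w . x_i)
    p = 1.0 / (1.0 + np.exp(-(X @ w)))
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

X = np.array([[1.0, 0.5], [1.0, -1.2], [1.0, 2.0]])   # toy examples
y = np.array([1, 0, 1])                               # toy 0/1 labels
print(log_likelihood(np.zeros(2), X, y))              # 3 * log(0.5) when w = 0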
Maximum Likelihood Estimation
• Goal: estimate the parameters given data
• Assume the data is i.i.d. (independently and identically distributed)
• For example, given the results of n coin tosses, we would like to estimate the probability of heads, p
• Likelihood function:
  L(θ) = log P(D | θ) = log Π_{i=1..n} P(x^i, y^i | θ) = Σ_{i=1..n} log P(x^i, y^i | θ)
• MLE estimator:
  θ_MLE = argmax_θ L(θ)
Example
• Data: n i.i.d. coin tosses: D = {0, 0, 1, 0, ..., 1}
• Parameter: θ = P(x = 1)
• Binary distribution: P(x) = θ^x (1 - θ)^(1-x)
• Likelihood function?
• MLE estimate?
Example
• Data: n i.i.d. coin tosses: D = {0, 0, 1, 0, ..., 1}
• Parameter: θ = P(x = 1)
• Binary distribution: P(x) = θ^x (1 - θ)^(1-x)
• Likelihood function:
  L(θ) = log [ θ^(n_1) (1 - θ)^(n_0) ] = n_1 log θ + n_0 log(1 - θ)
  where n_1 is the number of 1s and n_0 the number of 0s in D
• MLE estimate:
  dL/dθ = n_1/θ - n_0/(1 - θ) = 0
  ⇒ n_1 (1 - θ) = n_0 θ
  ⇒ n_1 = (n_1 + n_0) θ
  ⇒ θ = n_1 / (n_1 + n_0)
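A quick numerical check of this derivation in Python (NumPy assumed; the toss sequence is made up): the closed-form MLE n_1 / (n_1 + n_0) matches the θ that maximizes the log-likelihood on a grid:

import numpy as np

D = np.array([0, 0, 1, 0, 1, 1, 1, 0, 1, 1])   # hypothetical coin tosses (1 = heads)
n1 = D.sum()
n0 = len(D) - n1

theta_mle = n1 / (n1 + n0)                     # closed-form MLE derived above

thetas = np.linspace(0.01, 0.99, 99)           # grid of candidate theta values
L = n1 * np.log(thetas) + n0 * np.log(1 - thetas)
print(theta_mle, thetas[np.argmax(L)])         # both ~0.6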
MLE for logistic regression
  L(W) = log P(D | W) = log Π_{i=1..n} P(X^i, y^i | W) = Σ_{i=1..n} log P(X^i, y^i | W)
       = Σ_{i=1..n} log [ P(y^i | X^i, W) P(X^i | W) ]
       = Σ_{i=1..n} log P(y^i | X^i, W) + Σ_{i=1..n} log P(X^i | W)
  W_MLE = argmax_W L(W) = argmax_W Σ_{i=1..n} log P(y^i | X^i, W)
        (the P(X^i | W) term does not depend on W in the discriminative model, so it drops out)
        = argmax_W Σ_{i=1..n} log [ y^i P(y^i = 1 | X^i, W) + (1 - y^i)(1 - P(y^i = 1 | X^i, W)) ]
• Equivalently, given a set of training data points, we would like to find a weight vector W such that P(y = 1 | X, W) is large (e.g. close to 1) for positive training examples, and small (e.g. close to 0) otherwise – the same as our earlier intuition.
Optimizing L(w)
• Unfortunately, this does not have a closed-form solution
• Instead, we iteratively search for the optimal w
• Start with a random w and iteratively improve it (similar to the Perceptron)
Logistic regression learning
[algorithm figure: gradient-based weight update; η is the learning rate]
Batch Learning for Logistic Regression

Given: training examples (x^i, y^i), i = 1, ..., N
Let w ← (0, 0, ..., 0)
Repeat until convergence:
    d ← (0, 0, ..., 0)
    For i = 1 to N do:
        ŷ^i ← 1 / (1 + e^(-w · x^i))
        error ← y^i - ŷ^i
        d ← d + error · x^i
    w ← w + η d

Note: y takes values 0/1 here, not 1/-1.
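A minimal NumPy sketch of this batch update (the toy data, the learning rate, and the fixed iteration count are assumptions standing in for "repeat until convergence"):

import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def train_logistic(X, y, eta=0.1, n_iters=1000):
    # Batch gradient ascent on the log-likelihood (the algorithm above).
    # X: (N, m) examples with a leading column of 1s for the intercept; y: (N,) 0/1 labels.
    w = np.zeros(X.shape[1])
    for _ in range(n_iters):             # stands in for "repeat until convergence"
        error = y - sigmoid(X @ w)       # y^i - P(y=1 | x^i, w) for every example
        d = X.T @ error                  # accumulate error * x^i over the whole batch
        w = w + eta * d                  # take one step of size eta in direction d
    return w

X = np.array([[1.0, 0.0], [1.0, 0.5], [1.0, 1.5], [1.0, 2.0]])   # toy, separable data
y = np.array([0, 0, 1, 1])
w = train_logistic(X, y)
print(np.round(sigmoid(X @ w)))          # recovers [0, 0, 1, 1]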
Logistic Regression vs. Perceptron
• Note the striking similarity between the two algorithms
• In fact, LR learns a linear decision boundary – how so?
  – P(y=1 | X) ≥ 1/2 exactly when W · X ≥ 0, so the decision boundary is the hyperplane W · X = 0 (shown on the board)
• What are the differences?
  – Different ways to train the weights
  – LR produces a probability estimate
  – (a maybe-not-so-interesting difference) LR was developed by statisticians, the Perceptron by computer scientists
There are more!
• If we assume a Gaussian distribution for p(x_i | y) in Naïve Bayes, p(y=1 | X) takes the same functional form as Logistic Regression
• What are the differences here?
  – Different ways of training
    • Naïve Bayes estimates θ_i by maximizing P(X | y = v_i, θ_i), and while doing so assumes conditional independence among the attributes
    • Logistic regression estimates w by maximizing P(y | x, w) and makes no conditional independence assumption
Comparatively
• Naïve Bayes – generative model: P(X | y)
  – makes a strong conditional independence assumption about the data attributes
  – when the assumption is OK, Naïve Bayes can use a small amount of training data and estimate a reasonable model
• Logistic regression – discriminative model: directly learns p(y | X)
  – has fewer parameters to estimate, but they are tied together, which makes learning harder
  – makes no strong assumptions
  – may need a large number of training examples
• Bottom line: if the Naïve Bayes assumption holds and the probabilistic models are accurate (i.e., x is Gaussian given y, etc.), NB is a good choice; otherwise, logistic regression works better
Summary
• We introduced the concept of generative vs. discriminative methods
  – Given a method that we discussed in class, you need to know which category it belongs to
• Logistic regression
  – Assumes that the log odds of y=1 is a linear function of X (i.e., W · X)
  – The learning goal is to find a weight vector W such that examples with y=1 are predicted to have high P(y=1 | X) and vice versa
    • Maximum likelihood estimation is an approach that achieves this
    • An iterative algorithm learns W using MLE
• Similarities and differences between LR and Perceptrons
  – Logistic regression learns a linear decision boundary