Learning From Data, Lecture 9: Logistic Regression and Gradient Descent (M. Magdon-Ismail, CSCI 4100/6100)


  1. Learning From Data, Lecture 9: Logistic Regression and Gradient Descent. Topics: Logistic Regression; Gradient Descent. M. Magdon-Ismail, CSCI 4100/6100.

  2. Recap: Linear Classification and Regression. The linear signal is $s = w^t x$. Good features are important: before looking at the data, we can reason that symmetry ($x_1$) and intensity ($x_2$) should be good features, based on our knowledge of the problem. Algorithms: linear classification, where the pocket algorithm can tolerate errors and is simple and efficient; and linear regression, with single-step learning $w = X^{\dagger} y = (X^t X)^{-1} X^t y$, a very efficient $O(Nd^2)$ exact algorithm.
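As a concrete reference for the recap, here is a minimal NumPy sketch of the one-step linear regression solution; the synthetic X and y below are hypothetical stand-ins, not the lecture's data.

```python
import numpy as np

# Hypothetical data: N examples, d features, plus a bias column of ones.
rng = np.random.default_rng(0)
N, d = 100, 2
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, d))])  # N x (d+1)
y = rng.normal(size=N)                                     # real-valued targets

# Single-step learning via the pseudoinverse: w = X^dagger y = (X^t X)^{-1} X^t y.
w = np.linalg.pinv(X) @ y
```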

  3. Predicting a Probability. Will someone have a heart attack over the next year? Inputs: age 62 years; gender male; blood sugar 120 mg/dL; HDL 50; LDL 120; mass 190 lbs; height 5′10′′; and so on. Classification answers yes/no; logistic regression gives the likelihood of a heart attack, $y \in [0,1]$, via $h(x) = \theta\left(\sum_{i=0}^{d} w_i x_i\right) = \theta(w^t x)$.

  4. Predicting a Probability (continued). The same heart-attack setup, with $h(x) = \theta\left(\sum_{i=0}^{d} w_i x_i\right) = \theta(w^t x)$, where $\theta$ is the logistic function $\theta(s) = \dfrac{e^s}{1+e^s} = \dfrac{1}{1+e^{-s}}$, which satisfies $\theta(-s) = \dfrac{e^{-s}}{1+e^{-s}} = \dfrac{1}{1+e^{s}} = 1 - \theta(s)$. [Figure: $\theta(s)$ rises smoothly from 0 to 1 as $s$ increases.]
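A minimal sketch of the logistic function and a check of the symmetry $\theta(-s) = 1 - \theta(s)$ used on later slides; the function name `theta` is my own choice.

```python
import numpy as np

def theta(s):
    """Logistic (sigmoid) function: theta(s) = 1 / (1 + e^{-s})."""
    return 1.0 / (1.0 + np.exp(-s))

s = np.linspace(-6.0, 6.0, 121)
# Symmetry used later to write P(y|x) compactly:
assert np.allclose(theta(-s), 1.0 - theta(s))
```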

  5. The Data is Still Binary, $\pm 1$. $D = (x_1, y_1 = \pm 1), \dots, (x_N, y_N = \pm 1)$, where $x_n$ is a person's health information and $y_n = \pm 1$ records whether they had a heart attack or not. We cannot measure a probability; we can only see the occurrence of an event and try to infer a probability.

  6. The Target Function is Inherently Noisy. $f(x) = P[y = +1 \mid x]$. The data is generated from a noisy target: $P(y \mid x) = f(x)$ for $y = +1$, and $P(y \mid x) = 1 - f(x)$ for $y = -1$.
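For intuition only, a sketch of how binary labels could be generated from such a noisy target; the particular f below is a made-up stand-in, not the lecture's target function.

```python
import numpy as np

rng = np.random.default_rng(1)

def f(x):
    """Hypothetical target probability P[y = +1 | x] (illustrative only)."""
    return 1.0 / (1.0 + np.exp(-x.sum()))

# We never observe f(x) itself, only the +/-1 outcome it generates:
X = rng.normal(size=(10, 3))
y = np.where(rng.random(10) < np.array([f(x) for x in X]), 1, -1)
```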

  7. What Makes an h Good? 'Fitting' the data means finding a good $h$: $h(x_n) \approx 1$ whenever $y_n = +1$, and $h(x_n) \approx 0$ whenever $y_n = -1$. A simple error measure that captures this is $E_{\text{in}}(h) = \frac{1}{N} \sum_{n=1}^{N} \left( h(x_n) - \tfrac{1}{2}(1 + y_n) \right)^2$, but it is not very convenient (hard to minimize). A code sketch of this measure follows below.
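A direct translation of that squared error measure, as a sketch; `h_vals` would hold $\theta(w^t x_n)$ for logistic regression, and the names are mine.

```python
import numpy as np

def squared_error(h_vals, y):
    """E_in = (1/N) sum_n (h(x_n) - (1 + y_n)/2)^2, with y_n in {-1, +1}."""
    targets = 0.5 * (1.0 + y)          # maps y = +1 -> 1 and y = -1 -> 0
    return np.mean((h_vals - targets) ** 2)
```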

  8. The Cross Entropy Error Measure. $E_{\text{in}}(w) = \frac{1}{N} \sum_{n=1}^{N} \ln\left(1 + e^{-y_n \cdot w^t x_n}\right)$. It looks complicated and ugly ($\ln$, $e^{(\cdot)}$, ...), but it is based on an intuitive probabilistic interpretation of $h$, and it is very convenient and mathematically friendly ('easy' to minimize). Verify: $y_n = +1$ encourages $w^t x_n \gg 0$, so $\theta(w^t x_n) \approx 1$; $y_n = -1$ encourages $w^t x_n \ll 0$, so $\theta(w^t x_n) \approx 0$.
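The cross-entropy error written out in NumPy, as a sketch (names are mine); np.logaddexp(0, z) computes ln(1 + e^z) in a numerically stable way.

```python
import numpy as np

def cross_entropy_error(w, X, y):
    """E_in(w) = (1/N) sum_n ln(1 + exp(-y_n * w^t x_n)), with y_n in {-1, +1}."""
    margins = y * (X @ w)                    # y_n * w^t x_n for each example
    return np.mean(np.logaddexp(0.0, -margins))
```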

  9. The Probabilistic Interpretation. Suppose that $h(x) = \theta(w^t x)$ closely captures $P[+1 \mid x]$: then $P(y \mid x) = \theta(w^t x)$ for $y = +1$, and $P(y \mid x) = 1 - \theta(w^t x)$ for $y = -1$.

  10. The Probabilistic Interpretation (continued). Since $1 - \theta(s) = \theta(-s)$, if $h(x) = \theta(w^t x)$ closely captures $P[+1 \mid x]$, then $P(y \mid x) = \theta(w^t x)$ for $y = +1$, and $P(y \mid x) = \theta(-w^t x)$ for $y = -1$.

  11. The Probabilistic Interpretation (continued). So, if $h(x) = \theta(w^t x)$ closely captures $P[+1 \mid x]$, then $P(y \mid x) = \theta(w^t x)$ for $y = +1$ and $\theta(-w^t x)$ for $y = -1$, or, more compactly, $P(y \mid x) = \theta(y \cdot w^t x)$.
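A quick numeric check, as a sketch, that the compact form $\theta(y \cdot w^t x)$ reproduces the two-case definition via $\theta(-s) = 1 - \theta(s)$; the values of s are arbitrary stand-ins for $w^t x$.

```python
import numpy as np

def theta(s):
    return 1.0 / (1.0 + np.exp(-s))

rng = np.random.default_rng(2)
s = rng.normal(size=1000)                        # stand-ins for w^t x
for y in (+1, -1):
    casewise = theta(s) if y == +1 else 1.0 - theta(s)
    assert np.allclose(theta(y * s), casewise)   # compact form matches both cases
```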

  12. The Likelihood. $P(y \mid x) = \theta(y \cdot w^t x)$. Recall that $(x_1, y_1), \dots, (x_N, y_N)$ are independently generated. Likelihood: the probability of getting the $y_1, \dots, y_N$ in $D$ from the corresponding $x_1, \dots, x_N$ is $P(y_1, \dots, y_N \mid x_1, \dots, x_N) = \prod_{n=1}^{N} P(y_n \mid x_n)$. The likelihood measures the probability that the data were generated if $f$ were $h$.

  13. Maximizing the Likelihood (why?).
     $\max_w \prod_{n=1}^{N} P(y_n \mid x_n)$
     $\Leftrightarrow \max_w \ln\left( \prod_{n=1}^{N} P(y_n \mid x_n) \right)$
     $\equiv \max_w \sum_{n=1}^{N} \ln P(y_n \mid x_n)$
     $\Leftrightarrow \min_w -\frac{1}{N} \sum_{n=1}^{N} \ln P(y_n \mid x_n)$
     $\equiv \min_w \frac{1}{N} \sum_{n=1}^{N} \ln \frac{1}{P(y_n \mid x_n)}$
     $\equiv \min_w \frac{1}{N} \sum_{n=1}^{N} \ln \frac{1}{\theta(y_n \cdot w^t x_n)}$   (here we specialize to our "model")
     $\equiv \min_w \frac{1}{N} \sum_{n=1}^{N} \ln\left(1 + e^{-y_n \cdot w^t x_n}\right)$
     That is, $E_{\text{in}}(w) = \frac{1}{N} \sum_{n=1}^{N} \ln\left(1 + e^{-y_n \cdot w^t x_n}\right)$.
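A numeric sanity check, on made-up data, that the cross-entropy error equals the scaled negative log-likelihood $-\frac{1}{N}\sum_n \ln P(y_n \mid x_n)$ when $P(y \mid x) = \theta(y \cdot w^t x)$; all names and data here are hypothetical.

```python
import numpy as np

def theta(s):
    return 1.0 / (1.0 + np.exp(-s))

rng = np.random.default_rng(3)
N, d = 50, 3
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, d))])
y = rng.choice([-1, 1], size=N)
w = rng.normal(size=d + 1)

neg_log_likelihood = -np.mean(np.log(theta(y * (X @ w))))
cross_entropy = np.mean(np.log1p(np.exp(-y * (X @ w))))
assert np.isclose(neg_log_likelihood, cross_entropy)
```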

  14. How To Minimize $E_{\text{in}}(w)$. Classification: PLA/Pocket (iterative). Regression: the pseudoinverse (analytic), from solving $\nabla_w E_{\text{in}}(w) = 0$. Logistic regression: an analytic solution won't work; instead, numerically/iteratively drive $\nabla_w E_{\text{in}}(w) \to 0$.

  15. Finding the Best Weights: Hill Descent. Think of a ball on a complicated hilly terrain; it rolls down into a local valley, which is called a local minimum. Questions: How do we get to the bottom of the deepest valley? How do we do this when we don't have gravity?

  16. Our $E_{\text{in}}$ Has Only One Valley. [Figure: in-sample error $E_{\text{in}}$ versus weights $w$, showing a single valley.] This is because $E_{\text{in}}(w)$ is a convex function of $w$. (So, who cares if it looks ugly!) A numeric illustration of convexity follows below.
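A small numeric illustration, under assumed synthetic data, that $E_{\text{in}}$ is convex in $w$: along any segment between two weight vectors the error stays below the chord. This only illustrates the claim; it is not a proof.

```python
import numpy as np

def cross_entropy_error(w, X, y):
    return np.mean(np.logaddexp(0.0, -y * (X @ w)))

rng = np.random.default_rng(4)
N, d = 100, 3
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, d))])
y = rng.choice([-1, 1], size=N)

w1, w2 = rng.normal(size=d + 1), rng.normal(size=d + 1)
e1, e2 = cross_entropy_error(w1, X, y), cross_entropy_error(w2, X, y)
for lam in np.linspace(0.0, 1.0, 11):
    along_segment = cross_entropy_error(lam * w1 + (1 - lam) * w2, X, y)
    chord = lam * e1 + (1 - lam) * e2
    assert along_segment <= chord + 1e-12        # convex: below the chord
```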

  17. How to "Roll Down"? Assume you are at weights $w(t)$ and you take a step of size $\eta$ in the direction $\hat{v}$: $w(t+1) = w(t) + \eta \hat{v}$. We get to pick $\hat{v}$; what is the best direction to take the step? Pick $\hat{v}$ to make $E_{\text{in}}(w(t+1))$ as small as possible.

  18. The Gradient is the Fastest Way to Roll Down. Approximate the change in $E_{\text{in}}$:
     $\Delta E_{\text{in}} = E_{\text{in}}(w(t+1)) - E_{\text{in}}(w(t)) = E_{\text{in}}(w(t) + \eta \hat{v}) - E_{\text{in}}(w(t)) = \eta \, \nabla E_{\text{in}}(w(t))^t \hat{v} + O(\eta^2)$   (Taylor approximation)
     $\geq -\eta \, \| \nabla E_{\text{in}}(w(t)) \|$, with the minimum attained at $\hat{v} = -\dfrac{\nabla E_{\text{in}}(w(t))}{\| \nabla E_{\text{in}}(w(t)) \|}$.
     The best (steepest) direction to move is the negative gradient: $\hat{v} = -\dfrac{\nabla E_{\text{in}}(w(t))}{\| \nabla E_{\text{in}}(w(t)) \|}$.
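A finite-difference sketch of the Taylor argument: for a small step $\eta$ along the unit direction $\hat{v} = -\nabla E_{\text{in}} / \|\nabla E_{\text{in}}\|$, the actual change in $E_{\text{in}}$ matches the first-order term $\eta \nabla E_{\text{in}}^t \hat{v} = -\eta \|\nabla E_{\text{in}}\|$ up to $O(\eta^2)$. The gradient formula used below is the standard one for this $E_{\text{in}}$ (Ex. 3.7 in LFD); the data are hypothetical.

```python
import numpy as np

def error(w, X, y):
    return np.mean(np.logaddexp(0.0, -y * (X @ w)))

def gradient(w, X, y):
    # grad E_in(w) = -(1/N) sum_n y_n x_n / (1 + exp(y_n * w^t x_n))
    return -(X * (y / (1.0 + np.exp(y * (X @ w))))[:, None]).mean(axis=0)

rng = np.random.default_rng(5)
N, d = 200, 3
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, d))])
y = rng.choice([-1, 1], size=N)
w, eta = rng.normal(size=d + 1), 1e-3

g = gradient(w, X, y)
v_hat = -g / np.linalg.norm(g)                   # steepest-descent direction
actual = error(w + eta * v_hat, X, y) - error(w, X, y)
first_order = eta * (g @ v_hat)                  # = -eta * ||g||
assert abs(actual - first_order) < 10 * eta**2   # agreement up to O(eta^2)
```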

  19. "Rolling Down" ≡ Iterating the Negative Gradient. Starting from $w(0)$, repeatedly step along the negative gradient: $w(0) \to w(1) \to w(2) \to w(3) \to \dots$ [Figure: the descent trajectory with $\eta = 0.5$ after 15 steps.]

  20. The 'Goldilocks' Step Size. [Figure: three plots of in-sample error $E_{\text{in}}$ versus weights $w$.] If $\eta$ is too small ($\eta = 0.1$; 75 steps), progress is slow; if $\eta$ is too large ($\eta = 2$; 10 steps), the steps overshoot; a variable step size $\eta_t$ is just right (10 steps).

  21. Fixed Learning Rate Gradient Descent.
     1: Initialize at step $t = 0$ to $w(0)$.
     2: for $t = 0, 1, 2, \dots$ do
     3:   Compute the gradient $g_t = \nabla E_{\text{in}}(w(t))$ (Ex. 3.7 in LFD).
     4:   Move in the direction $v_t = -g_t$.
     5:   Update the weights: $w(t+1) = w(t) + \eta v_t$.
     6:   Iterate 'until it is time to stop'.
     7: end for
     8: Return the final weights.
     Why a fixed rate: choosing the variable step size $\eta_t = \eta \cdot \| \nabla E_{\text{in}}(w(t)) \|$, which shrinks because $\| \nabla E_{\text{in}}(w(t)) \| \to 0$ closer to the minimum, gives $v = -\eta_t \cdot \dfrac{\nabla E_{\text{in}}(w(t))}{\| \nabla E_{\text{in}}(w(t)) \|} = -\eta \cdot \nabla E_{\text{in}}(w(t))$.
     Gradient descent can minimize any smooth function, for example the logistic regression error $E_{\text{in}}(w) = \frac{1}{N} \sum_{n=1}^{N} \ln\left(1 + e^{-y_n \cdot w^t x_n}\right)$. A code sketch of this loop follows below.
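Below is a compact sketch of this fixed-learning-rate loop applied to the logistic regression error. The stopping rule, $\eta$, iteration cap, and synthetic data are my own choices; the gradient is the standard cross-entropy gradient (Ex. 3.7 in LFD).

```python
import numpy as np

def cross_entropy_error(w, X, y):
    """E_in(w) = (1/N) sum_n ln(1 + exp(-y_n * w^t x_n))."""
    return np.mean(np.logaddexp(0.0, -y * (X @ w)))

def gradient(w, X, y):
    """grad E_in(w) = -(1/N) sum_n y_n x_n / (1 + exp(y_n * w^t x_n))."""
    return -(X * (y / (1.0 + np.exp(y * (X @ w))))[:, None]).mean(axis=0)

def gradient_descent(X, y, eta=0.5, max_iters=1000, tol=1e-6):
    """Fixed learning rate gradient descent: w(t+1) = w(t) - eta * g_t."""
    w = np.zeros(X.shape[1])               # 1: initialize w(0)
    for _ in range(max_iters):             # 2: for t = 0, 1, 2, ...
        g = gradient(w, X, y)              # 3: compute the gradient g_t
        if np.linalg.norm(g) < tol:        #    ||g_t|| -> 0 near the minimum,
            break                          # 6: so stop when it is small enough
        w = w - eta * g                    # 4-5: step along v_t = -g_t
    return w                               # 8: return the final weights

# Hypothetical usage on synthetic data drawn from a logistic model:
rng = np.random.default_rng(6)
N, d = 200, 2
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, d))])
w_true = np.array([0.5, 2.0, -1.0])
y = np.where(rng.random(N) < 1.0 / (1.0 + np.exp(-X @ w_true)), 1, -1)

w_hat = gradient_descent(X, y)
print(cross_entropy_error(w_hat, X, y))
```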
