
Introduction to Machine Learning: Linear Regression (Prof. Andreas Krause) - PowerPoint PPT Presentation



1. Introduction to Machine Learning: Linear Regression. Prof. Andreas Krause, Learning and Adaptive Systems (las.ethz.ch)

2. Basic Supervised Learning Pipeline. [Diagram: training data labeled "spam" / "ham" → learning method → model f : X → Y (model fitting); the model is then applied to unlabeled test data to make predictions (generalization).]

3. Regression. An instance of supervised learning. Goal: predict real-valued labels (possibly vectors). Examples (X → Y): flight route → delay (minutes); real-estate objects → price; customer & ad features → click-through probability.

4. Running example: Diabetes [Efron et al. '04]. Features X: age, sex, body mass index, average blood pressure, six blood serum measurements (S1-S6). Label (target) Y: a quantitative measure of disease progression.

5. Regression. [Scatter plot of labeled data points in the (x, y) plane.] Goal: learn a real-valued mapping f : R^d → R.

6. Important choices in regression. What types of functions f should we consider? [Two example plots of candidate fits f(x) through the same data.] How should we measure goodness of fit?

7. Example: linear regression. [Scatter plot of data points with a linear fit.]

8. Homogeneous representation
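The slide's worked content is not in the transcript; the standard trick is to append a constant feature 1 to every input, so that an affine model w^T x + b becomes a purely linear model in R^(d+1). A minimal numpy sketch (the library choice and placeholder data are assumptions, not from the slide):

```python
import numpy as np

# Hypothetical data matrix: n = 100 points in R^3 (placeholder values).
X = np.random.randn(100, 3)

# Homogeneous representation: append a constant-1 feature to every point,
# so the affine model w^T x + b can be written as a linear model w~^T x~.
X_homog = np.hstack([X, np.ones((X.shape[0], 1))])
print(X_homog.shape)  # (100, 4)
```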

9. Quantifying goodness of fit. Data D = {(x_1, y_1), ..., (x_n, y_n)}, with x_i ∈ R^d and y_i ∈ R. [Scatter plot of the data in the (x, y) plane.]

10. Least-squares linear regression optimization [Legendre 1805, Gauss 1809]. Given the data set D = {(x_1, y_1), ..., (x_n, y_n)}, how do we find the optimal weight vector? ŵ = argmin_w Σ_{i=1}^n (y_i − w^T x_i)²

11. Method 1: Closed-form solution. The problem ŵ = argmin_w Σ_{i=1}^n (y_i − w^T x_i)² can be solved in closed form: ŵ = (X^T X)^{-1} X^T y. Hereby, X ∈ R^{n×d} is the data matrix whose rows are the x_i^T, and y ∈ R^n is the vector of labels y_i.
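A minimal numpy sketch of this closed-form (normal equations) solution, assuming X is the n×d data matrix and y the label vector; using np.linalg.solve rather than forming the explicit inverse is a common implementation choice, not something stated on the slide:

```python
import numpy as np

def least_squares_closed_form(X, y):
    """w* = argmin_w sum_i (y_i - w^T x_i)^2, solved via the normal equations.

    Mathematically equivalent to (X^T X)^{-1} X^T y; np.linalg.solve avoids
    explicitly inverting X^T X and is numerically more stable.
    """
    return np.linalg.solve(X.T @ X, X.T @ y)
```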

12. How to solve? Example: Scikit Learn
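The slide only names scikit-learn; a minimal sketch of what such a call might look like, using the diabetes dataset bundled with scikit-learn (matching the running example) and ordinary least squares via LinearRegression:

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression

# Diabetes data: 10 features (age, sex, BMI, blood pressure, S1-S6),
# target = a quantitative measure of disease progression one year after baseline.
X, y = load_diabetes(return_X_y=True)

model = LinearRegression()  # ordinary least-squares linear regression
model.fit(X, y)

print(model.coef_, model.intercept_)
print("R^2 on the training data:", model.score(X, y))
```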

13. Demo. [Plot: disease progression vs. body mass index.]

14. Method 2: Optimization. The objective function R̂(w) = Σ_i (y_i − w^T x_i)² is convex!
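Why the objective is convex (a standard argument, not transcribed from the slide): writing it in matrix form, the Hessian is constant and positive semi-definite,

```latex
\hat{R}(w) = \|Xw - y\|_2^2, \qquad
\nabla \hat{R}(w) = 2 X^\top (Xw - y), \qquad
\nabla^2 \hat{R}(w) = 2 X^\top X \succeq 0,
```

so R̂ is a convex quadratic function of w.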

15. Gradient Descent. Start at an arbitrary w_0 ∈ R^d. For t = 1, 2, ... do: w_{t+1} = w_t − η_t ∇R̂(w_t). Hereby, η_t is called the learning rate.
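A minimal sketch of this update rule; the fixed learning rate and iteration count are assumptions for illustration (the slides discuss step-size choices later):

```python
import numpy as np

def gradient_descent(grad, w0, eta=0.01, num_steps=1000):
    """Generic gradient descent: w_{t+1} = w_t - eta * grad(w_t)."""
    w = np.asarray(w0, dtype=float).copy()
    for _ in range(num_steps):
        w = w - eta * grad(w)
    return w
```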

16. Convergence of gradient descent. Under mild assumptions, if the step size is sufficiently small, gradient descent converges to a stationary point (gradient = 0). For convex objectives, it therefore finds the optimal solution! In the case of the squared loss, gradient descent with a suitably chosen constant step size converges linearly.

17. Computing the gradient
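The derivation itself is handwritten and not in the transcript; a standard computation gives ∇R̂(w) = −2 Σ_{i=1}^n (y_i − w^T x_i) x_i = 2 X^T (Xw − y). A numpy sketch of this gradient:

```python
import numpy as np

def squared_loss_gradient(w, X, y):
    """Gradient of R(w) = sum_i (y_i - w^T x_i)^2 with respect to w."""
    residuals = X @ w - y        # shape (n,)
    return 2 * X.T @ residuals   # shape (d,)
```

Plugging this into the gradient-descent sketch above, e.g. `gradient_descent(lambda w: squared_loss_gradient(w, X, y), w0=np.zeros(X.shape[1]))`, reproduces Method 2 for linear regression.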

18. Demo: Gradient descent

19. Choosing a step size. What happens if we choose a poor step size?

20. Adaptive step size. The step size can be updated adaptively, for example: 1) via line search (optimizing the step size at every step); 2) via the "bold driver" heuristic: if the function value decreases, increase the step size; if it increases, decrease the step size (a sketch follows below).
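A minimal sketch of the bold driver heuristic as described on the slide; the multiplicative factors (1.05 and 0.5) are assumed example values, the slide does not specify them:

```python
def bold_driver_step_size(eta, loss_new, loss_old, inc=1.05, dec=0.5):
    """'Bold driver' heuristic: grow the step size while the objective
    decreases, shrink it when the objective increases.
    The factors inc and dec are assumed example values."""
    if loss_new < loss_old:
        return eta * inc  # objective decreased: be bolder
    return eta * dec      # objective increased: back off
```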

21. Demo: Gradient Descent for Linear Regression

22. Gradient descent vs. closed form. Why would one ever consider performing gradient descent when it is possible to find a closed-form solution? Computational complexity (the closed form requires forming and solving a d×d system, roughly O(nd² + d³), whereas each gradient step costs only O(nd)); we may not need an optimal solution; and many problems don't admit a closed-form solution.

23. Other loss functions. So far we measured goodness of fit via the squared error. Many other loss functions are possible (and sensible!); a sketch of two alternatives follows below.
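The slide does not name specific alternatives; as an illustration, here is a sketch of two commonly used losses (absolute error and the Huber loss) alongside the squared error, each applied to a residual r = y − w^T x:

```python
import numpy as np

def squared_loss(r):
    return r ** 2

def absolute_loss(r):
    return np.abs(r)

def huber_loss(r, delta=1.0):
    """Quadratic for small residuals, linear for large ones
    (less sensitive to outliers than the squared loss)."""
    return np.where(np.abs(r) <= delta,
                    0.5 * r ** 2,
                    delta * (np.abs(r) - 0.5 * delta))
```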
