
Statistical Machine Learning, Lecture 08: Regression (Kristian Kersting)



  1. Statistical Machine Learning, Lecture 08: Regression. Kristian Kersting, TU Darmstadt, Summer Term 2020. Slides by K. Kersting, based on slides from J. Peters.

  2. Today's Objectives: make you understand how to learn a continuous function.
     Covered topics:
     - Linear regression and its interpretations
     - What is overfitting?
     - Deriving linear regression from maximum likelihood estimation
     - Bayesian linear regression

  3. Outline: 1. Introduction to Linear Regression; 2. Maximum Likelihood Approach to Regression; 3. Bayesian Linear Regression; 4. Wrap-Up.

  4. 1. Introduction to Linear Regression (section divider, repeating the outline above).

  5. 1. Introduction to Linear Regression: Reminder.
     Our task is to learn a mapping f from input to output: f : I → O, y = f(x; θ).
     Input: x ∈ I (images, text, sensor measurements, ...). Output: y ∈ O. Parameters: θ ∈ Θ (what needs to be "learned").
     Regression: learn a mapping into a continuous space, e.g. O = R, O = R³, ...

  6. 1. Introduction to Linear Regression: Motivation.
     You want to predict the torques of a robot arm:
     y = I q̈ − µ q̇ + m l g sin(q) = [q̈, q̇, sin(q)] [I, −µ, m l g]⊺ = φ(x)⊺ θ
     Can we do this with a data set D = { (x_i, y_i) | i = 1, ..., n }?
     A linear regression problem!
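To make the feature-map view above concrete, here is a minimal Python/NumPy sketch; the parameter values (I = 2.0, µ = 0.5, mlg = 3.0) and the random trajectory are made-up illustrations, not values from the lecture:

```python
import numpy as np

# Hypothetical joint trajectory: positions q, velocities dq, accelerations ddq
rng = np.random.default_rng(0)
q, dq, ddq = rng.normal(size=(3, 100))

# Torques generated from assumed "true" parameters theta = [I, -mu, m*l*g] = [2.0, -0.5, 3.0]
y = 2.0 * ddq - 0.5 * dq + 3.0 * np.sin(q)

# Feature map phi(x) = [ddq, dq, sin(q)] makes the model linear in theta
Phi = np.column_stack([ddq, dq, np.sin(q)])       # n x 3 feature matrix

theta, *_ = np.linalg.lstsq(Phi, y, rcond=None)   # least-squares estimate of theta
print(theta)                                      # close to [2.0, -0.5, 3.0]
```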

  7. 1. Introduction to Linear Regression: Least Squares Linear Regression.
     We are given pairs of training data points and associated function values (x_i, y_i):
     X = { x_1 ∈ R^d, ..., x_n },  Y = { y_1 ∈ R, ..., y_n }
     Note: here we only treat the case y_i ∈ R. In general y_i can have more than one dimension, i.e., y_i ∈ R^f for some positive f.
     Start with a linear regressor: x_i⊺ w + w_0 = y_i for all i = 1, ..., n
     One linear equation for each training data point/label pair.
     Exactly the same basic setup as for least-squares classification! Only the values are continuous.

  8. 1. Introduction to Linear Regression: Least Squares Linear Regression.
     x_i⊺ w + w_0 = y_i  for all i = 1, ..., n
     Step 1: Define the augmented vectors x̂_i = [x_i⊺, 1]⊺ and ŵ = [w⊺, w_0]⊺.

  9. 1. Introduction to Linear Regression: Least Squares Linear Regression (continued).
     Step 2: Rewrite the equations as x̂_i⊺ ŵ = y_i for all i = 1, ..., n.

  10. 1. Introduction to Linear Regression: Least Squares Linear Regression (continued).
     Step 3: Matrix-vector notation: X̂⊺ ŵ = y, where X̂ = [x̂_1, ..., x̂_n] (each x̂_i is a column vector) and y = [y_1, ..., y_n]⊺.

  11. 1. Introduction to Linear Regression: Least Squares Linear Regression.
     Step 4: Find the least squares solution
     ŵ = argmin_ŵ ‖X̂⊺ ŵ − y‖²
     ∇_ŵ ‖X̂⊺ ŵ − y‖² = 0
     ŵ = (X̂ X̂⊺)⁻¹ X̂ y
     A closed-form solution!
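A minimal sketch of the closed-form solution in Python/NumPy, using randomly generated data in place of a real training set; np.linalg.solve is used rather than an explicit matrix inverse (a standard numerical choice, not something prescribed by the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 3
X = rng.normal(size=(n, d))                                             # made-up inputs
y = X @ np.array([1.0, -2.0, 0.5]) + 0.3 + 0.01 * rng.normal(size=n)   # assumed "true" w and w_0

# Steps 1-3: append a constant 1 to each input so w_0 is absorbed into w-hat
Xhat = np.hstack([X, np.ones((n, 1))]).T          # shape (d+1) x n, columns are the x-hat_i

# Step 4: w-hat = (Xhat Xhat^T)^{-1} Xhat y, computed without forming the inverse explicitly
what = np.linalg.solve(Xhat @ Xhat.T, Xhat @ y)
print(what)                                       # approximately [1.0, -2.0, 0.5, 0.3]
```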

  12. 1. Introduction to Linear Regression: Least Squares Linear Regression.
     ŵ = (X̂ X̂⊺)⁻¹ X̂ y
     Where is the costly part of this computation?

  13. 1. Introduction to Linear Regression: Least Squares Linear Regression (continued).
     The inverse is of a D × D matrix. Naive inversion takes O(D³), but better methods exist.
     What can we do if the input dimension D is too large?

  14. 1. Introduction to Linear Regression: Least Squares Linear Regression (continued).
     What can we do if the input dimension D is too large?
     - Gradient descent
     - Work with fewer dimensions
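For the gradient-descent option, here is a small sketch on the same objective ‖X̂⊺ ŵ − y‖²; the step size, iteration count, and synthetic data are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 200, 5
Xhat = np.vstack([rng.normal(size=(d, n)), np.ones((1, n))])   # (d+1) x n design matrix, bias row included
w_true = rng.normal(size=d + 1)
y = Xhat.T @ w_true + 0.05 * rng.normal(size=n)

what = np.zeros(d + 1)
lr = 5e-5                                   # hand-tuned step size (assumption)
for _ in range(5000):
    residual = Xhat.T @ what - y            # gradient of ||Xhat^T w - y||^2 is 2 * Xhat @ residual
    what -= lr * 2 * Xhat @ residual

# Compare against the closed-form solution (Xhat Xhat^T)^{-1} Xhat y
print(np.allclose(what, np.linalg.solve(Xhat @ Xhat.T, Xhat @ y), atol=1e-3))
```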

  15. 1. Introduction to Linear Regression: Mechanical Interpretation (figure).

  16. 1. Introduction to Linear Regression: Geometric Interpretation.
     Predicted outputs are linear combinations of features! Samples are projected into this feature space.
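The projection view can be checked numerically: the fitted values X̂⊺ ŵ are H y with the hat matrix H = X̂⊺ (X̂ X̂⊺)⁻¹ X̂, which is an orthogonal projection onto the span of the feature columns. A small sketch with synthetic data (the variable names and data are mine, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(2)
Xhat = np.vstack([rng.normal(size=(2, 30)), np.ones((1, 30))])   # 3 feature rows (incl. bias), 30 samples
y = rng.normal(size=30)                                          # arbitrary targets

# Hat matrix H maps targets y to fitted values y_hat = Xhat^T w_hat
H = Xhat.T @ np.linalg.solve(Xhat @ Xhat.T, Xhat)
y_fit = H @ y

print(np.allclose(H @ H, H))               # H is a projection: applying it twice changes nothing
print(np.allclose(Xhat @ (y - y_fit), 0))  # the residual is orthogonal to every feature direction
```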

  17. 1. Introduction to Linear Regression: Polynomial Regression.
     How can we fit arbitrary polynomials using least-squares regression? We introduce a feature transformation as before:
     y(x) = w⊺ φ(x) = Σ_{i=0}^{M} w_i φ_i(x),  with φ_0(x) = 1
     The φ_i(·) are called the basis functions. This is still a linear model in the parameters w.
     E.g., fitting a cubic polynomial: φ(x) = [1, x, x², x³]⊺
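A sketch of cubic-polynomial fitting as linear regression in the basis φ(x) = [1, x, x², x³]⊺; the noisy-sine data is an assumption chosen to resemble the running example in the plots that follow:

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(0.0, 1.0, size=20)
y = np.sin(2 * np.pi * x) + 0.2 * rng.normal(size=20)   # noisy samples of a smooth target (assumed)

# Cubic feature map phi(x) = [1, x, x^2, x^3]; the model remains linear in w
Phi = np.vander(x, N=4, increasing=True)                 # n x (M+1) design matrix

w, *_ = np.linalg.lstsq(Phi, y, rcond=None)              # least-squares fit of the weights

x_new = np.linspace(0.0, 1.0, 5)
print(np.vander(x_new, N=4, increasing=True) @ w)        # predictions y(x) = w^T phi(x)
```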

  18. 1. Introduction to Linear Regression: Polynomial Regression. Polynomial of degree 0 (constant value) fitted to the data (figure).

  19. 1. Introduction to Linear Regression: Polynomial Regression. Polynomial of degree 1 (line) fitted to the data (figure).

  20. 1. Introduction to Linear Regression: Polynomial Regression. Polynomial of degree 3 (cubic) fitted to the data (figure).

  21. 1. Introduction to Linear Regression: Polynomial Regression. Polynomial of degree 9 fitted to the data (figure). Massive overfitting.
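The overfitting effect is easy to reproduce: with only 10 noisy training points, a degree-9 polynomial (10 free parameters) essentially interpolates the training data but typically does far worse on held-out points than a lower-degree fit. A sketch under the same assumed noisy-sine data model as above:

```python
import numpy as np

rng = np.random.default_rng(4)

def sample(n):
    """Draw n noisy samples from an assumed sin(2*pi*x) target."""
    x = rng.uniform(0.0, 1.0, size=n)
    return x, np.sin(2 * np.pi * x) + 0.2 * rng.normal(size=n)

x_train, y_train = sample(10)
x_test, y_test = sample(100)

for degree in (3, 9):
    Phi = np.vander(x_train, N=degree + 1, increasing=True)
    w, *_ = np.linalg.lstsq(Phi, y_train, rcond=None)
    train_err = np.mean((Phi @ w - y_train) ** 2)
    test_err = np.mean((np.vander(x_test, N=degree + 1, increasing=True) @ w - y_test) ** 2)
    print(degree, train_err, test_err)   # degree 9: tiny training error, typically much larger held-out error
```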

  22. 2. Maximum Likelihood Approach to Regression (section divider, repeating the outline above).

  23. 2. Maximum Likelihood Approach to Regression: Overfitting.
     Relatively little data leads to overfitting; enough data leads to a good estimate (two figures).

  24. 2. Maximum Likelihood Approach to Regression: Probabilistic Regression.
     Assumption 1: Our target function values are generated by adding noise to the function estimate:
     y = f(x, w) + ε
     y: target function value; f: regression function; x: input value; w: weights or parameters; ε: noise

  25. 2. Maximum Likelihood Approach to Regression: Probabilistic Regression (continued).
     Assumption 2: The noise is a Gaussian-distributed random variable:
     ε ∼ N(0, β⁻¹),  so  p(y | x, w, β) = N(y | f(x, w), β⁻¹)
     f(x, w) is the mean; β⁻¹ is the variance (β is the precision).
     Note that y is now a random variable with underlying probability distribution p(y | x, w, β).
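Under these two assumptions the log-likelihood of the data can be written down explicitly. A small numerical sketch (with assumed parameter values, not taken from the slides) illustrating that maximizing it over w amounts to least squares, and that the maximum-likelihood precision is the inverse of the mean squared residual:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 200
x = rng.normal(size=(n, 2))
Phi = np.hstack([x, np.ones((n, 1))])            # features plus a constant basis function
w_true = np.array([1.5, -0.7, 0.2])              # assumed "true" weights (illustration only)
beta_true = 25.0                                 # assumed noise precision
y = Phi @ w_true + rng.normal(scale=beta_true ** -0.5, size=n)   # y = f(x, w) + eps, eps ~ N(0, 1/beta)

# Maximizing the Gaussian likelihood over w is equivalent to least squares
w_ml, *_ = np.linalg.lstsq(Phi, y, rcond=None)
# Maximum-likelihood precision: inverse of the mean squared residual
beta_ml = 1.0 / np.mean((y - Phi @ w_ml) ** 2)

def neg_log_likelihood(w, beta):
    r = y - Phi @ w
    return 0.5 * beta * np.sum(r ** 2) - 0.5 * n * np.log(beta) + 0.5 * n * np.log(2.0 * np.pi)

print(w_ml, beta_ml)    # close to w_true and beta_true
print(neg_log_likelihood(w_ml, beta_ml) <= neg_log_likelihood(w_true, beta_true))   # True: the ML fit scores best
```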

  26. 2. Maximum Likelihood Approach to Regression: Probabilistic Regression (figure).
