  1. Chapter 4: Training Regression Models. Dr. Xudong Liu, Assistant Professor, School of Computing, University of North Florida. Monday, 10/14/2019.

  2. Overview: linear regression (normal equation; gradient descent: batch, stochastic, and mini-batch); polynomial regression; regularization for linear models (Ridge regression, Lasso regression, Elastic Net); logistic regression; softmax regression.

  3. Linear Regression Model (figure)

  4. Linear Regression Model (figure)

  5. Learning Linear Regression Models. We try to learn θ so that the MSE cost function MSE(θ) = (1/m) ∑_{i=1}^{m} (θ^T · x^(i) − y^(i))^2 is minimized.

  6. Normal Equation. To find the θ that minimizes the cost function, we can apply the closed-form solution θ̂ = (X^T · X)^(−1) · X^T · y. How is it derived?

  7. Normal Equation. Under this cost function, we are looking for a line that minimizes the sum of squared residuals, i.e., the squared vertical distances from the data points to the line. The cost function is convex, so all we need to do is compute its partial derivative w.r.t. θ and set it to 0 to solve for θ. Note that the θ that minimizes (1/m) ∑_{i=1}^{m} (θ^T · x^(i) − y^(i))^2 also minimizes ∑_{i=1}^{m} (θ^T · x^(i) − y^(i))^2.

  8. Normal Equation
     (1) E_θ = ∑_{i=1}^{m} (θ^T · x^(i) − y^(i))^2 = (y − Xθ)^T (y − Xθ)
     (2) ∂E_θ/∂θ = (−X)^T (y − Xθ) + (−X)^T (y − Xθ), because ∂(u^T v)/∂x = (∂u^T/∂x) v + (∂v^T/∂x) u
     (3) Thus, ∂E_θ/∂θ = 2 X^T (Xθ − y)
     (4) Setting ∂E_θ/∂θ = 0, we can solve for θ̂ = (X^T · X)^(−1) · X^T · y

  9. Normal Equation Experiment (figure)
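
A minimal NumPy sketch of such a normal-equation experiment; the data-generating line (y = 4 + 3x plus Gaussian noise), the sample size, and the seed are assumptions for illustration, not taken from the slide.

```python
import numpy as np

# Hypothetical data: noisy linear data for illustration only.
rng = np.random.default_rng(42)
m = 100
X = 2 * rng.random((m, 1))                # single feature
y = 4 + 3 * X + rng.normal(size=(m, 1))   # targets with Gaussian noise

X_b = np.c_[np.ones((m, 1)), X]           # prepend x0 = 1 for the bias term
theta_hat = np.linalg.inv(X_b.T @ X_b) @ X_b.T @ y   # normal equation
print(theta_hat)                          # roughly [[4.], [3.]]
```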

  10. Normal Equation Complexity. Using the normal equation takes roughly O(mn^2 + n^3) time, where n is the number of features and m is the number of examples: forming X^T · X costs O(mn^2), and inverting it costs about O(n^3). So it scales well (linearly) with a large number of examples, but poorly with a large number of features. Now let's look at other ways of learning θ that may work better when there are many features, or when there are too many training examples to fit in memory.

  11. Gradient Descent. The general idea is to tweak parameters iteratively in order to minimize a cost function. Analogy: suppose you are lost in the mountains in a dense fog. You can only feel the slope of the ground below your feet. A good strategy to get to the bottom quickly is to go downhill in the direction of the steepest slope. The size of each downhill step is called the learning rate.

  12. Learning Rate Too Small (figure)

  13. Learning Rate Too Big (figure)

  14. Gradient Descent Pitfalls (figure)

  15. Gradient Descent with Feature Scaling. When all features have a similar scale, GD tends to converge quickly.
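
As a hedged illustration, features can be standardized before gradient descent with sklearn's StandardScaler; the pipeline and SGDRegressor settings below are assumptions, not from the slide.

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import SGDRegressor

# Standardize features so they share a similar scale before running GD.
model = make_pipeline(StandardScaler(), SGDRegressor(max_iter=1000, eta0=0.01))
model.fit(X, y.ravel())   # X, y: training data, e.g. from the earlier sketch
```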

  16. Batch Gradient Descent. The gradient vector of the MSE cost function is ∇_θ MSE(θ) = (2/m) X^T (Xθ − y); it is something we already computed for the normal equation! Gradient descent step: θ^(next step) = θ − η · ∇_θ MSE(θ), where η is the learning rate.
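
A minimal batch gradient descent sketch using the gradient formula above; the learning rate, the iteration count, and the reuse of X_b and y from the normal-equation sketch are assumptions.

```python
import numpy as np

# Batch GD for linear regression; X_b is the training matrix with the
# bias column added, as in the normal-equation sketch above.
eta = 0.1            # learning rate (assumed value)
n_iterations = 1000
m = len(X_b)

theta = np.random.randn(X_b.shape[1], 1)             # random initialization
for _ in range(n_iterations):
    gradients = (2 / m) * X_b.T @ (X_b @ theta - y)  # gradient of MSE(theta)
    theta = theta - eta * gradients                  # one descent step
```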

  17. Batch Gradient Descent: Learning Rates (figure)

  18. Other Gradient Descent Algorithms. Batch GD: uses the whole training set to compute each gradient. Stochastic GD: uses one random training example to compute each gradient. Mini-batch GD: uses a small random set of training examples (a mini-batch) to compute each gradient.
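
A minimal stochastic GD sketch along these lines; the learning schedule, its constants t0 and t1, and the epoch count are assumptions, not from the slides.

```python
import numpy as np

# Stochastic GD: each step uses one randomly chosen training example.
n_epochs, t0, t1 = 50, 5, 50
m = len(X_b)                                   # X_b, y as in the earlier sketches

def learning_schedule(t):
    return t0 / (t + t1)                       # gradually shrink the learning rate

theta = np.random.randn(X_b.shape[1], 1)
for epoch in range(n_epochs):
    for i in range(m):
        j = np.random.randint(m)               # pick one example at random
        xj, yj = X_b[j:j + 1], y[j:j + 1]
        gradients = 2 * xj.T @ (xj @ theta - yj)   # gradient on that example
        eta = learning_schedule(epoch * m + i)
        theta = theta - eta * gradients
```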

  19. Comparing Linear Regression Algorithms (comparison table)

  20. Polynomial Regression. Univariate polynomial: ŷ = θ_0 + θ_1 x + ... + θ_d x^d. Multivariate polynomial: more complex. E.g., for degree 2 and 2 variables, ŷ = θ_0 + θ_1 x_1 + θ_2 x_2 + θ_3 x_1 x_2 + θ_4 x_1^2 + θ_5 x_2^2. In general, for degree d and n variables, there are C(n + d, d) = (n + d)! / (d! n!) terms.¹ Clearly, polynomial regression is more general than linear regression and can fit non-linear data. ¹ https://mathoverflow.net/questions/225953/number-of-polynomial-terms-for-certain-degree-and-certain-number-of-variables

  21. Polynomial Regression Learning. We may transform the given attributes to obtain additional higher-degree attributes; then the problem boils down to a linear regression problem. The following data is generated from y = 0.5 x_1^2 + 1.0 x_1 + 2.0 + Gaussian noise.

  22. Polynomial Regression Learning. We first transform x_1 into a degree-2 polynomial feature, then fit the data to learn ŷ = 0.56 x_1^2 + 0.93 x_1 + 1.78. The PolynomialFeatures class in sklearn can produce all C(n + d, d) polynomial features. Of course, in practice we would not know what degree the data was generated from.
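
A minimal sketch of this transform-then-fit approach; the generated data mimics the slide's stated process y = 0.5 x_1^2 + 1.0 x_1 + 2.0 + noise, but the x-range, sample size, and seed are assumptions.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)
m = 100
X = 6 * rng.random((m, 1)) - 3                       # x1 in [-3, 3), assumed range
y = 0.5 * X**2 + 1.0 * X + 2.0 + rng.normal(size=(m, 1))

poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)                       # columns: [x1, x1^2]

lin_reg = LinearRegression().fit(X_poly, y)
print(lin_reg.intercept_, lin_reg.coef_)             # roughly 2.0, [1.0, 0.5]
```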

  23. Polynomial Regression Learning: Overfitting. Generally, the higher the polynomial degree, the better the model fits the training data. The danger is overfitting, which makes the model generalize poorly to testing/future data.

  24. Learning Curves: Underfitting. Adding more training examples will not help underfitting. Instead, we need to use a more complex model.

  25. Learning Curves: Overfitting. Adding more training examples may help with overfitting. Another way to battle overfitting is regularization.
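
A minimal sketch of computing learning curves with sklearn's learning_curve helper; the estimator, scoring choice, and train sizes are assumptions, and X_poly, y come from the polynomial sketch above.

```python
import numpy as np
from sklearn.model_selection import learning_curve
from sklearn.linear_model import LinearRegression

# Train vs. validation RMSE at increasing training-set sizes.
train_sizes, train_scores, valid_scores = learning_curve(
    LinearRegression(), X_poly, y.ravel(),
    train_sizes=np.linspace(0.1, 1.0, 10),
    cv=5, scoring="neg_mean_squared_error")

train_rmse = np.sqrt(-train_scores.mean(axis=1))
valid_rmse = np.sqrt(-valid_scores.mean(axis=1))
# A large gap between valid_rmse and train_rmse suggests overfitting;
# two high, close curves suggest underfitting.
```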

  26. Ridge Regression. To regularize a model is to constrain it: the less freedom it has, the harder it is for it to overfit. For linear regression, regularization is typically achieved by constraining the weights (the θ's) of the model. The first way to constrain the weights is Ridge regression, which simply adds (α/2) ∑_{i=1}^{n} θ_i^2 to the cost function: J(θ) = MSE(θ) + (α/2) ∑_{i=1}^{n} θ_i^2. Notice that θ_0 is NOT constrained. Remember to scale the data before using Ridge regression. α is a hyperparameter: a bigger α results in a flatter, smoother model.

  27. Ridge Regression: effect of α (figure)

  28. Ridge Regression. Closed-form solution: θ̂ = (X^T · X + α A)^(−1) · X^T · y, where A is the (n + 1) × (n + 1) identity matrix except that the top-left cell is 0 (so θ_0 is not regularized). In sklearn: from sklearn.linear_model import Ridge. Stochastic GD uses the gradient of the regularized cost, ∇_θ J(θ) = (2/m) X^T (Xθ − y) + α θ, with θ_0 left out of the penalty term. In sklearn: SGDRegressor(penalty="l2").
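
A minimal sketch of the two Ridge options named on the slide; the alpha values and SGD settings are assumptions, and X, y are any training data, e.g. from the sketches above.

```python
from sklearn.linear_model import Ridge, SGDRegressor

ridge_reg = Ridge(alpha=1.0, solver="cholesky")   # closed-form (matrix) solution
ridge_reg.fit(X, y)

sgd_reg = SGDRegressor(penalty="l2", alpha=0.1, max_iter=1000)  # l2 = Ridge-style penalty
sgd_reg.fit(X, y.ravel())
```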

  29. Lasso Regression. The second way to constrain the weights is Lasso regression (least absolute shrinkage and selection operator). It adds an ℓ1 norm, instead of the ℓ2 norm used in Ridge, to the cost function: J(θ) = MSE(θ) + α ∑_{i=1}^{n} |θ_i|. (In general, the p-norm is ||x||_p = (∑_{i=1}^{n} |x_i|^p)^(1/p).) Lasso tends to completely eliminate the weights of the least important features (i.e., set them to 0), so it automatically performs feature selection. The role of α is the same as in Ridge regression.

  30. Lasso Regression (figure)

  31. Lasso Regression. Closed-form solution: does not exist, because J(θ) is not differentiable at θ_i = 0. Stochastic GD can use a subgradient: ∇_θ J(θ) = (2/m) X^T (Xθ − y) + α · sign(θ), where sign(θ_i) = −1 if θ_i < 0; 0 if θ_i = 0; and 1 if θ_i > 0. In sklearn: SGDRegressor(penalty="l1"), or from sklearn.linear_model import Lasso.
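
A minimal Lasso sketch mirroring the Ridge one; the alpha values are assumptions, and X, y are reused training data from the sketches above.

```python
from sklearn.linear_model import Lasso, SGDRegressor

lasso_reg = Lasso(alpha=0.1)
lasso_reg.fit(X, y.ravel())
print(lasso_reg.coef_)   # weights of unimportant features tend to be exactly 0

sgd_reg = SGDRegressor(penalty="l1", alpha=0.1, max_iter=1000)  # l1 = Lasso-style penalty
sgd_reg.fit(X, y.ravel())
```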

  32. Elastic Net. The last way to constrain the weights is Elastic Net, a combination of Ridge and Lasso. It combines both penalty terms: J(θ) = MSE(θ) + r α ∑_{i=1}^{n} |θ_i| + ((1 − r)/2) α ∑_{i=1}^{n} θ_i^2, where r is the mix ratio. When to use which? Ridge is a good default. If you suspect some features are not useful, use Lasso or Elastic Net. When there are more features than training examples, prefer Elastic Net.
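
A minimal Elastic Net sketch; alpha and l1_ratio (which plays the role of the mix ratio r above) are assumed values.

```python
from sklearn.linear_model import ElasticNet

elastic_net = ElasticNet(alpha=0.1, l1_ratio=0.5)   # l1_ratio ~ mix ratio r
elastic_net.fit(X, y.ravel())                       # X, y as in the earlier sketches
```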

  33. Logistic Regression. Logistic regression outputs the probability that an example belongs to the positive class, which can be used to predict a class for binary classification problems. Although it is called regression, it is often used for binary classification. Multi-class classification? Softmax regression (next). The logistic regression model: p̂ = h_θ(x) = σ(θ^T · x), where σ(t) = 1 / (1 + e^(−t)) is the logistic function.

  34. Logistic Regression. Once p̂ is computed for an example x, the model predicts the class ŷ to be 0 if p̂ < 0.5, and 1 otherwise. In other words, it predicts ŷ to be 0 if θ^T · x is negative, and 1 otherwise. Now we design a cost function, first for one single training example. For positive examples, we want the cost to be small when the predicted probability is big; for negative examples, when it is small. Cost function per single training example: c(θ) = −log(p̂) if y = 1, and −log(1 − p̂) if y = 0. Overall cost function (log loss): J(θ) = −(1/m) ∑_{i=1}^{m} [ y^(i) log(p̂^(i)) + (1 − y^(i)) log(1 − p̂^(i)) ].
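
A minimal logistic regression sketch on a binary task; the iris "virginica vs. the rest" setup, using petal width as the single feature, is an assumption for illustration, not taken from the slides.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

iris = load_iris()
X = iris.data[:, 3:]                      # petal width as the single feature (assumed setup)
y = (iris.target == 2).astype(int)        # 1 if Iris virginica, else 0

log_reg = LogisticRegression()
log_reg.fit(X, y)

print(log_reg.predict_proba([[1.7]]))     # estimated [P(class 0), P(class 1)]
print(log_reg.predict([[1.7]]))           # predicted class (1 if p_hat >= 0.5)
```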
