

  1. Machine Learning (CSE 446): Learning as Minimizing Loss: Regularization and Gradient Descent
     Sham M Kakade
     © 2018 University of Washington
     cse446-staff@cs.washington.edu

  2. Announcements
     ◮ Assignment 2 due tomorrow.
     ◮ Midterm: Wednesday, Feb 7th.
     ◮ Quiz section: review.
     ◮ Today: Regularization and Optimization!

  3. Review

  4. Relax!
     ◮ The mis-classification optimization problem:
       $\min_w \frac{1}{N} \sum_{n=1}^{N} \mathbf{1}[\, y_n (w \cdot x_n) \le 0 \,]$
     ◮ Instead, use a loss function $\ell(y_n, w \cdot x_n)$ and solve a relaxation:
       $\min_w \frac{1}{N} \sum_{n=1}^{N} \ell(y_n, w \cdot x_n)$

  5. Relax!
     ◮ The mis-classification optimization problem:
       $\min_w \frac{1}{N} \sum_{n=1}^{N} \mathbf{1}[\, y_n (w \cdot x_n) \le 0 \,]$
     ◮ Instead, use a loss function $\ell(y_n, w \cdot x_n)$ and solve a relaxation:
       $\min_w \frac{1}{N} \sum_{n=1}^{N} \ell(y_n, w \cdot x_n)$
     ◮ What do we want? How do we get it? Speed? Accuracy?

  6. Some loss functions:
     ◮ The square loss: $\ell(y, w \cdot x) = (y - w \cdot x)^2$
     ◮ The logistic loss: $\ell_{\text{logistic}}(y, w \cdot x) = \log(1 + \exp(-y\, w \cdot x))$
     ◮ They both "upper bound" the mistake rate.
     ◮ Instead, let's also care about "regression," where y is real-valued.
     ◮ What if we have multiple classes (not just binary classification)?
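     As a quick illustration (not from the slides), here is a minimal NumPy sketch of these two losses; the function names and the numerical-stability trick via np.logaddexp are our own choices.

        import numpy as np

        def square_loss(y, score):
            """Square loss (y - w.x)^2, where score = w . x."""
            return (y - score) ** 2

        def logistic_loss(y, score):
            """Logistic loss log(1 + exp(-y * w.x)) for labels y in {-1, +1}."""
            # logaddexp(0, z) = log(1 + exp(z)), computed in a numerically stable way.
            return np.logaddexp(0.0, -y * score)

        # A misclassified example (y * score <= 0): both losses stay bounded away from
        # zero, which is the sense in which they "upper bound" the 0/1 mistake indicator.
        y, score = 1.0, -0.5
        print(square_loss(y, score))    # 2.25
        print(logistic_loss(y, score))  # ~0.974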

  7. Least squares: let's minimize it!
     ◮ The optimization problem:
       $\min_w \frac{1}{N} \sum_{n=1}^{N} (y_n - w \cdot x_n)^2 = \min_w \frac{1}{N} \|Y - Xw\|^2$
       where Y is an n-vector and X is our n × d data matrix.
     ◮ The solution is the least squares estimator:
       $w_{\text{least squares}} = (X^\top X)^{-1} X^\top Y$
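     A minimal NumPy sketch of this closed form on synthetic data (the data and variable names are ours, not the course's); solving the normal equations is preferable to forming the inverse explicitly.

        import numpy as np

        rng = np.random.default_rng(0)
        n, d = 100, 5
        X = rng.normal(size=(n, d))                 # the n x d data matrix
        w_true = rng.normal(size=d)
        Y = X @ w_true + 0.1 * rng.normal(size=n)   # noisy real-valued targets

        # w = (X^T X)^{-1} X^T Y, computed by solving the normal equations
        # rather than explicitly inverting X^T X.
        w_ls = np.linalg.solve(X.T @ X, X.T @ Y)
        print(np.linalg.norm(w_ls - w_true))        # small: close to the true weights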

  8. Matrix calculus proof: scratch space

  9. Matrix calculus proof: scratch space
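     The slides leave the derivation as scratch space; a standard sketch of the matrix-calculus argument (our reconstruction, not the slides') is:

        \nabla_w \|Y - Xw\|^2
          = \nabla_w \big( Y^\top Y - 2\, w^\top X^\top Y + w^\top X^\top X w \big)
          = -2\, X^\top Y + 2\, X^\top X w .

        \text{Setting the gradient to zero gives the normal equations } X^\top X\, w = X^\top Y,
        \text{ hence } w_{\text{least squares}} = (X^\top X)^{-1} X^\top Y
        \text{ whenever } X^\top X \text{ is invertible.}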

  10. Let's remember our linear system solving!

  11. Today

  12. Least squares: What could go wrong?!
      ◮ The optimization problem:
        $\min_w \frac{1}{N} \sum_{n=1}^{N} (y_n - w \cdot x_n)^2 = \min_w \frac{1}{N} \|Y - Xw\|^2$
        where Y is an n-vector and X is our n × d data matrix.
      ◮ The solution is the least squares estimator:
        $w_{\text{least squares}} = (X^\top X)^{-1} X^\top Y$

  13. Least squares: What could go wrong?!
      ◮ The optimization problem:
        $\min_w \frac{1}{N} \sum_{n=1}^{N} (y_n - w \cdot x_n)^2 = \min_w \frac{1}{N} \|Y - Xw\|^2$
        where Y is an n-vector and X is our n × d data matrix.
      ◮ The solution is the least squares estimator:
        $w_{\text{least squares}} = (X^\top X)^{-1} X^\top Y$
      ◮ What if d is bigger than n? Even if not?

  14. What could go wrong?
      Suppose d > n. What about n > d?

  15. What could go wrong?
      Suppose d > n. What about n > d?
      ◮ What happens if features are very correlated? (e.g., rows/columns in our matrix are collinear.)
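     A quick NumPy illustration (ours, with made-up data) of why the closed form breaks down: when d > n, or when columns are collinear, X^T X is singular (or nearly so) and cannot be inverted reliably.

        import numpy as np

        rng = np.random.default_rng(0)

        # Case 1: d > n, so X^T X (d x d) has rank at most n < d and is singular.
        X_wide = rng.normal(size=(5, 10))
        print(np.linalg.matrix_rank(X_wide.T @ X_wide))   # 5, not 10

        # Case 2: n > d but two columns are collinear; X^T X is again singular.
        X_tall = rng.normal(size=(100, 3))
        X_tall[:, 2] = 2.0 * X_tall[:, 1]                 # duplicate (scaled) feature
        print(np.linalg.cond(X_tall.T @ X_tall))          # enormous condition number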

  16. Linear system solving: scratch space

  17. A fix: Regularization
      ◮ Regularize the optimization problem:
        $\min_w \frac{1}{N} \sum_{n=1}^{N} (y_n - w \cdot x_n)^2 + \lambda \|w\|^2 = \min_w \frac{1}{N} \|Y - Xw\|^2 + \lambda \|w\|^2$
      ◮ This particular case: "ridge" regression, a.k.a. Tikhonov regularization.
      ◮ The solution is the ridge regression estimator:
        $w_{\text{ridge}} = \left( \frac{1}{N} X^\top X + \lambda I \right)^{-1} \frac{1}{N} X^\top Y$
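     A minimal NumPy sketch of this ridge estimator (our code; the lambda value is chosen arbitrarily). Adding λI makes the system well-posed even when X^T X is singular, as in the collinear example above.

        import numpy as np

        def ridge_fit(X, Y, lam):
            """Closed-form ridge estimator: ((1/N) X^T X + lam * I)^{-1} (1/N) X^T Y."""
            n, d = X.shape
            A = (X.T @ X) / n + lam * np.eye(d)
            b = (X.T @ Y) / n
            return np.linalg.solve(A, b)

        rng = np.random.default_rng(0)
        X = rng.normal(size=(100, 3))
        X[:, 2] = 2.0 * X[:, 1]            # collinear columns: plain least squares fails
        Y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100)
        print(ridge_fit(X, Y, lam=0.1))    # finite, well-defined weights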

  18. The "general" approach
      ◮ The regularized optimization problem:
        $\min_w \frac{1}{N} \sum_{n=1}^{N} \ell(y_n, w \cdot x_n) + R(w)$
      ◮ Penalize some w more than others. Example: $R(w) = \|w\|^2$
      ◮ How do we find a solution quickly?
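     To make the template concrete, here is a small Python sketch (our own, hypothetical helper names) of the regularized objective with the logistic loss and a squared-norm penalty:

        import numpy as np

        def regularized_objective(w, X, Y, loss, R):
            """(1/N) * sum_n loss(y_n, w . x_n) + R(w)."""
            scores = X @ w
            return np.mean(loss(Y, scores)) + R(w)

        def logistic(y, s):
            return np.logaddexp(0.0, -y * s)   # log(1 + exp(-y * s))

        def ridge_penalty(w, lam=0.1):
            return lam * np.dot(w, w)          # R(w) = lambda * ||w||^2

        rng = np.random.default_rng(0)
        X = rng.normal(size=(50, 4))
        Y = np.sign(X @ np.ones(4) + 0.1 * rng.normal(size=50))
        print(regularized_objective(np.zeros(4), X, Y, logistic, ridge_penalty))  # log(2) ≈ 0.693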

  19. Remember: convexity

  20. Gradient Descent
      ◮ Want to solve: $\min_z F(z)$
      ◮ How should we update z?

  21. Gradient Descent
      Data: function $F : \mathbb{R}^d \to \mathbb{R}$, number of iterations K, step sizes $(\eta^{(1)}, \dots, \eta^{(K)})$
      Result: $z \in \mathbb{R}^d$
      initialize: $z^{(0)} = 0$
      for $k \in \{1, \dots, K\}$ do
          $z^{(k)} = z^{(k-1)} - \eta^{(k)} \cdot \nabla_z F(z^{(k-1)})$
      end
      return $z^{(K)}$
      Algorithm 1: GradientDescent
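     A direct Python translation of Algorithm 1 (the constant step-size schedule and the least-squares test function below are our own choices):

        import numpy as np

        def gradient_descent(grad_F, d, K, step_sizes):
            """Algorithm 1: start at z^(0) = 0 and take K gradient steps."""
            z = np.zeros(d)
            for k in range(K):
                z = z - step_sizes[k] * grad_F(z)
            return z

        # Example: minimize F(z) = (1/N) ||Y - X z||^2, whose gradient is (2/N) X^T (X z - Y).
        rng = np.random.default_rng(0)
        X = rng.normal(size=(100, 5))
        Y = X @ np.arange(1.0, 6.0) + 0.1 * rng.normal(size=100)
        grad_F = lambda z: 2.0 * X.T @ (X @ z - Y) / len(Y)

        z_hat = gradient_descent(grad_F, d=5, K=500, step_sizes=[0.1] * 500)
        print(z_hat)   # approaches the least squares solution (≈ [1, 2, 3, 4, 5])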

  22. Gradient Descent: Convergence
      ◮ Let $z^* = \arg\min_z F(z)$ denote the global minimum.
      ◮ Let $z^{(k)}$ be our parameter after k updates.
      ◮ Thm: Suppose F is convex and "L-smooth". Using a fixed step size $\eta \le \frac{1}{L}$, we have:
        $F(z^{(k)}) - F(z^*) \le \frac{\|z^{(0)} - z^*\|^2}{\eta \cdot k}$
      ◮ That is, the convergence rate is $O(\frac{1}{k})$.

  23. Smoothness and Gradient Descent Convergence
      ◮ Smooth functions: for all z, z′,
        $\|\nabla F(z) - \nabla F(z')\| \le L \|z - z'\|$
      ◮ Proof idea:
        1. If our gradient is large, we will make good progress decreasing our function value.
        2. If our gradient is small, we must have value near the optimal value.
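     The first step of the proof idea is the standard "descent lemma" (our reconstruction; the slides leave the details blank): L-smoothness gives a quadratic upper bound, and plugging in a gradient step with $\eta \le \frac{1}{L}$ yields guaranteed progress:

        F(z') \le F(z) + \nabla F(z) \cdot (z' - z) + \frac{L}{2} \|z' - z\|^2
        \quad \text{for all } z, z'.

        \text{With } z' = z - \eta \nabla F(z) \text{ and } \eta \le \tfrac{1}{L}:
        \quad F\big(z - \eta \nabla F(z)\big) \le F(z) - \frac{\eta}{2} \|\nabla F(z)\|^2 .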
