Machine Learning (CSE 446): Learning as Minimizing Loss: Regularization and Gradient Descent Sham M Kakade © 2018 University of Washington cse446-staff@cs.washington.edu 1 / 12
Announcements ◮ Assignment 2 due tomorrow. ◮ Midterm: Weds, Feb 7th. ◮ Quiz section: review ◮ Today: Regularization and Optimization! 1 / 12
Review 1 / 12
Relax! ◮ The mis-classification optimization problem: $\min_w \frac{1}{N} \sum_{n=1}^{N} \mathbf{1}[\, y_n (w \cdot x_n) \le 0 \,]$ ◮ Instead, use a loss function $\ell(y_n, w \cdot x_n)$ and solve a relaxation: $\min_w \frac{1}{N} \sum_{n=1}^{N} \ell(y_n, w \cdot x_n)$ ◮ What do we want? ◮ How do we get it? speed? accuracy? 2 / 12
Some loss functions: ◮ The square loss: $\ell(y, w \cdot x) = (y - w \cdot x)^2$ ◮ The logistic loss: $\ell_{\text{logistic}}(y, w \cdot x) = \log(1 + \exp(-y\, w \cdot x))$. ◮ They both “upper bound” the mistake rate. ◮ Instead, let’s also care about “regression,” where $y$ is real valued. ◮ What if we have multiple classes? (not just binary classification?) 3 / 12
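A minimal sketch (not from the slides; the labels and scores below are made up) comparing the 0-1 mistake indicator with the two surrogate losses, all evaluated from $y$ and the score $w \cdot x$:

```python
import numpy as np

def zero_one_loss(y, score):
    # 1 if the sign of w . x disagrees with y (a mistake), else 0
    return (y * score <= 0).astype(float)

def square_loss(y, score):
    return (y - score) ** 2

def logistic_loss(y, score):
    return np.log1p(np.exp(-y * score))

y = np.array([+1.0, +1.0, -1.0, -1.0])
score = np.array([2.0, -0.5, -1.0, 0.3])   # hypothetical values of w . x

for loss in (zero_one_loss, square_loss, logistic_loss):
    print(loss.__name__, loss(y, score))
# For labels in {-1, +1}, the square loss upper bounds the 0-1 loss point-wise;
# the logistic loss does so after dividing by log 2.
```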
Least squares: let’s minimize it! ◮ The optimization problem: $\min_w \frac{1}{N} \sum_{n=1}^{N} (y_n - w \cdot x_n)^2 = \min_w \frac{1}{N} \|Y - Xw\|^2$ where $Y$ is an $n$-vector and $X$ is our $n \times d$ data matrix. ◮ The solution is the least squares estimator: $w_{\text{least squares}} = (X^\top X)^{-1} X^\top Y$ 4 / 12
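A short sketch of the closed-form estimator on synthetic data (the dataset and noise level are illustrative, not from the lecture); the normal equations are solved directly rather than forming the inverse explicitly:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 5
X = rng.normal(size=(n, d))                  # n x d data matrix
w_true = rng.normal(size=d)
Y = X @ w_true + 0.1 * rng.normal(size=n)    # noisy linear responses

# w = (X^T X)^{-1} X^T Y, computed by solving the normal equations
# (cheaper and more stable than inverting X^T X).
w_ls = np.linalg.solve(X.T @ X, X.T @ Y)
print(np.linalg.norm(w_ls - w_true))         # small: close to the true weights
```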
Matrix calculus proof: scratch space 5 / 12
Let’s remember our linear system solving! 6 / 12
Today 6 / 12
Least squares: What could go wrong?! ◮ The optimization problem: $\min_w \frac{1}{N} \sum_{n=1}^{N} (y_n - w \cdot x_n)^2 = \min_w \frac{1}{N} \|Y - Xw\|^2$ where $Y$ is an $n$-vector and $X$ is our $n \times d$ data matrix. ◮ The solution is the least squares estimator: $w_{\text{least squares}} = (X^\top X)^{-1} X^\top Y$ What if $d$ is bigger than $n$? Even if not? 7 / 12
What could go wrong? Suppose $d > n$: What about $n > d$? ◮ What happens if features are very correlated? (e.g., rows/columns in our matrix are co-linear.) 8 / 12
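A sketch illustrating both failure modes numerically; the matrix sizes and the near-duplicate column are made-up examples:

```python
import numpy as np

rng = np.random.default_rng(0)

# Case 1: d > n. X^T X is d x d but has rank at most n < d, so it is singular
# and (X^T X)^{-1} does not exist.
X_wide = rng.normal(size=(5, 10))
print(np.linalg.matrix_rank(X_wide.T @ X_wide))    # prints 5, not 10

# Case 2: n > d but two columns are nearly co-linear. X^T X is invertible in
# principle, yet so badly conditioned that the solution is numerically unstable.
x1 = rng.normal(size=100)
X_colin = np.column_stack([x1, x1 + 1e-8 * rng.normal(size=100)])
print(np.linalg.cond(X_colin.T @ X_colin))         # an enormous condition number
```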
linear system solving: scratch space 8 / 12
A fix: Regularization ◮ Regularize the optimization problem: $\min_w \frac{1}{N} \sum_{n=1}^{N} (y_n - w \cdot x_n)^2 + \lambda \|w\|^2 = \min_w \frac{1}{N} \|Y - Xw\|^2 + \lambda \|w\|^2$ ◮ This particular case: “Ridge” Regression, Tikhonov regularization ◮ The solution is the ridge regression estimator: $w_{\text{ridge}} = \left( \frac{1}{N} X^\top X + \lambda I \right)^{-1} \frac{1}{N} X^\top Y$ 9 / 12
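A sketch of the ridge formula above on a made-up $d > n$ problem where plain least squares would fail; the function name and data are illustrative:

```python
import numpy as np

def ridge(X, Y, lam):
    """Ridge estimator ((1/N) X^T X + lam * I)^{-1} (1/N) X^T Y."""
    n, d = X.shape
    A = X.T @ X / n + lam * np.eye(d)
    b = X.T @ Y / n
    return np.linalg.solve(A, b)

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 10))       # d > n: (X^T X)^{-1} does not exist
Y = rng.normal(size=5)
print(ridge(X, Y, lam=0.1))        # the regularized system still has a unique solution
```

Adding $\lambda I$ shifts every eigenvalue of $\frac{1}{N} X^\top X$ up by $\lambda > 0$, which is why the matrix being inverted is always non-singular.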
The “general” approach ◮ The regularized optimization problem: $\min_w \frac{1}{N} \sum_{n=1}^{N} \ell(y_n, w \cdot x_n) + R(w)$ ◮ Penalize some $w$ more than others. Example: $R(w) = \|w\|^2$ How do we find a solution quickly? 10 / 12
Remember: convexity 10 / 12
Gradient Descent ◮ Want to solve: $\min_z F(z)$ ◮ How should we update $z$? 11 / 12
Gradient Descent Data: function $F : \mathbb{R}^d \to \mathbb{R}$, number of iterations $K$, step sizes $(\eta^{(1)}, \ldots, \eta^{(K)})$ Result: $z \in \mathbb{R}^d$ initialize: $z^{(0)} = 0$; for $k \in \{1, \ldots, K\}$ do $z^{(k)} = z^{(k-1)} - \eta^{(k)} \cdot \nabla_z F(z^{(k-1)})$; end return $z^{(K)}$; Algorithm 1: GradientDescent 11 / 12
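A direct Python transcription of Algorithm 1, applied to the least squares objective from earlier; the test data, fixed step size, and iteration count are illustrative choices, not part of the slides:

```python
import numpy as np

def gradient_descent(grad_F, d, K, step_sizes):
    """Algorithm 1: z^(k) = z^(k-1) - eta^(k) * grad F(z^(k-1)), starting from 0."""
    z = np.zeros(d)                       # initialize z^(0) = 0
    for k in range(K):
        z = z - step_sizes[k] * grad_F(z)
    return z                              # z^(K)

# Example objective: F(w) = (1/N) ||Y - Xw||^2 with gradient (2/N) X^T (Xw - Y).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
Y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=100)
grad_F = lambda w: (2.0 / len(Y)) * X.T @ (X @ w - Y)

K = 500
w = gradient_descent(grad_F, d=5, K=K, step_sizes=[0.1] * K)
print(np.linalg.norm(X @ w - Y))          # residual shrinks toward the least squares fit
```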
Gradient Descent: Convergence ◮ Letting $z^* = \operatorname{argmin}_z F(z)$ denote the global minimum ◮ Let $z^{(k)}$ be our parameter after $k$ updates. ◮ Thm: Suppose $F$ is convex and “$L$-smooth”. Using a fixed step size $\eta \le \frac{1}{L}$, we have: $F(z^{(k)}) - F(z^*) \le \frac{\|z^{(0)} - z^*\|^2}{\eta \cdot k}$ That is, the convergence rate is $O(\frac{1}{k})$. 12 / 12
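A numerical sanity check (illustrative, not part of the lecture) of the stated bound on a convex quadratic, using the fixed step $\eta = 1/L$:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(20, 5))
H = A.T @ A                                  # F(z) = 0.5 z^T H z is convex and L-smooth
L = np.linalg.eigvalsh(H).max()              # smoothness constant: largest eigenvalue of H
F = lambda z: 0.5 * z @ H @ z
grad = lambda z: H @ z

z_star = np.zeros(5)                         # the global minimizer of this F
z0 = np.ones(5)                              # z^(0)
eta = 1.0 / L
z = z0.copy()
for k in range(1, 101):
    z = z - eta * grad(z)
    bound = np.linalg.norm(z0 - z_star) ** 2 / (eta * k)
    assert F(z) - F(z_star) <= bound         # the theorem's bound holds at every step
print("bound holds for the first 100 iterations")
```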
Smoothness and Gradient Descent Convergence ◮ Smooth functions: for all $z, z'$, $\|\nabla F(z) - \nabla F(z')\| \le L \|z - z'\|$ ◮ Proof idea: 1. If our gradient is large, we will make good progress decreasing our function value. 2. If our gradient is small, our value must already be near the optimal value. 12 / 12