

  1. Stochastic Gradient Descent 10701 Recitations 3 Mu Li Computer Science Department Carnegie Mellon University February 5, 2013

  2. The problem
  ◮ A typical machine learning problem has a penalty/regularizer + loss form $$\min_w F(w) = g(w) + \frac{1}{n} \sum_{i=1}^n f(w; y_i, x_i),$$ where $x_i, w \in \mathbb{R}^p$, $y_i \in \mathbb{R}$, and both $g$ and $f$ are convex
  ◮ Today we only consider differentiable $f$, and let $g = 0$ for simplicity
  ◮ For example, letting $f(w; y_i, x_i) = -\log p(y_i \mid x_i, w)$, we are trying to maximize the log likelihood, which is $$\max_w \frac{1}{n} \sum_{i=1}^n \log p(y_i \mid x_i, w)$$
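The log-likelihood example can be made concrete. A minimal sketch, assuming a logistic model $p(y \mid x, w) = \sigma(y \langle w, x \rangle)$ with labels $y \in \{-1, +1\}$; the model choice and the toy data are illustrative, not from the slides:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def F(w, X, y):
    """Objective F(w) = g(w) + (1/n) * sum_i f(w; y_i, x_i), with g = 0
    and f(w; y_i, x_i) = -log p(y_i | x_i, w) for an (assumed)
    logistic model p(y | x, w) = sigmoid(y * <w, x>), y in {-1, +1}."""
    losses = -np.log(sigmoid(y * (X @ w)))  # per-sample negative log-likelihood
    return losses.mean()                    # g(w) = 0 for simplicity

# Tiny illustrative dataset
X = np.array([[1.0, 0.0], [0.0, 1.0]])
y = np.array([1.0, -1.0])
val = F(np.zeros(2), X, y)
```

With $w = 0$ every sample gets probability $1/2$, so the objective evaluates to $\log 2$.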

  3. Gradient Descent
  ◮ Choose an initial $w^{(0)}$ and repeat $$w^{(t+1)} = w^{(t)} - \eta_t \cdot \nabla F(w^{(t)})$$ until stop (a two-dimensional example is shown as a figure)
  ◮ $\eta_t$ is the learning rate, and $\nabla F(w^{(t)}) = \frac{1}{n} \sum_i \nabla_w f(w^{(t)}; y_i, x_i)$
  ◮ How to stop? $\| w^{(t+1)} - w^{(t)} \| \le \epsilon$ or $\| \nabla F(w^{(t)}) \| \le \epsilon$
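The update and the gradient-norm stopping rule above can be sketched as follows; the fixed learning rate and the quadratic test function are illustrative choices, not from the slides:

```python
import numpy as np

def gradient_descent(grad_F, w0, eta=0.1, eps=1e-6, max_iter=1000):
    """Gradient descent with a fixed learning rate eta, stopping when
    the gradient norm drops below eps (one of the slide's two rules)."""
    w = w0
    for _ in range(max_iter):
        g = grad_F(w)
        if np.linalg.norm(g) <= eps:
            break
        w = w - eta * g
    return w

# Toy objective F(w) = 0.5 * ||w - 1||^2, so grad F(w) = w - 1
# and the minimizer is w = (1, 1).
w_star = gradient_descent(lambda w: w - 1.0, np.zeros(2))
```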

  4. Learning rate matters [Figures: with $\eta_t$ too small, still far from the optimum after 100 iterations; $\eta_t = t$ is too big]

  5. Backtracking line search Adaptively choose the learning rate
  ◮ Choose a parameter $0 < \beta < 1$
  ◮ Start with $\eta = 1$; repeat for $t = 0, 1, \ldots$:
  ◮ While $F(w^{(t)} - \eta \nabla F(w^{(t)})) > F(w^{(t)}) - \frac{\eta}{2} \| \nabla F(w^{(t)}) \|^2$, update $\eta = \beta \eta$
  ◮ $w^{(t+1)} = w^{(t)} - \eta \nabla F(w^{(t)})$
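A single backtracking step can be sketched as below; the quadratic test function and its constants are illustrative choices, not from the slides:

```python
import numpy as np

def backtracking_step(F, grad_F, w, beta=0.8):
    """One gradient step with backtracking line search: shrink eta
    by beta until the sufficient-decrease condition
    F(w - eta*g) <= F(w) - (eta/2)*||g||^2 holds."""
    g = grad_F(w)
    eta = 1.0
    while F(w - eta * g) > F(w) - 0.5 * eta * np.dot(g, g):
        eta *= beta
    return w - eta * g, eta

# Toy quadratic F(w) = 2*||w||^2, grad F(w) = 4*w; the full step
# eta = 1 overshoots, so backtracking must shrink eta.
F = lambda w: 2.0 * np.dot(w, w)
gF = lambda w: 4.0 * w
w_new, eta = backtracking_step(F, gF, np.array([1.0]))
```

For this function the condition accepts any $\eta \le 1/4$, so the loop stops at the first power of $\beta$ below that threshold.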

  6. Backtracking line search [Figure: a typical choice $\beta = 0.8$, converged after 13 iterations]

  7. Stochastic Gradient Descent
  ◮ We call $\frac{1}{n} \sum_i f(w; y_i, x_i)$ the empirical loss; the thing we actually hope to minimize is the expected loss $f(w) = \mathbb{E}_{y_i, x_i} f(w; y_i, x_i)$
  ◮ Suppose we receive an infinite stream of samples $(y_t, x_t)$ from the distribution; one way to optimize the objective is $$w^{(t+1)} = w^{(t)} - \eta_t \nabla_w f(w^{(t)}; y_t, x_t)$$
  ◮ In practice, we simulate the stream by randomly picking $(y_t, x_t)$ from the samples we have
  ◮ Compare this with the average gradient used by GD, $\frac{1}{n} \sum_i \nabla_w f(w^{(t)}; y_i, x_i)$: SGD uses the gradient of a single sample instead
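The single-sample update, with the stream simulated by random draws from a fixed dataset, can be sketched as follows; the squared loss, the noiseless synthetic data, and the constant learning rate are illustrative choices, not from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)

# Noiseless synthetic data: y_i = <w_true, x_i>, squared loss as f.
w_true = np.array([1.0, -2.0])
X = rng.normal(size=(500, 2))
y = X @ w_true

def grad_f(w, xi, yi):
    """Gradient of f(w; y_i, x_i) = 0.5 * (<w, x_i> - y_i)^2."""
    return (xi @ w - yi) * xi

eta = 0.05  # small constant rate; the slides use a schedule eta_t
w = np.zeros(2)
for t in range(5000):
    i = rng.integers(len(y))              # simulate the stream
    w = w - eta * grad_f(w, X[i], y[i])   # one-sample SGD step
```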

  8. More about SGD
  ◮ The objective does not always decrease at each step
  ◮ Compared to GD, SGD needs more steps, but each step is cheaper
  ◮ Mini-batching, say picking 100 samples and averaging their gradients, may accelerate convergence
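The mini-batch variant replaces the single-sample gradient with an average over a small random batch. A sketch, again assuming a squared loss and noiseless synthetic data (illustrative choices, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(1)
w_true = np.array([0.5, 1.5])
X = rng.normal(size=(1000, 2))
y = X @ w_true

def minibatch_sgd(X, y, batch=100, eta=0.1, steps=500):
    """Mini-batch SGD on f(w; y_i, x_i) = 0.5 * (<w, x_i> - y_i)^2:
    each step averages the gradient over `batch` random samples."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        idx = rng.integers(len(y), size=batch)
        Xb, yb = X[idx], y[idx]
        grad = Xb.T @ (Xb @ w - yb) / batch  # averaged gradient
        w = w - eta * grad
    return w

w = minibatch_sgd(X, y)
```

Averaging over 100 samples reduces the variance of the gradient estimate, which is why a larger batch can tolerate a larger step size.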

  9. Relation to Perceptron
  ◮ Recall the Perceptron: initialize $w$, then repeat $$w = w + \begin{cases} y_i x_i & \text{if } y_i \langle w, x_i \rangle < 0 \\ 0 & \text{otherwise} \end{cases}$$
  ◮ Fix the learning rate $\eta = 1$ and let $f(w; y_i, x_i) = \max(0, -y_i \langle w, x_i \rangle)$; then $$\nabla_w f(w; y_i, x_i) = \begin{cases} -y_i x_i & \text{if } y_i \langle w, x_i \rangle < 0 \\ 0 & \text{otherwise} \end{cases}$$ so we derive the Perceptron from SGD
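The derivation above can be sketched in code. One caveat: the slide's mistake condition $y_i \langle w, x_i \rangle < 0$ never fires from a zero initialization, so this sketch uses the common $\le 0$ convention instead; the toy data is an illustrative choice:

```python
import numpy as np

def perceptron_as_sgd(X, y, epochs=10):
    """Perceptron recovered as SGD with eta = 1 on the loss
    f(w; y, x) = max(0, -y * <w, x>): the (sub)gradient is -y*x on a
    mistake and 0 otherwise, so the SGD step w <- w - grad is exactly
    the Perceptron update w <- w + y*x on mistakes."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            if yi * (w @ xi) <= 0:  # mistake (<= 0 so w = 0 updates)
                w = w + yi * xi     # SGD step with eta = 1
    return w

# Linearly separable toy data with labels in {-1, +1}
X = np.array([[2.0, 1.0], [1.0, 2.0], [-1.0, -1.5], [-2.0, -0.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w = perceptron_as_sgd(X, y)
```

On separable data the loop stops making updates once every sample has positive margin $y_i \langle w, x_i \rangle > 0$.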

  10. Questions?
