
Surrogate Losses for Online Learning of Stepsizes in Stochastic Non-Convex Optimization



  1. Surrogate Losses for Online Learning of Stepsizes in Stochastic Non-Convex Optimization. Zhenxun Zhuang¹, Ashok Cutkosky², Francesco Orabona¹,³. ¹ Department of Computer Science, Boston University; ² Google; ³ Department of Electrical & Computer Engineering, Boston University.

  2. Convex vs. Non-Convex Functions. [Figure: a convex function and a non-convex function.] Stationary points: $\|\nabla f(x)\| = 0$.

  3. Gradient Descent vs. Stochastic Gradient Descent
     • Gradient Descent: $x_{t+1} = x_t - \eta_t \nabla f(x_t)$
     • SGD: $x_{t+1} = x_t - \eta_t\, g(x_t, \xi_t)$ with $\mathbb{E}_t[g(x_t, \xi_t)] = \nabla f(x_t)$
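
The two update rules differ only in which gradient they use. A minimal sketch, assuming a toy objective $f(x) = \tfrac{1}{2}\|x\|^2$ and Gaussian gradient noise; both choices, and the names `grad_f` and `stochastic_grad`, are illustrative and not from the slides:

```python
import numpy as np

def grad_f(x):
    # Exact gradient of the toy objective f(x) = 0.5 * ||x||^2 (assumed for illustration).
    return x

def stochastic_grad(x, rng, sigma=0.1):
    # Unbiased estimate g(x, xi): the true gradient plus zero-mean Gaussian noise.
    return grad_f(x) + sigma * rng.standard_normal(x.shape)

def gd_step(x, eta):
    # Gradient descent: x_{t+1} = x_t - eta_t * grad f(x_t)
    return x - eta * grad_f(x)

def sgd_step(x, eta, rng):
    # SGD: x_{t+1} = x_t - eta_t * g(x_t, xi_t)
    return x - eta * stochastic_grad(x, rng)

rng = np.random.default_rng(0)
x_gd = x_sgd = np.array([1.0, -2.0])
for _ in range(100):
    x_gd = gd_step(x_gd, eta=0.1)
    x_sgd = sgd_step(x_sgd, eta=0.1, rng=rng)

print(np.linalg.norm(x_gd), np.linalg.norm(x_sgd))
```

With a constant stepsize, the gradient descent iterate keeps shrinking toward the stationary point, while the SGD iterate settles at a noise floor whose size depends on the stepsize and the noise level.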

  4. Curse of Constant Stepsize
     • Ghadimi & Lan (2013): running SGD on $M$-smooth functions with $\eta \le \frac{1}{M}$ and assuming $\mathbb{E}_t\left[\|g(x_t, \xi_t) - \nabla f(x_t)\|^2\right] \le \sigma^2$ yields
       $$\mathbb{E}\left[\|\nabla f(x_i)\|^2\right] \le O\left(\frac{f(x_1) - f^\star}{\eta T} + \eta \sigma^2\right).$$
     • Ward et al. (2018) and Li & Orabona (2019) eliminated the need to know $f^\star$ and $\sigma$ for getting the optimal rate by using AdaGrad global stepsizes.
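
The Ghadimi & Lan bound can be balanced by hand to see why a constant stepsize is hard to tune. A worked sketch, ignoring the constants hidden in $O(\cdot)$ and writing $\Delta$ (a shorthand introduced here, not in the slides) for $f(x_1) - f^\star$:

```latex
B(\eta) = \frac{\Delta}{\eta T} + \eta \sigma^2, \qquad
B'(\eta) = -\frac{\Delta}{\eta^2 T} + \sigma^2 = 0
\;\Longrightarrow\;
\eta^\star = \sqrt{\frac{\Delta}{\sigma^2 T}}, \qquad
B(\eta^\star) = 2\sigma \sqrt{\frac{\Delta}{T}}.
```

Taking the constraint $\eta \le 1/M$ into account, the balanced choice is $\min\{1/M, \eta^\star\}$. Either way, picking it requires $\Delta$ (hence $f^\star$) and $\sigma$, which are typically unknown in practice.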

  5. Transform Non-Convexity to Convexity by Surrogate Losses
     When the objective function is $M$-smooth, drawing two independent stochastic gradients in each round of SGD, we have (assume for now $\eta_t$ only depends on past gradients):
     $$\begin{aligned}
     \mathbb{E}[f(x_{t+1}) - f(x_t)]
       &\le \mathbb{E}\left[\langle \nabla f(x_t), x_{t+1} - x_t \rangle + \frac{M}{2}\|x_{t+1} - x_t\|^2\right] \\
       &= \mathbb{E}\left[\langle \nabla f(x_t), -\eta_t\, g(x_t, \xi_t) \rangle + \frac{M}{2}\eta_t^2 \|g(x_t, \xi_t)\|^2\right] \\
       &= \mathbb{E}\left[-\eta_t \langle g(x_t, \xi_t), g(x_t, \xi'_t) \rangle + \frac{M}{2}\eta_t^2 \|g(x_t, \xi_t)\|^2\right].
     \end{aligned}$$
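
The last equality in this derivation is where the second, independent gradient is needed. A short justification, assuming $\eta_t$ is fixed given the past and $\xi_t$, $\xi'_t$ are independent given the past, with $\mathbb{E}_t$ denoting the conditional expectation:

```latex
\begin{aligned}
\mathbb{E}_t\big[\eta_t \langle \nabla f(x_t), g(x_t, \xi_t) \rangle\big]
  &= \eta_t \langle \nabla f(x_t), \mathbb{E}_t[g(x_t, \xi_t)] \rangle
   = \eta_t \|\nabla f(x_t)\|^2, \\
\mathbb{E}_t\big[\eta_t \langle g(x_t, \xi_t), g(x_t, \xi'_t) \rangle\big]
  &= \eta_t \langle \mathbb{E}_t[g(x_t, \xi_t)], \mathbb{E}_t[g(x_t, \xi'_t)] \rangle
   = \eta_t \|\nabla f(x_t)\|^2.
\end{aligned}
```

Both expressions have the same conditional expectation, so one can replace the other inside the total expectation. Reusing the same sample twice would instead give $\eta_t\, \mathbb{E}_t\|g(x_t, \xi_t)\|^2$, which exceeds $\eta_t \|\nabla f(x_t)\|^2$ by the gradient variance.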

  6. Transform Non-Convexity to Convexity by Surrogate Losses
     We define the surrogate loss for $f$ at round $t$ as
     $$\ell_t(\eta) \triangleq -\eta \langle g(x_t, \xi_t), g(x_t, \xi'_t) \rangle + \frac{M}{2}\eta^2 \|g(x_t, \xi_t)\|^2.$$
     The inequality on the previous slide becomes $\mathbb{E}[f(x_{t+1}) - f(x_t)] \le \mathbb{E}[\ell_t(\eta_t)]$, which, after summing from $t = 1$ to $T$, gives us:
     $$f^\star - f(x_1) \le \underbrace{\sum_{t=1}^{T} \mathbb{E}[\ell_t(\eta_t) - \ell_t(\eta)]}_{\text{Regret of } \eta_t \text{ wrt optimal } \eta} + \underbrace{\sum_{t=1}^{T} \mathbb{E}[\ell_t(\eta)]}_{\text{Cumulative loss of optimal } \eta}.$$
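
The surrogate loss is a one-dimensional quadratic in $\eta$ with a non-negative leading coefficient, so it is convex even when $f$ is not. A small sketch of the loss and its unconstrained minimizer; the function names and the example vectors are hypothetical:

```python
import numpy as np

def surrogate_loss(eta, g, g_prime, M):
    # l_t(eta) = -eta * <g, g'> + (M / 2) * eta^2 * ||g||^2
    return -eta * np.dot(g, g_prime) + 0.5 * M * eta**2 * np.dot(g, g)

def surrogate_minimizer(g, g_prime, M):
    # Set the derivative -<g, g'> + M * eta * ||g||^2 to zero and solve for eta.
    # Assumes g is non-zero; the result is negative when <g, g'> < 0.
    return np.dot(g, g_prime) / (M * np.dot(g, g))

# Hypothetical pair of stochastic gradients and smoothness constant.
g = np.array([1.0, 2.0])
g_prime = np.array([0.5, 2.5])
M = 4.0

eta_star = surrogate_minimizer(g, g_prime, M)
print(eta_star, surrogate_loss(eta_star, g, g_prime, M))
```

Because each $\ell_t$ is convex in the scalar $\eta$, choosing stepsizes becomes a standard online convex optimization problem over a one-dimensional domain.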

  7. SGD with Online Learning
     Algorithm 1: Stochastic Gradient Descent with Online Learning (SGDOL)
     1: Input: $x_1 \in \mathcal{X}$, $M$, an online learning algorithm $\mathcal{A}$
     2: for $t = 1, 2, \dots, T$ do
     3:     Compute $\eta_t$ by running $\mathcal{A}$ on $\ell_i(\eta) = -\eta \langle g(x_i, \xi_i), g(x_i, \xi'_i) \rangle + \frac{M}{2}\eta^2 \|g(x_i, \xi_i)\|^2$, $i = 1, \dots, t-1$
     4:     Receive two independent unbiased estimates of $\nabla f(x_t)$: $g(x_t, \xi_t)$, $g(x_t, \xi'_t)$
     5:     Update $x_{t+1} = x_t - \eta_t g_t$, where $g_t = g(x_t, \xi_t)$
     6: end for
     7: Output: choose an $x_k$ uniformly at random from $x_1, \dots, x_T$.
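
Algorithm 1 leaves the online learner $\mathcal{A}$ open. The sketch below plugs in a regularized follow-the-leader rule purely for illustration: the regularizer weight `alpha`, the stepsize domain $[0, 1/M]$, the toy oracle `noisy_quadratic_grad`, and all function names are assumptions, not details taken from the slides.

```python
import numpy as np

def sgdol(grad_oracle, x1, M, T, rng, alpha=1.0):
    """SGD whose stepsize is chosen by follow-the-leader on past surrogate losses."""
    x = np.asarray(x1, dtype=float)
    sum_inner = 0.0  # running sum of <g_i, g'_i>
    sum_sq = 0.0     # running sum of ||g_i||^2
    iterates, etas = [], []
    for _ in range(T):
        # Minimizer of sum_i l_i(eta) + (M * alpha / 2) * eta^2, clipped to [0, 1/M].
        eta = float(np.clip(sum_inner / (M * (alpha + sum_sq)), 0.0, 1.0 / M))
        g = grad_oracle(x, rng)        # first estimate, used for the update
        g_prime = grad_oracle(x, rng)  # second, independent estimate
        iterates.append(x.copy())
        etas.append(eta)
        x = x - eta * g
        sum_inner += float(np.dot(g, g_prime))
        sum_sq += float(np.dot(g, g))
    k = rng.integers(T)  # index of the uniformly random output iterate
    return iterates[k], iterates, etas

def noisy_quadratic_grad(x, rng, sigma=0.1):
    # Toy oracle: gradient of 0.5 * ||x||^2 (which is 1-smooth) plus Gaussian noise.
    return x + sigma * rng.standard_normal(x.shape)

rng = np.random.default_rng(0)
x_out, iterates, etas = sgdol(noisy_quadratic_grad, [1.0, -2.0], M=1.0, T=200, rng=rng)
print(np.linalg.norm(iterates[0]), np.linalg.norm(iterates[-1]), etas[-1])
```

The first stepsize is zero because no surrogate loss has been observed yet. After that, the stepsize only ever depends on gradients from earlier rounds, which matches the assumption made in the derivation.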

  8. Main Theorem
     Theorem 1: Under some conditions, and with some choice of the online learning algorithm in Algorithm 1, for a smooth function and a uniformly randomly picked $x_k$ from $x_1, \dots, x_T$, we have:
     $$\mathbb{E}\left[\|\nabla f(x_k)\|^2\right] \le \tilde{O}\left(\frac{1}{T} + \frac{\sigma}{\sqrt{T}}\right),$$
     where $\tilde{O}$ hides some logarithmic factors.

  9. Classification Problem. Objective function: $\frac{1}{m}\sum_{i=1}^{m} \phi(a_i^\top x - y_i)$ with $\phi(\theta) = \frac{\theta^2}{1 + \theta^2}$, on the adult (a9a) training dataset.
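
The objective and a minibatch stochastic gradient can be sketched as follows. The synthetic data is only a stand-in for the a9a dataset, and the function names are hypothetical:

```python
import numpy as np

def phi(theta):
    # Non-convex loss phi(theta) = theta^2 / (1 + theta^2)
    return theta**2 / (1.0 + theta**2)

def objective(x, A, y):
    # (1/m) * sum_i phi(a_i^T x - y_i), with the feature vectors a_i as rows of A.
    return np.mean(phi(A @ x - y))

def minibatch_grad(x, A, y, idx):
    # Uses phi'(theta) = 2 * theta / (1 + theta^2)^2 on the sampled rows idx.
    r = A[idx] @ x - y[idx]
    return A[idx].T @ (2.0 * r / (1.0 + r**2) ** 2) / len(idx)

# Synthetic stand-in data (not a9a): 50 examples, 5 features, labels in {-1, +1}.
rng = np.random.default_rng(0)
A = rng.standard_normal((50, 5))
y = rng.choice([-1.0, 1.0], size=50)
x = np.zeros(5)

batch = rng.choice(50, size=10, replace=False)
print(objective(x, A, y), minibatch_grad(x, A, y, batch))
```

Drawing two disjoint or independently sampled minibatches at the same point gives the two independent estimates that Algorithm 1 asks for.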

