

  1. Calculating Hypergradient. Jingchang Liu, November 13, 2019, HKUST.

  2. Table of Contents
     • Background
     • Bilevel optimization
     • Forward and Reverse Gradient-Based Hyperparameter Optimization
     • Conclusion
     • Q & A

  3. Background

  4. Hyperparameter Optimization
     Tradeoff parameter
     • The dataset is split in two: S_train and S_test.
     • Suppose we add the \ell_2 norm as the regularization term; then
         \arg\min_{\lambda \in D} \mathrm{loss}(S_{\mathrm{test}}, X(\lambda))  (1)
         \text{s.t.}\; X(\lambda) \in \arg\min_{x \in \mathbb{R}^p} \mathrm{loss}(S_{\mathrm{train}}, x) + e^{\lambda} \|x\|^2.
     Stepsize
     • For gradient descent with momentum:
         v_t = \mu v_{t-1} + \nabla J_t(w_{t-1}),
         w_t = w_{t-1} - \eta (\mu v_{t-1} + \nabla J_t(w_{t-1})).
       The hyperparameters are \mu and \eta.

  5. Group Lasso
     Traditional Group Lasso
     To induce a group-sparsity effect on the parameter w, we solve
         \hat{w} \in \arg\min_{w \in \mathbb{R}^p} \frac{1}{2}\|y - Xw\|^2 + \lambda \sum_{l=1}^{L} \|w_{\mathcal{G}_l}\|_2,  (2)
     where the features are partitioned into L groups \{\mathcal{G}_1, \mathcal{G}_2, \ldots, \mathcal{G}_L\}.
     • But we need to specify the partition ourselves beforehand.
     • How can we learn the partition?

  6. Group Lasso
     • Encapsulate the group structure in a hyperparameter \theta = [\theta_1, \theta_2, \ldots, \theta_L] \in \{0,1\}^{P \times L}, where L is the maximum number of groups and P is the number of features.
     • \theta_{p,l} = 1 if the p-th feature belongs to the l-th group, and 0 otherwise.
     Formulation for learning \theta:
         \hat{\theta} \in \arg\min_{\theta \in \{0,1\}^{P \times L}} C(\hat{w}(\theta)),  (3)
     where C(\hat{w}(\theta)) can be the validation loss C(\hat{w}(\theta)) = \frac{1}{2}\|y' - X'\hat{w}(\theta)\|^2, and
         \hat{w}(\theta) = \arg\min_{w \in \mathbb{R}^P} \frac{1}{2}\|y - Xw\|^2 + \lambda \sum_{l=1}^{L} \|\theta_l \odot w\|_2.  (4)
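To make the learned-group penalty in (4) concrete, here is a minimal NumPy sketch (our own illustration; the function name group_lasso_penalty and the toy data are ours, not from the slides) of evaluating \lambda \sum_l \|\theta_l \odot w\|_2 for a binary assignment matrix \theta:

```python
import numpy as np

def group_lasso_penalty(w, theta, lam):
    """Evaluate lam * sum_l ||theta_l ⊙ w||_2 for a binary mask theta (P x L)."""
    # theta[:, l] masks out the features assigned to group l.
    return lam * sum(np.linalg.norm(theta[:, l] * w) for l in range(theta.shape[1]))

# Toy check: P = 4 features, L = 2 groups; features {0,1} in group 0, {2,3} in group 1.
theta = np.array([[1, 0], [1, 0], [0, 1], [0, 1]])
w = np.array([3.0, 4.0, 0.0, 0.0])
print(group_lasso_penalty(w, theta, lam=0.1))  # 0.1 * (5.0 + 0.0) = 0.5
```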

  7. Bilevel optimization

  8. Bilevel Optimization
     These examples fit the following general optimization problem:
         \min_x f_U(x, y)
         \text{s.t.}\; y \in \arg\min_{y'} f_L(x, y').  (5)
     • f_U is the upper-level objective, over the two variables x and y.
     • f_L is the lower-level objective, which binds y as a function of x.
     • (5) can simply be viewed as a special case of constrained optimization.
     • If we can obtain the analytic solution y^*(x) for y, then we just need to solve the single-level problem \min_x f_U(x, y^*(x)).

  9. Gradient
     Compute the gradient of the solution to the lower-level problem with respect to the variables in the upper-level problem:
         x = x - \eta \left( \frac{\partial f_U}{\partial x} + \frac{\partial f_U}{\partial y}\,\frac{\partial y}{\partial x} \right) \Big|_{(x, y^*)}.  (6)
     How do we calculate \partial y / \partial x?
     Theorem. Let f : \mathbb{R} \times \mathbb{R} \to \mathbb{R} be a continuous function with first and second derivatives. Let g(x) = \arg\min_y f(x, y). Then the derivative of g with respect to x is
         \frac{dg(x)}{dx} = - \frac{f_{XY}(x, g(x))}{f_{YY}(x, g(x))},  (7)
     where f_{XY} = \frac{\partial^2 f}{\partial x \partial y} and f_{YY} = \frac{\partial^2 f}{\partial y^2}.
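As a sanity check of (7) (our own toy example, not from the slides), take f(x, y) = (y - ax)^2, so g(x) = ax and g'(x) = a; the theorem gives -f_{XY}/f_{YY} = -(-2a)/2 = a. The sketch below confirms this numerically with finite differences:

```python
import numpy as np

a = 3.0
f = lambda x, y: (y - a * x) ** 2
g = lambda x: a * x                      # argmin_y f(x, y), known in closed form here

x0, eps = 1.5, 1e-5
y0 = g(x0)
# Second derivatives of f by central finite differences at (x0, g(x0)).
f_yy = (f(x0, y0 + eps) - 2 * f(x0, y0) + f(x0, y0 - eps)) / eps ** 2
f_xy = (f(x0 + eps, y0 + eps) - f(x0 + eps, y0 - eps)
        - f(x0 - eps, y0 + eps) + f(x0 - eps, y0 - eps)) / (4 * eps ** 2)
print(-f_xy / f_yy)   # ≈ 3.0 = g'(x0), matching the theorem
```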

  10. Proof
      1. Since g(x) = \arg\min_y f(x, y), we get \frac{\partial f(x, y)}{\partial y}\big|_{y = g(x)} = 0.
      2. Differentiating both sides with respect to x, we get \frac{d}{dx} \frac{\partial f(x, g(x))}{\partial y} = 0.
      3. By the chain rule,
         \frac{d}{dx} \frac{\partial f(x, g(x))}{\partial y} = \frac{\partial^2 f(x, g(x))}{\partial x \partial y} + \frac{\partial^2 f(x, g(x))}{\partial y^2}\,\frac{dg(x)}{dx}.  (8)
      Equating to zero and rearranging gives
         \frac{dg(x)}{dx} = - \left( \frac{\partial^2 f(x, g(x))}{\partial y^2} \right)^{-1} \frac{\partial^2 f(x, g(x))}{\partial x \partial y}  (9)
                          = - \frac{f_{XY}(x, g(x))}{f_{YY}(x, g(x))}.  (10)

  11. Lemma
      Lemma 1. Let f : \mathbb{R} \times \mathbb{R}^n \to \mathbb{R} be a continuous function with first and second derivatives. Let g(x) = \arg\min_{y \in \mathbb{R}^n} f(x, y). Then the derivative of g with respect to x is
          g'(x) = - f_{YY}(x, g(x))^{-1} f_{XY}(x, g(x)),  (11)
      where f_{YY} = \nabla^2_{yy} f(x, y) \in \mathbb{R}^{n \times n} and f_{XY} = \frac{\partial}{\partial x} \nabla_y f(x, y) \in \mathbb{R}^n.
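A minimal numerical check of Lemma 1 (again our own example): for ridge regression f(x, y) = \frac{1}{2}\|Ay - b\|^2 + x\|y\|^2 with regularization weight x > 0, we have f_{YY} = A^T A + 2xI and f_{XY} = 2g(x), so the lemma predicts g'(x) = -(A^T A + 2xI)^{-1}\, 2g(x):

```python
import numpy as np

rng = np.random.default_rng(0)
A, b = rng.normal(size=(20, 5)), rng.normal(size=20)

def g(x):
    # Closed-form minimizer of f(x, y) = 0.5 * ||Ay - b||^2 + x * ||y||^2.
    return np.linalg.solve(A.T @ A + 2 * x * np.eye(5), A.T @ b)

x0 = 0.7
f_yy = A.T @ A + 2 * x0 * np.eye(5)              # Hessian of f in y
f_xy = 2 * g(x0)                                 # d/dx of the gradient of f in y
implicit = -np.linalg.solve(f_yy, f_xy)          # Lemma 1: g'(x) = -f_YY^{-1} f_XY
numeric = (g(x0 + 1e-6) - g(x0 - 1e-6)) / 2e-6   # finite-difference check
print(np.allclose(implicit, numeric, atol=1e-5)) # True
```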

  12. Application to hyperparameter optimization (ICML 16)
      Hyperparameter optimization
          \arg\min_{\lambda \in D} \mathrm{loss}(S_{\mathrm{test}}, X(\lambda))  (12)
          \text{s.t.}\; X(\lambda) \in \arg\min_{x \in \mathbb{R}^p} \mathrm{loss}(S_{\mathrm{train}}, x) + e^{\lambda} \|x\|^2.
      Gradient descent for the bilevel problem:
          x = x - \eta \left( \frac{\partial f_U}{\partial x} + \frac{\partial f_U}{\partial y}\,\frac{\partial y}{\partial x} \right) \Big|_{(x, y^*)}  (13)
            = x - \eta \left( \frac{\partial f_U}{\partial x} - \frac{\partial f_U}{\partial y} \left( \frac{\partial^2 f}{\partial y^2} \right)^{-1} \frac{\partial^2 f}{\partial x \partial y} \right).  (14)
      Gradient (with h the outer objective and g the inner objective):
          \nabla f = \nabla_2 h - (\nabla^2_{1,2}\, g)^T (\nabla^2_1 g)^{-1} \nabla_1 h.
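The sketch below (our simplified illustration, not Pedregosa's reference implementation) instantiates this for problem (12) with a squared loss, where everything has a closed form; HOAG itself only requires the inner problem and the linear system to be solved approximately:

```python
import numpy as np

rng = np.random.default_rng(1)
Xtr, ytr = rng.normal(size=(50, 8)), rng.normal(size=50)
Xte, yte = rng.normal(size=(30, 8)), rng.normal(size=30)

def inner_solution(lam):
    # Inner problem of (12): x(λ) = argmin_x loss(S_train, x) + e^λ ||x||^2,
    # with squared loss, solved in closed form here (HOAG only needs an
    # approximate solve with tolerance ε_k).
    return np.linalg.solve(Xtr.T @ Xtr + 2 * np.exp(lam) * np.eye(8), Xtr.T @ ytr)

def hypergradient(lam):
    x = inner_solution(lam)
    grad_h = Xte.T @ (Xte @ x - yte)                    # ∇₁h: test-loss gradient in x
    hess_g = Xtr.T @ Xtr + 2 * np.exp(lam) * np.eye(8)  # ∇²₁g: inner Hessian in x
    cross = 2 * np.exp(lam) * x                         # ∇²_{1,2}g: ∂/∂λ of ∇ₓg
    # ∇f(λ) = ∇₂h − (∇²_{1,2}g)ᵀ (∇²₁g)⁻¹ ∇₁h, and ∇₂h = 0 here.
    return -cross @ np.linalg.solve(hess_g, grad_h)

# Finite-difference check of the hypergradient at λ = 0.3.
f = lambda l: 0.5 * np.linalg.norm(Xte @ inner_solution(l) - yte) ** 2
print(hypergradient(0.3), (f(0.3 + 1e-6) - f(0.3 - 1e-6)) / 2e-6)  # should agree
```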

  13. HOAG

  14. Analysis
      Conclusion
      • If the sequence \{\epsilon_i\}_{i=1}^{\infty} (the tolerances of the inexact lower-level solves) is summable, then this implies convergence to a stationary point of f.
      Theorem. If the sequence \{\epsilon_i\}_{i=1}^{\infty} is positive and satisfies \sum_{i=1}^{\infty} \epsilon_i < \infty, then the sequence \lambda_k of iterates in the HOAG algorithm has a limit \lambda^* \in D. In particular, if \lambda^* belongs to the interior of D, then \nabla f(\lambda^*) = 0.

  15. Forward and Reverse Gradient-Based Hyperparameter Optimization

  16. Formulation I
      • Focus on the training procedure for an objective function J(w) with respect to w.
      • The training procedure of SGD or its variants (momentum, RMSProp, Adam) can be regarded as a dynamical system with a state s_t \in \mathbb{R}^d:
          s_t = \Phi_t(s_{t-1}, \lambda), \quad t = 1, \ldots, T.
      • For gradient descent with momentum:
          v_t = \mu v_{t-1} + \nabla J_t(w_{t-1}),
          w_t = w_{t-1} - \eta (\mu v_{t-1} + \nabla J_t(w_{t-1})).
        1. s_t = (w_t, v_t), s_t \in \mathbb{R}^d.
        2. \lambda = (\mu, \eta), \lambda \in \mathbb{R}^m.
        3. \Phi_t : \mathbb{R}^d \times \mathbb{R}^m \to \mathbb{R}^d.
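A minimal sketch of this dynamical-system view (function names and the toy objective are ours, not from the slides): the state s = (w, v) is packed into a single vector and \Phi is a pure function of (s_{t-1}, \lambda):

```python
import numpy as np

def make_phi(grad_J):
    """Return Φ(s, λ) for gradient descent with momentum, state s = (w, v) stacked."""
    def phi(s, lam):
        mu, eta = lam                     # λ = (μ, η)
        w, v = np.split(s, 2)
        v_new = mu * v + grad_J(w)        # v_t = μ v_{t-1} + ∇J(w_{t-1})
        w_new = w - eta * v_new           # w_t = w_{t-1} − η(μ v_{t-1} + ∇J(w_{t-1}))
        return np.concatenate([w_new, v_new])
    return phi

# Toy run with J(w) = 0.5 * ||w||^2, so ∇J(w) = w.
phi = make_phi(lambda w: w)
s = np.concatenate([np.ones(3), np.zeros(3)])   # s_0 = (w_0, v_0) ∈ R^d
for t in range(100):
    s = phi(s, (0.9, 0.1))                      # s_t = Φ_t(s_{t-1}, λ)
print(np.split(s, 2)[0])                        # w_T has converged near 0
```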

  17. Formulation II
      • The iterates s_1, \ldots, s_T implicitly depend on the vector of hyperparameters \lambda.
      • Goal: optimize the hyperparameters according to a certain error function E evaluated at the last iterate s_T.
      • We wish to solve the problem \min_{\lambda \in \Lambda} f(\lambda), where the set \Lambda \subset \mathbb{R}^m incorporates constraints on the hyperparameters.
      • The response function f : \mathbb{R}^m \to \mathbb{R} is defined for \lambda \in \mathbb{R}^m by f(\lambda) = E(s_T(\lambda)).

  18. Diagram
      Figure 1: The iterates s_1, \ldots, s_T depend on the hyperparameters \lambda.
      • Change the bilevel program to use the parameters at the last iterate s_T rather than \hat{w}:
          \min_{\lambda \in \Lambda} f(\lambda), \quad \text{where } f(\lambda) = E(s_T(\lambda)).
      • The hypergradient is \nabla f(\lambda) = \nabla E(s_T) \frac{d s_T}{d \lambda}.

  19. Forward mode to calculate the hypergradient
      • Chain rule: \nabla f(\lambda) = \nabla E(s_T) \frac{d s_T}{d \lambda}, where \frac{d s_T}{d \lambda} is a d \times m matrix.
      • Since s_t = \Phi_t(s_{t-1}, \lambda), \Phi_t depends on \lambda both directly and indirectly through the state s_{t-1}:
          \frac{d s_t}{d \lambda} = \frac{\partial \Phi_t(s_{t-1}, \lambda)}{\partial s_{t-1}}\,\frac{d s_{t-1}}{d \lambda} + \frac{\partial \Phi_t(s_{t-1}, \lambda)}{\partial \lambda}.
      • Defining Z_t = \frac{d s_t}{d \lambda}, we rewrite this as Z_t = A_t Z_{t-1} + B_t, \quad t \in \{1, \ldots, T\}.
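A minimal sketch of the forward-mode recursion (our toy setup, not the paper's code: plain gradient descent on a quadratic loss, with the learning rate \eta as the single hyperparameter, so m = 1 and Z_t is a vector). Here A_t = I - \eta \nabla^2 J and B_t = -\nabla J(w_{t-1}):

```python
import numpy as np

rng = np.random.default_rng(2)
X, y = rng.normal(size=(40, 5)), rng.normal(size=40)
Xval, yval = rng.normal(size=(20, 5)), rng.normal(size=20)
H = X.T @ X                                   # Hessian of J(w) = 0.5*||Xw - y||^2
grad_J = lambda w: X.T @ (X @ w - y)

def forward_hg(eta, T=50):
    w, Z = np.zeros(5), np.zeros(5)           # Z_t = dw_t/dη (m = 1, so Z is a vector)
    for t in range(T):
        g = grad_J(w)                         # evaluated at w_{t-1}
        A, B = np.eye(5) - eta * H, -g        # A_t = ∂Φ_t/∂s_{t-1}, B_t = ∂Φ_t/∂η
        Z = A @ Z + B                         # Z_t = A_t Z_{t-1} + B_t
        w = w - eta * g                       # s_t = Φ_t(s_{t-1}, λ)
    return w, Z

w, Z = forward_hg(eta=0.01)
hg = Xval.T @ (Xval @ w - yval) @ Z           # ∇f(η) = ∇E(s_T) Z_T, E = validation loss
f = lambda e: 0.5 * np.linalg.norm(Xval @ forward_hg(e)[0] - yval) ** 2
print(hg, (f(0.01 + 1e-7) - f(0.01 - 1e-7)) / 2e-7)   # the two should agree
```

Note that forward mode never stores the trajectory; it propagates the d \times m matrix Z_t alongside the training iterates.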

  20. Forward-mode Recurrence
      Figure 2: Recurrence

  21. Forward-HG algorithm
      Figure 3: Forward-HG algorithm

  22. Reverse mode to calculate the hypergradient
      Reformulate the original problem as the constrained optimization problem
          \min_{\lambda, s_1, \ldots, s_T} E(s_T)
          \text{s.t.}\; s_t = \Phi_t(s_{t-1}, \lambda), \quad t \in \{1, \ldots, T\}.
      Lagrangian:
          \mathcal{L}(s, \lambda, \alpha) = E(s_T) + \sum_{t=1}^{T} \alpha_t (\Phi_t(s_{t-1}, \lambda) - s_t).

  23. Partial derivatives of the Lagrangian

  24. Derivations
      Notation:
          A_t = \frac{\partial \Phi_t(s_{t-1}, \lambda)}{\partial s_{t-1}}, \quad B_t = \frac{\partial \Phi_t(s_{t-1}, \lambda)}{\partial \lambda},
      where A_t \in \mathbb{R}^{d \times d} and B_t \in \mathbb{R}^{d \times m}.
      Setting \frac{\partial \mathcal{L}}{\partial s_t} = 0 for every t (including t = T) gives
          \alpha_t = \begin{cases} \nabla E(s_T) & \text{if } t = T, \\ \nabla E(s_T) A_T \cdots A_{t+1} & \text{if } t \in \{1, \ldots, T-1\}. \end{cases}  (15)
      Since \frac{\partial \mathcal{L}}{\partial \lambda} = \sum_{t=1}^{T} \alpha_t B_t,
          \frac{\partial \mathcal{L}}{\partial \lambda} = \nabla E(s_T) \sum_{t=1}^{T} \left( \prod_{s=t+1}^{T} A_s \right) B_t.
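A minimal reverse-mode sketch on the same toy problem as the forward-mode example above (our illustration, not the paper's code): one forward pass storing the trajectory, then a backward pass that accumulates \alpha_t B_t while updating \alpha_{t-1} = \alpha_t A_t:

```python
import numpy as np

rng = np.random.default_rng(2)
X, y = rng.normal(size=(40, 5)), rng.normal(size=40)
Xval, yval = rng.normal(size=(20, 5)), rng.normal(size=20)
H = X.T @ X
grad_J = lambda w: X.T @ (X @ w - y)

def reverse_hg(eta, T=50):
    # Forward pass: store the whole trajectory w_0, ..., w_{T-1}
    # (the memory cost noted on the Analysis slide below).
    w, trajectory = np.zeros(5), []
    for t in range(T):
        trajectory.append(w)
        w = w - eta * grad_J(w)
    # Backward pass, per (15): start from α_T = ∇E(s_T).
    alpha = Xval.T @ (Xval @ w - yval)        # ∇E(s_T), E = validation loss
    hg = 0.0
    for w_prev in reversed(trajectory):
        hg += alpha @ (-grad_J(w_prev))       # accumulate α_t B_t, B_t = -∇J(w_{t-1})
        alpha = alpha @ (np.eye(5) - eta * H) # α_{t-1} = α_t A_t (A_t constant: J is quadratic)
    return hg

print(reverse_hg(eta=0.01))   # same value as the forward-mode computation above
```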

  25. Reverse-HG algorithm
      Figure 4: Reverse-HG algorithm
      Truncated back-propagation: run the backward pass only from t = T - 1 down to t = T - k.

  26. Real-Time HO
      • For t \in \{1, \ldots, T\}, define f_t(\lambda) = E(s_t(\lambda)).
      • Partial hypergradients are available in forward mode:
          \nabla f_t(\lambda) = \frac{d E(s_t)}{d \lambda} = \nabla E(s_t) Z_t.
      • Significantly, we can update the hyperparameters within a single run, without having to wait until time T.
      Figure 5: The iterates s_1, \ldots, s_T depend on the hyperparameters \lambda.
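A minimal sketch of the real-time update on the earlier toy problem (our simplification of RTHO; the paper's algorithm includes details we omit here, e.g. updating \lambda only every few steps and projecting onto \Lambda):

```python
import numpy as np

rng = np.random.default_rng(3)
X, y = rng.normal(size=(40, 5)), rng.normal(size=40)
Xval, yval = rng.normal(size=(20, 5)), rng.normal(size=20)
H = X.T @ X
grad_J = lambda w: X.T @ (X @ w - y)

eta, beta = 0.005, 1e-8       # η: the hyperparameter being tuned; β: its own step size
w, Z = np.zeros(5), np.zeros(5)
for t in range(200):
    g = grad_J(w)
    Z = (np.eye(5) - eta * H) @ Z - g         # forward-mode accumulator Z_t
    w = w - eta * g
    # Partial hypergradient ∇f_t(η) = ∇E(s_t) Z_t, applied without waiting for T:
    eta -= beta * (Xval.T @ (Xval @ w - yval) @ Z)
print(eta)                                    # η has drifted from its initial value
```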

  27. Real-Time HO algorithm
      Figure 6: Real-Time HO algorithm

  28. Analysis
      • Forward and reverse mode have different time/space tradeoffs.
      • Reverse mode needs to store the whole history of parameters.
      • Forward mode needs a matrix-matrix multiplication A_t Z_{t-1} (d \times d times d \times m) at each step.

  29. Conclusion

  30. Conclusions
      • Calculating hypergradients, the gradients with respect to hyperparameters, is central to selecting good hyperparameters.
      • We discussed two ways of calculating hypergradients: bilevel optimization and forward/reverse mode.
      • In bilevel optimization, we assume an optimal solution of the lower-level problem; in forward/reverse mode, we consider the whole process of the lower-level iterations.
      • Calculating hypergradients via bilevel optimization involves solving the lower-level problem and two second-order derivatives, both of which are computationally expensive.
      • Forward/reverse mode uses the chain rule, just as in training deep networks.

  31. Q & A
