Calculating Hypergradient
Jingchang Liu
November 13, 2019
HKUST
Table of Contents
• Background
• Bilevel optimization
• Forward and Reverse Gradient-Based Hyperparameter Optimization
• Conclusion
• Q & A
Background
Hyperparameter Optimization

Tradeoff parameter
• The dataset is split in two: $S_{\text{train}}$ and $S_{\text{test}}$.
• Suppose we add an $\ell_2$ norm as the regularization term; then
$$\min_{\lambda \in \mathcal{D}} \ \mathrm{loss}(S_{\text{test}}, X(\lambda)) \qquad (1)$$
$$\text{s.t.} \quad X(\lambda) \in \arg\min_{x \in \mathbb{R}^p} \ \mathrm{loss}(S_{\text{train}}, x) + e^{\lambda} \|x\|^2 .$$

Stepsize
For gradient descent with momentum:
$$v_t = \mu v_{t-1} + \nabla J_t(w_{t-1}), \qquad w_t = w_{t-1} - \eta\,\big(\mu v_{t-1} + \nabla J_t(w_{t-1})\big).$$
The hyperparameters are $\mu$ and $\eta$.
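As a concrete illustration of problem (1), the sketch below is my own example, not from the slides: it assumes a squared training loss so the inner problem is ridge regression with weight $e^\lambda$ and has a closed-form solution, and it simply evaluates the response function $\mathrm{loss}(S_{\text{test}}, X(\lambda))$ for a few values of $\lambda$.

```python
import numpy as np

# Minimal sketch of problem (1) with a squared loss (an assumption made for
# illustration): the inner problem is ridge regression with weight e^lambda,
#   X(lambda) = argmin_x 1/2 ||A_tr x - b_tr||^2 + e^lambda ||x||^2
#             = (A_tr^T A_tr + 2 e^lambda I)^{-1} A_tr^T b_tr,
# and the outer objective is the test loss evaluated at X(lambda).

rng = np.random.default_rng(0)
p, n_tr, n_te = 10, 80, 40
x_true = rng.standard_normal(p)
A_tr, A_te = rng.standard_normal((n_tr, p)), rng.standard_normal((n_te, p))
b_tr = A_tr @ x_true + 0.1 * rng.standard_normal(n_tr)
b_te = A_te @ x_true + 0.1 * rng.standard_normal(n_te)

def inner_solution(lam):
    """Closed-form minimizer X(lambda) of the regularized training loss."""
    return np.linalg.solve(A_tr.T @ A_tr + 2 * np.exp(lam) * np.eye(p),
                           A_tr.T @ b_tr)

def response(lam):
    """Outer objective f(lambda) = loss(S_test, X(lambda))."""
    r = A_te @ inner_solution(lam) - b_te
    return 0.5 * r @ r

for lam in [-4.0, -2.0, 0.0, 2.0]:
    print(lam, response(lam))
```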
Group Lasso

Traditional Group Lasso
To induce group sparsity in the parameter $w$, we solve
$$\hat{w} \in \arg\min_{w \in \mathbb{R}^p} \ \tfrac{1}{2}\|y - Xw\|^2 + \lambda \sum_{l=1}^{L} \|w_{\mathcal{G}_l}\|_2 , \qquad (2)$$
where the features are partitioned into $L$ groups $\{\mathcal{G}_1, \mathcal{G}_2, \ldots, \mathcal{G}_L\}$.
• But we need to specify the partition ourselves beforehand.
• How can we learn the partition?
Group Lasso

• Encapsulate the group structure in a hyperparameter $\theta = [\theta_1, \theta_2, \ldots, \theta_L] \in \{0,1\}^{P \times L}$, where $L$ is the maximum number of groups and $P$ is the number of features.
• $\theta_{p,l} = 1$ if the $p$-th feature belongs to the $l$-th group, and $0$ otherwise.

Formulation for learning $\theta$:
$$\hat{\theta} \in \arg\min_{\theta \in \{0,1\}^{P \times L}} C(\hat{w}(\theta)), \qquad (3)$$
where $C(\hat{w}(\theta))$ can be the validation loss $C(\hat{w}(\theta)) = \tfrac{1}{2}\|y' - X'\hat{w}(\theta)\|^2$, and
$$\hat{w}(\theta) = \arg\min_{w \in \mathbb{R}^{P}} \ \tfrac{1}{2}\|y - Xw\|^2 + \lambda \sum_{l=1}^{L} \|\theta_l \odot w\|_2 . \qquad (4)$$
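To make the roles of $\theta$ and $w$ concrete, here is a small sketch (my own code, not from the slides) that evaluates the inner objective of (4) for a given binary group-assignment matrix $\theta$; the particular partition used is hypothetical.

```python
import numpy as np

# Sketch of the inner objective in (4): each column theta[:, l] selects the
# features assigned to group l, and the penalty sums the l2 norms of the
# masked sub-vectors theta_l ⊙ w. (Illustrative code, not from the slides.)

def group_lasso_objective(w, X, y, theta, lam):
    """0.5 * ||y - X w||^2 + lam * sum_l ||theta[:, l] * w||_2."""
    residual = y - X @ w
    penalty = sum(np.linalg.norm(theta[:, l] * w) for l in range(theta.shape[1]))
    return 0.5 * residual @ residual + lam * penalty

rng = np.random.default_rng(0)
n, P, L = 50, 6, 2
X = rng.standard_normal((n, P))
w = rng.standard_normal(P)
y = X @ w + 0.1 * rng.standard_normal(n)

# Hypothetical partition: first three features in group 0, the rest in group 1.
theta = np.zeros((P, L))
theta[:3, 0] = 1.0
theta[3:, 1] = 1.0

print(group_lasso_objective(w, X, y, theta, lam=0.5))
```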
Bilevel optimization
Bilevel Optimization

Both examples can be cast as the following optimization problem:
$$\min_{x} \ f_U(x, y) \qquad \text{s.t.} \quad y \in \arg\min_{y'} f_L(x, y'). \qquad (5)$$
• $f_U$ is the upper-level objective, over the two variables $x$ and $y$.
• $f_L$ is the lower-level objective, which binds $y$ as a function of $x$.
• (5) can be viewed as a special case of constrained optimization.
• If we have the analytic solution $y^*(x)$ of the lower-level problem, then we only need to solve the single-level problem $\min_x f_U(x, y^*(x))$.
Gradient

Compute the gradient of the solution to the lower-level problem with respect to the variables in the upper-level problem:
$$x \leftarrow x - \eta \left( \frac{\partial f_U}{\partial x} + \frac{\partial f_U}{\partial y}\frac{\partial y}{\partial x} \right)\bigg|_{(x, y^*)} . \qquad (6)$$
How do we calculate $\partial y / \partial x$?

Theorem
Let $f : \mathbb{R} \times \mathbb{R} \to \mathbb{R}$ be a continuous function with first and second derivatives. Let $g(x) = \arg\min_y f(x, y)$. Then the derivative of $g$ with respect to $x$ is
$$\frac{dg(x)}{dx} = -\frac{f_{XY}(x, g(x))}{f_{YY}(x, g(x))}, \qquad (7)$$
where $f_{XY} = \frac{\partial^2 f}{\partial x\,\partial y}$ and $f_{YY} = \frac{\partial^2 f}{\partial y^2}$.
Proof

1. Since $g(x) = \arg\min_y f(x, y)$, we get $\frac{\partial f(x, y)}{\partial y}\big|_{y=g(x)} = 0$.
2. Differentiating both sides with respect to $x$, we get $\frac{d}{dx}\frac{\partial f(x, g(x))}{\partial y} = 0$.
3. By the chain rule,
$$\frac{d}{dx}\frac{\partial f(x, g(x))}{\partial y} = \frac{\partial^2 f(x, g(x))}{\partial x\,\partial y} + \frac{\partial^2 f(x, g(x))}{\partial y^2}\frac{dg(x)}{dx}. \qquad (8)$$
Setting this to zero and rearranging gives
$$\frac{dg(x)}{dx} = -\left(\frac{\partial^2 f(x, g(x))}{\partial y^2}\right)^{-1}\frac{\partial^2 f(x, g(x))}{\partial x\,\partial y} \qquad (9)$$
$$= -\frac{f_{XY}(x, g(x))}{f_{YY}(x, g(x))}. \qquad (10)$$
Lemma

Lemma 1
Let $f : \mathbb{R} \times \mathbb{R}^n \to \mathbb{R}$ be a continuous function with first and second derivatives. Let $g(x) = \arg\min_{y \in \mathbb{R}^n} f(x, y)$. Then the derivative of $g$ with respect to $x$ is
$$g'(x) = -f_{YY}(x, g(x))^{-1} f_{XY}(x, g(x)), \qquad (11)$$
where $f_{YY} = \nabla^2_{yy} f(x, y) \in \mathbb{R}^{n \times n}$ and $f_{XY} = \frac{\partial}{\partial x}\nabla_y f(x, y) \in \mathbb{R}^n$.
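Lemma 1 is easy to check numerically. The sketch below is my own example (the quadratic inner objective and the matrices $A$, $b$ are chosen purely for illustration): it compares the implicit-differentiation formula with a finite-difference estimate of $g'(x)$.

```python
import numpy as np

# Numerical check of Lemma 1 (implicit differentiation).
# Inner objective (chosen for illustration, not from the slides):
#   f(x, y) = 1/2 y^T A y - x b^T y,  with A symmetric positive definite.
# Its minimizer is g(x) = x A^{-1} b, so the true derivative is g'(x) = A^{-1} b.

rng = np.random.default_rng(0)
n = 5
M = rng.standard_normal((n, n))
A = M @ M.T + n * np.eye(n)          # SPD Hessian f_YY
b = rng.standard_normal(n)

def g(x):
    """Minimizer of f(x, .), computed in closed form."""
    return x * np.linalg.solve(A, b)

x0 = 1.3
f_YY = A                              # nabla^2_yy f
f_XY = -b                             # d/dx of grad_y f = A y - x b
dg_lemma = -np.linalg.solve(f_YY, f_XY)

# Finite-difference check of dg/dx.
eps = 1e-6
dg_fd = (g(x0 + eps) - g(x0 - eps)) / (2 * eps)

print(np.allclose(dg_lemma, dg_fd, atol=1e-5))   # expected: True
```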
Application to Hyperparameter Optimization (ICML 2016)

Hyperparameter optimization
$$\min_{\lambda \in \mathcal{D}} \ \mathrm{loss}(S_{\text{test}}, X(\lambda)) \qquad (12)$$
$$\text{s.t.} \quad X(\lambda) \in \arg\min_{x \in \mathbb{R}^p} \ \mathrm{loss}(S_{\text{train}}, x) + e^{\lambda}\|x\|^2 .$$

Gradient descent for the bilevel problem
$$x \leftarrow x - \eta \left( \frac{\partial f_U}{\partial x} + \frac{\partial f_U}{\partial y}\frac{\partial y}{\partial x} \right)\bigg|_{(x, y^*)} \qquad (13)$$
$$= x - \eta \left( \frac{\partial f_U}{\partial x} - \frac{\partial f_U}{\partial y}\left(\frac{\partial^2 f_L(x, g(x))}{\partial y^2}\right)^{-1}\frac{\partial^2 f_L(x, g(x))}{\partial x\,\partial y} \right) \qquad (14)$$

Gradient (HOAG notation: inner objective $h$, outer criterion $g$)
$$\nabla f = \nabla_2 g - \left(\nabla^2_{1,2} h\right)^T \left(\nabla^2_{1} h\right)^{-1} \nabla_1 g .$$
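The sketch below is my own illustration of this implicit hypergradient for problem (12): it reuses the synthetic ridge setup from the earlier sketch, assumes a squared test loss, and checks the formula against a finite-difference estimate.

```python
import numpy as np

# Implicit hypergradient for problem (12) with squared losses (an assumption
# made for illustration). Inner objective:
#   h(x, lam) = 1/2 ||A_tr x - b_tr||^2 + e^lam ||x||^2,
# outer criterion:
#   g(x, lam) = 1/2 ||A_te x - b_te||^2   (no direct dependence on lam).
# Hypergradient: df/dlam = -(grad^2_{1,2} h)^T (grad^2_1 h)^{-1} grad_1 g.

rng = np.random.default_rng(0)
p, n_tr, n_te = 10, 80, 40
x_true = rng.standard_normal(p)
A_tr, A_te = rng.standard_normal((n_tr, p)), rng.standard_normal((n_te, p))
b_tr = A_tr @ x_true + 0.1 * rng.standard_normal(n_tr)
b_te = A_te @ x_true + 0.1 * rng.standard_normal(n_te)

def inner_solution(lam):
    return np.linalg.solve(A_tr.T @ A_tr + 2 * np.exp(lam) * np.eye(p),
                           A_tr.T @ b_tr)

def response(lam):
    r = A_te @ inner_solution(lam) - b_te
    return 0.5 * r @ r

def hypergradient(lam):
    x = inner_solution(lam)
    grad_x_g = A_te.T @ (A_te @ x - b_te)                 # grad_1 g
    h_xx = A_tr.T @ A_tr + 2 * np.exp(lam) * np.eye(p)    # grad^2_1 h
    h_xlam = 2 * np.exp(lam) * x                          # grad^2_{1,2} h (a vector here, m = 1)
    return -h_xlam @ np.linalg.solve(h_xx, grad_x_g)

lam0, eps = 0.3, 1e-5
fd = (response(lam0 + eps) - response(lam0 - eps)) / (2 * eps)
print(hypergradient(lam0), fd)   # the two numbers should agree closely
```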
HOAG
Analysis

Conclusion
• If the sequence $\{\epsilon_i\}_{i=1}^{\infty}$ is summable, then HOAG converges to a stationary point of $f$.

Theorem
If the sequence $\{\epsilon_i\}_{i=1}^{\infty}$ is positive and verifies $\sum_{i=1}^{\infty} \epsilon_i < \infty$, then the sequence $\lambda_k$ of iterates in the HOAG algorithm has a limit $\lambda^* \in \mathcal{D}$. In particular, if $\lambda^*$ belongs to the interior of $\mathcal{D}$, then $\nabla f(\lambda^*) = 0$.
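To see how the summable tolerance sequence is used, here is a rough paraphrase of the HOAG loop in code. This is my own sketch under the same synthetic ridge setup, not the authors' implementation; the choice $\epsilon_k = 0.9^k$, the outer step size, and the box $[-6, 6]$ for $\lambda$ are all illustrative assumptions.

```python
import numpy as np

# Rough paraphrase of the HOAG idea (illustrative sketch, not the authors' code):
# at outer iteration k the inner problem is only solved until the inner gradient
# norm falls below a tolerance eps_k with sum_k eps_k < infinity (here 0.9**k),
# and the implicit hypergradient from the previous slide is formed at that
# approximate solution. Same ridge setup as in the earlier sketches.

rng = np.random.default_rng(0)
p, n_tr, n_te = 10, 80, 40
x_true = rng.standard_normal(p)
A_tr, A_te = rng.standard_normal((n_tr, p)), rng.standard_normal((n_te, p))
b_tr = A_tr @ x_true + 0.1 * rng.standard_normal(n_tr)
b_te = A_te @ x_true + 0.1 * rng.standard_normal(n_te)

lam, x = 0.0, np.zeros(p)
outer_step = 1e-2
for k in range(50):
    eps_k = 0.9 ** k                       # summable tolerance sequence
    # 1) inexact inner solve: gradient descent until ||grad_x h|| <= eps_k
    H = A_tr.T @ A_tr + 2 * np.exp(lam) * np.eye(p)
    step = 1.0 / np.linalg.norm(H, 2)
    while True:
        grad_x_h = A_tr.T @ (A_tr @ x - b_tr) + 2 * np.exp(lam) * x
        if np.linalg.norm(grad_x_h) <= eps_k:
            break
        x -= step * grad_x_h
    # 2) linear system (solved exactly here for simplicity; HOAG also allows
    #    solving it only up to tolerance eps_k)
    q = np.linalg.solve(H, A_te.T @ (A_te @ x - b_te))
    # 3) approximate hypergradient and projected update of lambda
    p_k = -(2 * np.exp(lam) * x) @ q
    lam = float(np.clip(lam - outer_step * p_k, -6.0, 6.0))

print("learned lambda:", lam)
```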
Forward and Reverse Gradient-Based Hyperparameter Optimization
Formulation I

• Focus on the training procedure of an objective function $J(w)$ with respect to $w$.
• The training procedure of SGD or its variants, such as momentum, RMSProp and ADAM, can be regarded as a dynamical system with a state $s_t \in \mathbb{R}^d$:
$$s_t = \Phi_t(s_{t-1}, \lambda), \qquad t = 1, \ldots, T.$$
• For gradient descent with momentum:
$$v_t = \mu v_{t-1} + \nabla J_t(w_{t-1}), \qquad w_t = w_{t-1} - \eta\,\big(\mu v_{t-1} + \nabla J_t(w_{t-1})\big),$$
so that
  1. $s_t = (w_t, v_t)$, $s_t \in \mathbb{R}^d$;
  2. $\lambda = (\mu, \eta)$, $\lambda \in \mathbb{R}^m$;
  3. $\Phi_t : \mathbb{R}^d \times \mathbb{R}^m \to \mathbb{R}^d$.
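A minimal sketch of the state map $\Phi_t$ for momentum, with $s_t = (w_t, v_t)$ and $\lambda = (\mu, \eta)$. This is my own code; the quadratic objective $J$ is assumed purely so that its gradient is available in closed form.

```python
import numpy as np

# State map Phi for gradient descent with momentum, written as a dynamical
# system s_t = Phi(s_{t-1}, lambda) with s = (w, v) and lambda = (mu, eta).
# J is assumed quadratic here so that grad J has a closed form.

rng = np.random.default_rng(0)
d = 4
M = rng.standard_normal((d, d))
Q = M @ M.T + d * np.eye(d)
c = rng.standard_normal(d)

def grad_J(w):
    return Q @ w - c          # gradient of J(w) = 1/2 w^T Q w - c^T w

def Phi(state, lam):
    """One momentum step: (w, v) -> (w_new, v_new)."""
    w, v = state
    mu, eta = lam
    g = grad_J(w)
    v_new = mu * v + g
    w_new = w - eta * (mu * v + g)     # equivalently w - eta * v_new
    return (w_new, v_new)

lam = (0.9, 0.05)
state = (np.zeros(d), np.zeros(d))
for t in range(100):
    state = Phi(state, lam)
print(np.linalg.norm(grad_J(state[0])))   # should be small after 100 steps
```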
Formulation II

• The iterates $s_1, \ldots, s_T$ implicitly depend on the vector of hyperparameters $\lambda$.
• Goal: optimize the hyperparameters according to a certain error function $E$ evaluated at the last iterate $s_T$.
• We wish to solve the problem $\min_{\lambda \in \Lambda} f(\lambda)$, where the set $\Lambda \subset \mathbb{R}^m$ incorporates constraints on the hyperparameters.
• The response function $f : \mathbb{R}^m \to \mathbb{R}$ is defined for $\lambda \in \mathbb{R}^m$ by $f(\lambda) = E(s_T(\lambda))$.
Diagram

Figure 1: The iterates $s_1, \ldots, s_T$ depend on the hyperparameters $\lambda$.

• Change the bilevel program to use the parameters at the last iterate $s_T$ rather than $\hat{w}$:
$$\min_{\lambda \in \Lambda} f(\lambda), \qquad \text{where } f(\lambda) = E(s_T(\lambda)).$$
• The hypergradient is $\nabla f(\lambda) = \nabla E(s_T)\,\frac{d s_T}{d \lambda}$.
Forward Mode for Calculating the Hypergradient

• Chain rule: $\nabla f(\lambda) = \nabla E(s_T)\,\frac{d s_T}{d \lambda}$, where $\frac{d s_T}{d \lambda}$ is a $d \times m$ matrix.
• Since $s_t = \Phi_t(s_{t-1}, \lambda)$, $\Phi_t$ depends on $\lambda$ both directly and indirectly through the state $s_{t-1}$:
$$\frac{d s_t}{d \lambda} = \frac{\partial \Phi_t(s_{t-1}, \lambda)}{\partial s_{t-1}}\frac{d s_{t-1}}{d \lambda} + \frac{\partial \Phi_t(s_{t-1}, \lambda)}{\partial \lambda}.$$
• Defining $Z_t = \frac{d s_t}{d \lambda}$, we rewrite this as
$$Z_t = A_t Z_{t-1} + B_t, \qquad t \in \{1, \ldots, T\}.$$
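A runnable sketch of the forward recurrence (my own code, not the authors' implementation): it uses plain gradient descent on a quadratic $J$ with a single hyperparameter $\lambda = \eta$, so $A_t$ and $B_t$ have closed forms, and it checks the result against finite differences.

```python
import numpy as np

# Forward-mode hypergradient (Z_t = A_t Z_{t-1} + B_t) for plain gradient
# descent on a quadratic training loss, with a single hyperparameter
# lambda = eta (the learning rate). Illustrative sketch only.

rng = np.random.default_rng(0)
d, T = 4, 60
M = rng.standard_normal((d, d))
Q = M @ M.T + d * np.eye(d)
c = rng.standard_normal(d)
A_val = rng.standard_normal((20, d))
b_val = rng.standard_normal(20)

def grad_J(w):               # training gradient, J(w) = 1/2 w^T Q w - c^T w
    return Q @ w - c

def E(w):                    # validation error evaluated at the last iterate
    r = A_val @ w - b_val
    return 0.5 * r @ r

def run(eta):
    w = np.zeros(d)
    Z = np.zeros(d)          # Z_t = d w_t / d eta (here m = 1)
    for t in range(T):
        A_t = np.eye(d) - eta * Q        # d Phi_t / d s_{t-1}
        B_t = -grad_J(w)                 # d Phi_t / d eta
        Z = A_t @ Z + B_t                # forward recurrence
        w = w - eta * grad_J(w)          # the actual update Phi_t
    grad_E = A_val.T @ (A_val @ w - b_val)
    return E(w), grad_E @ Z              # f(eta) and its hypergradient

eta0, eps = 0.02, 1e-6
_, hg = run(eta0)
fd = (run(eta0 + eps)[0] - run(eta0 - eps)[0]) / (2 * eps)
print(hg, fd)                             # should agree closely
```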
Forward-Mode Recurrence

Figure 2: Recurrence of the forward-mode computation.
Forward-HG Algorithm

Figure 3: Forward-HG algorithm.
Reverse Mode for Calculating the Hypergradient

Reformulate the original problem as the constrained optimization problem
$$\min_{\lambda, s_1, \ldots, s_T} E(s_T) \qquad \text{s.t.} \quad s_t = \Phi_t(s_{t-1}, \lambda), \quad t \in \{1, \ldots, T\}.$$

Lagrangian
$$\mathcal{L}(s, \lambda, \alpha) = E(s_T) + \sum_{t=1}^{T} \alpha_t \big(\Phi_t(s_{t-1}, \lambda) - s_t\big).$$
Partial Derivatives of the Lagrangian

(The detailed derivations follow on the next slide.)
Derivations

Notation
$$A_t = \frac{\partial \Phi_t(s_{t-1}, \lambda)}{\partial s_{t-1}}, \qquad B_t = \frac{\partial \Phi_t(s_{t-1}, \lambda)}{\partial \lambda},$$
where $A_t \in \mathbb{R}^{d \times d}$ and $B_t \in \mathbb{R}^{d \times m}$.

Setting $\frac{\partial \mathcal{L}}{\partial s_t} = 0$ and $\frac{\partial \mathcal{L}}{\partial s_T} = 0$ gives
$$\alpha_t = \begin{cases} \nabla E(s_T) & \text{if } t = T, \\ \nabla E(s_T)\, A_T \cdots A_{t+1} & \text{if } t \in \{1, \ldots, T-1\}. \end{cases} \qquad (15)$$

Since $\frac{\partial \mathcal{L}}{\partial \lambda} = \sum_{t=1}^{T} \alpha_t B_t$,
$$\frac{\partial \mathcal{L}}{\partial \lambda} = \sum_{t=1}^{T} \nabla E(s_T) \left( \prod_{s=t+1}^{T} A_s \right) B_t .$$
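A matching sketch of the reverse accumulation (again my own code, same quadratic setup with $\lambda = \eta$, not the paper's implementation): the products defining $\alpha_t$ are built by sweeping backward over the stored trajectory.

```python
import numpy as np

# Reverse-mode hypergradient for the same quadratic example: run the forward
# training pass storing the trajectory, then sweep backward accumulating
# alpha_t = grad E(s_T) A_T ... A_{t+1} and summing alpha_t B_t.
# Illustrative sketch with a single hyperparameter eta.

rng = np.random.default_rng(0)
d, T = 4, 60
M = rng.standard_normal((d, d))
Q = M @ M.T + d * np.eye(d)
c = rng.standard_normal(d)
A_val = rng.standard_normal((20, d))
b_val = rng.standard_normal(20)

def grad_J(w):
    return Q @ w - c

def reverse_hypergradient(eta):
    # forward pass: store every iterate (this is the memory cost of reverse mode)
    traj = [np.zeros(d)]
    for t in range(T):
        traj.append(traj[-1] - eta * grad_J(traj[-1]))
    w_T = traj[-1]
    # backward pass
    alpha = A_val.T @ (A_val @ w_T - b_val)   # alpha_T = grad E(s_T)
    hg = 0.0
    for t in range(T, 0, -1):
        w_prev = traj[t - 1]
        B_t = -grad_J(w_prev)                 # d Phi_t / d eta
        hg += alpha @ B_t                     # accumulate alpha_t B_t
        A_t = np.eye(d) - eta * Q             # d Phi_t / d s_{t-1}
        alpha = alpha @ A_t                   # alpha_{t-1} = alpha_t A_t
    return hg

print(reverse_hypergradient(0.02))  # should match the forward-mode value above
```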
Reverse-HG Algorithm

Figure 4: Reverse-HG algorithm.

Truncated back-propagation: only run the backward sweep from $t = T - 1$ down to $t = T - k$.
Real-Time HO

• For $t \in \{1, \ldots, T\}$, define $f_t(\lambda) = E(s_t(\lambda))$.
• Partial hypergradients are available in forward mode:
$$\nabla f_t(\lambda) = \frac{d E(s_t)}{d \lambda} = \nabla E(s_t)\, Z_t .$$
• Significance: we can update the hyperparameters within a single epoch, without having to wait until time $T$ (see the sketch after this slide).

Figure 5: The iterates $s_1, \ldots, s_T$ depend on the hyperparameters $\lambda$.
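A rough sketch of this idea (my own code, same quadratic setup, single hyperparameter $\eta$): the running $Z_t$ is used to take a hyperparameter step every $K$ iterations. The update interval $K$, the meta step size, and the clipping range for $\eta$ are arbitrary illustrative choices, not taken from the paper.

```python
import numpy as np

# Real-time variant (rough sketch): maintain Z_t = d s_t / d eta along the
# forward pass and take a hyperparameter step every K iterations using the
# partial hypergradient grad E(s_t) Z_t. Once eta starts changing, Z is only
# an approximation of the true derivative, which is the trade-off accepted here.

rng = np.random.default_rng(0)
d, T, K = 4, 200, 10
M = rng.standard_normal((d, d))
Q = M @ M.T + d * np.eye(d)
c = rng.standard_normal(d)
A_val = rng.standard_normal((20, d))
b_val = rng.standard_normal(20)

def grad_J(w):
    return Q @ w - c

eta, meta_step = 0.01, 1e-5
w, Z = np.zeros(d), np.zeros(d)
for t in range(1, T + 1):
    A_t = np.eye(d) - eta * Q
    B_t = -grad_J(w)
    Z = A_t @ Z + B_t                    # running Z_t
    w = w - eta * grad_J(w)
    if t % K == 0:                       # real-time hyperparameter update
        grad_E = A_val.T @ (A_val @ w - b_val)
        # clip eta to a safe range so the training dynamics stay stable
        eta = float(np.clip(eta - meta_step * (grad_E @ Z), 1e-4, 0.1))

print("adapted eta:", eta)
```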
Real-Time HO Algorithm

Figure 6: Real-Time HO algorithm.
Analysis

• Forward and reverse mode have different time/space tradeoffs.
• Reverse mode needs to store the whole history of iterates $s_1, \ldots, s_T$.
• Forward mode needs to propagate the $d \times m$ matrix $Z_t$, i.e. a matrix-matrix product at every step.
Conclusion
Conclusions

• Calculating hypergradients, the gradients with respect to the hyperparameters, is very important for selecting good hyperparameters.
• We discussed two ways of calculating hypergradients: bilevel optimization (implicit differentiation) and forward/reverse mode (iterative differentiation).
• In bilevel optimization, we assume access to an optimal solution of the lower-level problem; in forward/reverse mode, we differentiate through the whole lower-level iteration process.
• Calculating hypergradients via bilevel optimization involves solving the lower-level problem and working with second-order derivatives, both of which are expensive.
• Forward/reverse mode only uses the chain rule, just as in training deep nets.
Q & A