On the Iteration Complexity of Hypergradient Computation


  1. On the Iteration Complexity of Hypergradient Computation

     Riccardo Grazzi
     Computational Statistics and Machine Learning, Istituto Italiano di Tecnologia.
     Department of Computer Science, University College London.
     riccardo.grazzi@iit.it

     Joint work with Luca Franceschi, Massimiliano Pontil and Saverio Salzo.

  2. Bilevel Optimization Problem

     min_{λ ∈ Λ ⊆ ℝ^m}  g(λ) := F(x(λ), λ)   (upper-level)
     x(λ) = Φ(x(λ), λ)                        (lower-level)

     • Hyperparameter optimization, meta-learning.
     • Graph and recurrent neural networks.

     How can we solve this optimization problem?
     • Black-box methods (random/grid search, Bayesian optimization, ...).
     • Gradient-based methods exploiting the hypergradient ∇g(λ).

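To make the bilevel structure concrete, here is a minimal Python sketch (an illustration, not the authors' code) of how g(λ) is evaluated in practice: run the fixed-point iteration of Φ(·, λ) for t steps and plug the approximate lower-level solution into F. The callables Phi and F and all names below are hypothetical placeholders.

    def approx_g(lam, Phi, F, x0, t=100):
        """Approximate g(lam) = F(x(lam), lam) by iterating Phi(., lam) t times from x0."""
        x = x0
        for _ in range(t):
            x = Phi(x, lam)   # x_j = Phi(x_{j-1}, lam)
        return F(x, lam)      # g_t(lam) = F(x_t(lam), lam)

Since x(λ) is only reached in the limit, every quantity built on top of it, including g and (as discussed next) ∇g, can only be computed approximately.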

  4. Computing the Hypergradient ∇g(λ)

     ∇g(λ) can be really expensive or even impossible to compute exactly. Two common approximation strategies are:
     1. Iterative Differentiation (ITD).
     2. Approximate Implicit Differentiation (AID).

     Which one is the best?
     • Previous works provide mostly qualitative and empirical results.

     Can we have quantitative results on the approximation error?
     • Yes! If the fixed point map Φ(·, λ) is a contraction.

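For reference (a standard definition, made explicit here): Φ(·, λ) is a contraction with constant q_λ < 1 when ‖Φ(x, λ) − Φ(x′, λ)‖ ≤ q_λ ‖x − x′‖ for all x, x′. In that case the lower-level iteration converges to the unique fixed point x(λ) at a linear rate, ‖x_t(λ) − x(λ)‖ ≤ q_λ^t ‖x_0(λ) − x(λ)‖, which is what makes quantitative, non-asymptotic error bounds possible.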

  6. Our Contributions

     Upper bounds on the approximation error for both ITD and AID:
     • Both methods achieve non-asymptotic linear convergence rates.
     • We prove that ITD is generally worse than AID in terms of upper bounds.

     Extensive experimental comparison among different AID strategies and ITD:
     • If Φ(·, λ) is a contraction, the results confirm the theory.
     • If Φ(·, λ) is NOT a contraction, ITD can still be a reliable strategy.

     [Figure: hypergradient approximation error (log scale) versus the number of lower-level iterations t on four problems (Logistic Regression, Kernel Ridge Regression, Biased Regularization, Hyper Representation), comparing ITD with AID variants that use a fixed-point (FP) or conjugate gradient (CG) linear solver, with k = t and k = 10.]

  7. Motivation

     • Hyperparameter optimization (learn the kernel/regularization, ...).
     • Meta-learning (MAML, L2LOpt, ...).
     • Graph Neural Networks.
     • Some Recurrent Models.
     • Deep Equilibrium Models.

     All can be cast into the same bilevel framework, where at the lower level we seek the solution to a parametric fixed point equation.

     (Image sources: S. Ravi, H. Larochelle (2016); snap.stanford.edu/proj/embeddings-www.)


  10. Example: Optimizing the Regularization Hyperparameter in Ridge Regression

      min_{λ ∈ (0, ∞)}  (1/2) ‖Y_val x(λ) − z_val‖₂²

      x(λ) = argmin_{x ∈ ℝ^d}  { ℓ(x, λ) := (1/2) ‖Y x − z‖₂² + (λ/2) ‖x‖₂² }

      x(λ) is the unique fixed point of the one-step GD map
      Φ(x, λ) = x − β ∇₁ℓ(x, λ).
      If the step size β is sufficiently small, Φ(·, λ) is also a contraction.
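As an illustration of this example (a sketch built only from the definitions above, not the authors' code), the one-step GD map, the validation objective and the step-size condition for the contraction property can be written in NumPy as follows.

    import numpy as np

    def make_ridge_maps(Y, z, Y_val, z_val):
        """Lower-level map Phi and upper-level objective F for ridge regression."""
        H = Y.T @ Y                      # Hessian of the data-fit term
        L = np.linalg.eigvalsh(H).max()  # largest eigenvalue of Y^T Y

        def Phi(x, lam, beta):
            # one step of gradient descent on l(x, lam) = 0.5*||Yx - z||^2 + 0.5*lam*||x||^2
            grad = H @ x - Y.T @ z + lam * x
            return x - beta * grad

        def F(x, lam):
            # validation error 0.5*||Y_val x - z_val||^2
            r = Y_val @ x - z_val
            return 0.5 * r @ r

        # Phi(., lam) is a contraction whenever 0 < beta < 2 / (L + lam):
        # its Lipschitz constant is max_i |1 - beta * (mu_i + lam)| < 1,
        # where the mu_i are the eigenvalues of Y^T Y.
        return Phi, F, L

With, say, β = 1/(L + λ), iterating Φ(·, λ) converges linearly to the unique ridge solution x(λ) = (YᵀY + λI)⁻¹ Yᵀz.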

  11. The Bilevel Framework

      min_{λ ∈ Λ ⊆ ℝ^m}  g(λ) := F(x(λ), λ)   (upper-level)
      x(λ) = Φ(x(λ), λ)                        (lower-level)

      • g is usually non-convex and expensive or impossible to evaluate exactly.
      • x(λ) ∈ ℝ^d is often not available in closed form.
      • ∇g is even harder to evaluate.

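Why ∇g is harder, in one line (standard implicit differentiation, spelled out here for context): by the chain rule ∇g(λ) = ∇₂F(x(λ), λ) + x′(λ)ᵀ ∇₁F(x(λ), λ), and differentiating the fixed point equation x(λ) = Φ(x(λ), λ) gives x′(λ) = (I − ∂₁Φ(x(λ), λ))⁻¹ ∂₂Φ(x(λ), λ) whenever I − ∂₁Φ is invertible (which holds when Φ(·, λ) is a contraction). Hence

    ∇g(λ) = ∇₂F(x(λ), λ) + ∂₂Φ(x(λ), λ)ᵀ (I − ∂₁Φ(x(λ), λ)ᵀ)⁻¹ ∇₁F(x(λ), λ),

which requires both the exact fixed point x(λ) and the solution of a d-dimensional linear system; the two strategies on the next slide approximate precisely these two ingredients.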

  13. How to Compute the Hypergradient ∇g(λ)?

      Iterative Differentiation (ITD)
      1. Set x_0(λ) = 0 and compute, for j = 1, 2, ..., t,
             x_j(λ) = Φ(x_{j−1}(λ), λ).
      2. Compute g_t(λ) = F(x_t(λ), λ).
      3. Compute ∇g_t(λ) efficiently using reverse (RMAD) or forward (FMAD) mode automatic differentiation.

      Approximate Implicit Differentiation (AID)
      1. Get x_t(λ) with t steps of a lower-level solver.
      2. Compute w_{t,k}(λ) with k steps of a solver for the linear system
             (I − ∂₁Φ(x_t(λ), λ)ᵀ) w = ∇₁F(x_t(λ), λ).
      3. Compute the approximate gradient as
             ∇̂g(λ) := ∇₂F(x_t(λ), λ) + ∂₂Φ(x_t(λ), λ)ᵀ w_{t,k}(λ).

      Which one is the best?

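To make the two procedures concrete, here is a minimal NumPy sketch (a hypothetical illustration, not the authors' code) instantiating both ITD and AID for the ridge regression example of slide 10, where the partial Jacobians of Φ are available in closed form: ∂₁Φ(x, λ) = I − β(YᵀY + λI) and ∂₂Φ(x, λ) = −βx, while ∇₂F = 0. ITD is written in forward mode (FMAD) for readability; in practice reverse mode (RMAD) through an autodiff library is the usual choice, and the conjugate-gradient AID variant from the experiments would simply swap in a different solver for the linear system. The sketch assumes 0 < β < 2/(λ_max(YᵀY) + λ), so that Φ(·, λ) is a contraction.

    import numpy as np

    def hypergrad_ridge(lam, Y, z, Y_val, z_val, beta, t=100, k=100):
        """ITD and AID approximations of the hypergradient for ridge regression."""
        d = Y.shape[1]
        H = Y.T @ Y
        b = Y.T @ z
        A = np.eye(d) - beta * (H + lam * np.eye(d))   # d1Phi (constant in x here)

        # ITD, forward mode: propagate dx_j/dlam alongside the lower-level iterates.
        x = np.zeros(d)                                # x_0(lam) = 0
        dx = np.zeros(d)                               # dx_0/dlam = 0
        for _ in range(t):
            dx = A @ dx - beta * x                     # d1Phi dx_{j-1} + d2Phi(x_{j-1})
            x = A @ x + beta * b                       # x_j = Phi(x_{j-1}, lam)
        grad1F = Y_val.T @ (Y_val @ x - z_val)         # nabla_1 F(x_t, lam)
        itd_grad = dx @ grad1F                         # nabla g_t(lam)   (nabla_2 F = 0)

        # AID: k fixed-point steps on (I - d1Phi^T) w = nabla_1 F(x_t, lam).
        w = np.zeros(d)
        for _ in range(k):
            w = A.T @ w + grad1F
        aid_grad = (-beta * x) @ w                     # d2Phi(x_t)^T w   (nabla_2 F = 0)

        return itd_grad, aid_grad

Both quantities converge to the exact hypergradient ∇g(λ) as t (and, for AID, k) grow; the paper quantifies how fast.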

  16. A First Comparison

      ITD
      • Ignores the bilevel structure.
      • Cost in time (RMAD): O(Cost(g_t(λ))).
      • Cost in memory (RMAD): O(t d).
      • Can we control ‖∇g_t(λ) − ∇g(λ)‖?

      AID
      • Can use any lower-level solver.
      • Cost in time (k = t): O(Cost(g_t(λ))).
      • Cost in memory: O(d).
      • Can we control ‖∇̂g(λ) − ∇g(λ)‖?

      Here g_t(λ) = F(x_t(λ), λ).
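As a rough worked example of the memory column (numbers chosen for illustration, not taken from the slides): with a lower-level variable of dimension d = 10⁶ and t = 1000 lower-level steps, RMAD-based ITD must store on the order of t·d = 10⁹ numbers (roughly 4 GB in single precision) to replay the trajectory in the backward pass, whereas AID only keeps a few vectors of size d, i.e. a few megabytes, independently of t and k.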

  17. Previous Work on the Approximation Error

      ITD
      • (Franceschi et al., 2018): argmin g_t → argmin g as t → ∞.
      • We provide non-asymptotic upper bounds on ‖∇g_t(λ) − ∇g(λ)‖.

      AID
      • (Pedregosa, 2016): ‖∇̂g(λ) − ∇g(λ)‖ → 0 as t, k → ∞.
      • (Rajeswaran et al., 2019): ‖∇̂g(λ) − ∇g(λ)‖ → 0 at a linear rate in t and k for meta-learning with biased regularization.
      • We provide non-asymptotic upper bounds on ‖∇̂g(λ) − ∇g(λ)‖.

      Here g_t(λ) = F(x_t(λ), λ).
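To read this slide: an asymptotic guarantee only says that the approximation error eventually vanishes, while a non-asymptotic linear rate controls the error for every finite t (and k) by an explicit quantity that decays geometrically, driven by the contraction constant q_λ < 1 of Φ(·, λ), roughly of the order of q_λ^t for ITD plus, for AID, a term that decays geometrically in the number k of linear-system iterations. The exact constants and rates are stated in the paper.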
