On the Iteration Complexity of Hypergradient Computation

Riccardo Grazzi
Computational Statistics and Machine Learning, Istituto Italiano di Tecnologia.
Department of Computer Science, University College London.
riccardo.grazzi@iit.it

Joint work with Luca Franceschi, Massimiliano Pontil and Saverio Salzo.
Bilevel Optimization Problem

$\min_{\lambda \in \Lambda \subseteq \mathbb{R}^m} f(\lambda) := F(x(\lambda), \lambda)$  (upper-level)
$x(\lambda) = \Phi(x(\lambda), \lambda)$  (lower-level)

• Hyperparameter optimization, meta-learning.
• Graph and recurrent neural networks.

How can we solve this optimization problem?
• Black-box methods (random/grid search, Bayesian optimization, ...).
• Gradient-based methods exploiting the hypergradient $\nabla f(\lambda)$.
Computing the Hypergradient $\nabla f(\lambda)$

$\nabla f(\lambda)$ can be very expensive or even impossible to compute exactly. Two common approximation strategies are
1. Iterative Differentiation (ITD).
2. Approximate Implicit Differentiation (AID).

Which one is the best?
• Previous works provide mostly qualitative and empirical results.

Can we have quantitative results on the approximation error?
• Yes! If the fixed point map $\Phi(\cdot, \lambda)$ is a contraction.
Our Contributions

Upper bounds on the approximation error for both ITD and AID:
• Both methods achieve non-asymptotic linear convergence rates.
• We prove that ITD is generally worse than AID in terms of upper bounds.

Extensive experimental comparison among different AID strategies and ITD:
• If $\Phi(\cdot, \lambda)$ is a contraction, the results confirm the theory.
• If $\Phi(\cdot, \lambda)$ is NOT a contraction, ITD can still be a reliable strategy.

[Figure: approximation error $\|\nabla f(\lambda) - g(\lambda)\|$ versus the number of lower-level iterations $t$ on four problems (Logistic Regression, Kernel Ridge Regression, Biased Regularization, Hyper Representation), comparing ITD with AID variants (FP and CG solvers, with $k = t$ and $k = 10$).]
Motivation

• Hyperparameter optimization (learn the kernel/regularization, ...).
• Meta-learning (MAML, L2LOpt, ...).
(Image source: S. Ravi, H. Larochelle, 2016.)
• Graph Neural Networks.
• Some Recurrent Models.
• Deep Equilibrium Models.
(Image source: snap.stanford.edu/proj/embeddings-www)

All of these can be cast into the same bilevel framework, where at the lower level we seek the solution to a parametric fixed point equation.
Example: Optimizing the Regularization Hyperparameter in Ridge Regression

$\min_{\lambda \in (0, \infty)} f(\lambda) = \frac{1}{2}\,\|X_{\mathrm{val}}\, x(\lambda) - z_{\mathrm{val}}\|_2^2$  (upper-level)
$x(\lambda) = \arg\min_{x \in \mathbb{R}^d} \left\{\, \ell(x, \lambda) := \frac{1}{2}\|X x - z\|_2^2 + \frac{\lambda}{2}\|x\|_2^2 \,\right\}$  (lower-level)

$x(\lambda)$ is the unique fixed point of the one-step gradient descent map
$\Phi(x, \lambda) = x - \beta \nabla_1 \ell(x, \lambda)$.
If the step size $\beta$ is sufficiently small, $\Phi(\cdot, \lambda)$ is also a contraction.
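To see why the last claim holds in this example, note that the inner loss is quadratic, so $\Phi(\cdot, \lambda)$ is an affine map whose Lipschitz constant can be bounded directly. A minimal check, where the symbol $L$ (the largest eigenvalue of $X^\top X$) is introduced only for this sketch and does not appear in the slides:

```latex
% Contraction check for the one-step gradient descent map in ridge regression.
% L denotes the largest eigenvalue of X^T X (notation introduced for this sketch).
\begin{align*}
\Phi(x, \lambda) &= x - \beta \nabla_1 \ell(x, \lambda)
                  = \bigl(I - \beta (X^\top X + \lambda I)\bigr)\, x + \beta X^\top z, \\
\|\Phi(x, \lambda) - \Phi(x', \lambda)\|_2
  &\le \bigl\|I - \beta (X^\top X + \lambda I)\bigr\|_2 \, \|x - x'\|_2
   \le \max\bigl\{\, |1 - \beta \lambda|,\; |1 - \beta (L + \lambda)| \,\bigr\}\, \|x - x'\|_2.
\end{align*}
```

The right-hand factor is strictly below $1$ whenever $0 < \beta < 2/(L + \lambda)$, which is the precise sense in which a "sufficiently small" step size makes $\Phi(\cdot, \lambda)$ a contraction.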
The Bilevel Framework

$\min_{\lambda \in \Lambda \subseteq \mathbb{R}^m} f(\lambda) := F(x(\lambda), \lambda)$  (upper-level)
$x(\lambda) = \Phi(x(\lambda), \lambda)$  (lower-level)

• $f$ is usually non-convex and expensive or impossible to evaluate exactly.
• $x(\lambda) \in \mathbb{R}^d$ is often not available in closed form.
• $\nabla f$ is even harder to evaluate.
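For reference, both approximation schemes on the next slide start from the exact hypergradient obtained by implicitly differentiating the fixed point equation. A standard derivation, written in the notation above and assuming $\Phi$ and $F$ are differentiable and $I - \partial_1 \Phi(x(\lambda), \lambda)$ is invertible:

```latex
% Exact hypergradient via implicit differentiation of x(lambda) = Phi(x(lambda), lambda).
\begin{align*}
x'(\lambda) &= \partial_1 \Phi(x(\lambda), \lambda)\, x'(\lambda) + \partial_2 \Phi(x(\lambda), \lambda)
  \;\Longrightarrow\;
  x'(\lambda) = \bigl(I - \partial_1 \Phi(x(\lambda), \lambda)\bigr)^{-1} \partial_2 \Phi(x(\lambda), \lambda), \\
\nabla f(\lambda) &= \nabla_2 F(x(\lambda), \lambda)
  + \partial_2 \Phi(x(\lambda), \lambda)^\top
    \bigl(I - \partial_1 \Phi(x(\lambda), \lambda)^\top\bigr)^{-1}
    \nabla_1 F(x(\lambda), \lambda).
\end{align*}
```

ITD instead differentiates through the $t$ iterates that approximate $x(\lambda)$, while AID plugs the approximation $x_t(\lambda)$ into this formula and solves the resulting linear system only approximately.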
How to Compute the Hypergradient $\nabla f(\lambda)$?

Iterative Differentiation (ITD):
1. Set $x_0(\lambda) = 0$ and compute, for $i = 1, 2, \dots, t$,
   $x_i(\lambda) = \Phi(x_{i-1}(\lambda), \lambda)$.
2. Compute $f_t(\lambda) = F(x_t(\lambda), \lambda)$.
3. Compute $\nabla f_t(\lambda)$ efficiently using reverse (RMAD) or forward (FMAD) mode automatic differentiation.

Approximate Implicit Differentiation (AID):
1. Get $x_t(\lambda)$ with $t$ steps of a lower-level solver.
2. Compute $w_{t,k}(\lambda)$ with $k$ steps of a solver for the linear system
   $\bigl(I - \partial_1 \Phi(x_t(\lambda), \lambda)^\top\bigr)\, w = \nabla_1 F(x_t(\lambda), \lambda)$.
3. Compute the approximate gradient as
   $\hat{\nabla} f(\lambda) := \nabla_2 F(x_t(\lambda), \lambda) + \partial_2 \Phi(x_t(\lambda), \lambda)^\top w_{t,k}(\lambda)$.

Which one is the best?
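Below is a minimal JAX sketch of both procedures for the ridge regression example from a few slides back. It is an illustration under assumptions, not the authors' code: the toy data, the helper names (`inner_loss`, `outer_loss`, `Phi`, `itd_hypergrad`, `aid_hypergrad`) and the step size choice are made up for this sketch, and AID is solved here with $k$ fixed-point iterations on the linear system, which relies on $\Phi(\cdot, \lambda)$ being a contraction.

```python
import jax
import jax.numpy as jnp

# Toy ridge regression data (illustrative only).
key = jax.random.PRNGKey(0)
k1, k2 = jax.random.split(key)
X, X_val = jax.random.normal(k1, (80, 10)), jax.random.normal(k2, (40, 10))
x_true = jnp.linspace(-1.0, 1.0, 10)
z, z_val = X @ x_true, X_val @ x_true

def inner_loss(x, lam):        # l(x, lam) = 1/2 ||X x - z||^2 + lam/2 ||x||^2
    return 0.5 * jnp.sum((X @ x - z) ** 2) + 0.5 * lam * jnp.sum(x ** 2)

def outer_loss(x, lam):        # F(x, lam) = 1/2 ||X_val x - z_val||^2
    return 0.5 * jnp.sum((X_val @ x - z_val) ** 2)

L = jnp.linalg.eigvalsh(X.T @ X)[-1]   # largest eigenvalue of X^T X
beta = 1.0 / (L + 1.0)                 # step size small enough for Phi to be a contraction

def Phi(x, lam):               # one step of gradient descent on the inner loss
    return x - beta * jax.grad(inner_loss, argnums=0)(x, lam)

def itd_hypergrad(lam, t):
    """ITD: reverse-mode differentiation of f_t(lam) = F(x_t(lam), lam) through the t iterations."""
    def f_t(lam):
        x = jnp.zeros(X.shape[1])          # x_0(lam) = 0
        for _ in range(t):                 # x_i(lam) = Phi(x_{i-1}(lam), lam)
            x = Phi(x, lam)
        return outer_loss(x, lam)
    return jax.grad(f_t)(lam)

def aid_hypergrad(lam, t, k):
    """AID: solve (I - d1Phi^T) w = grad_1 F with k fixed-point steps, then assemble the gradient."""
    x = jnp.zeros(X.shape[1])
    for _ in range(t):                     # t steps of the lower-level solver
        x = Phi(x, lam)
    g1, g2 = jax.grad(outer_loss, argnums=(0, 1))(x, lam)
    _, phi_vjp = jax.vjp(Phi, x, lam)      # phi_vjp(w) = (d1Phi^T w, d2Phi^T w)
    w = jnp.zeros_like(x)
    for _ in range(k):                     # fixed-point iteration: w <- d1Phi^T w + grad_1 F
        w = phi_vjp(w)[0] + g1
    return g2 + phi_vjp(w)[1]              # grad_2 F + d2Phi^T w_{t,k}

lam0 = 1.0
print(itd_hypergrad(lam0, t=200), aid_hypergrad(lam0, t=200, k=200))
```

The FP and CG curves in the experimental comparison correspond to AID with a fixed-point solver and with conjugate gradient for this linear system, respectively; the sketch above implements the fixed-point variant.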
A First Comparison

ITD (recall $f_t(\lambda) = F(x_t(\lambda), \lambda)$):
• Ignores the bilevel structure.
• Cost in time (RMAD): $O(\mathrm{Cost}(f_t(\lambda)))$.
• Cost in memory (RMAD): $O(t d)$.
• Can we control $\|\nabla f_t(\lambda) - \nabla f(\lambda)\|$?

AID:
• Can use any lower-level solver.
• Cost in time ($k = t$): $O(\mathrm{Cost}(f_t(\lambda)))$.
• Cost in memory: $O(d)$.
• Can we control $\|\hat{\nabla} f(\lambda) - \nabla f(\lambda)\|$?
Previous Work on the Approximation Error

ITD (with $f_t(\lambda) = F(x_t(\lambda), \lambda)$):
• (Franceschi et al., 2018): $\arg\min f_t \to \arg\min f$ as $t \to \infty$.
• We provide non-asymptotic upper bounds on $\|\nabla f_t(\lambda) - \nabla f(\lambda)\|$.

AID:
• (Pedregosa, 2016): $\|\hat{\nabla} f(\lambda) - \nabla f(\lambda)\| \to 0$ as $t, k \to \infty$.
• (Rajeswaran et al., 2019): $\|\hat{\nabla} f(\lambda) - \nabla f(\lambda)\| \to 0$ at a linear rate in $t$ and $k$ for meta-learning with biased regularization.
• We provide non-asymptotic upper bounds on $\|\hat{\nabla} f(\lambda) - \nabla f(\lambda)\|$.