Learning near-optimal hyperparameters with minimal overhead




  1. Learning near-optimal hyperparameters with minimal overhead
     Gellért Weisz, András György, Csaba Szepesvári
     Workshop on Automated Algorithm Design (TTIC 2019), August 7, 2019

  2. Introduction
     Problem: find good parameter settings (configurations) for general-purpose solvers.
     ◮ No structure is assumed over the parameter space.
     Zillions of practical algorithms ⇔ little theory.
     We want theoretical guarantees on the runtime of
     ◮ the chosen configuration; and
     ◮ the configuration process.
     Goal: find a near-optimal configuration solving a 1 − δ fraction of the problems in the least expected time.
     ◮ Some instances (a δ fraction) are hopelessly hard; we don't want to solve those.

  3. Problem formulation
     Given: n configurations and a distribution Γ over problem instances.
     R_δ(i): the expected δ-capped runtime of configuration i, i.e. its expected runtime when each run is capped at the timeout whose tail probability is δ (the (1 − δ)-quantile of i's runtime distribution).
     [Figure: pdf of the runtime of configuration i, with the expected capped runtime R_δ(i) and the tail probability δ beyond the cap marked.]
     Runtime of the optimal capped configuration: OPT_δ = min_i R_δ(i).
     Configuration i is (ε, δ)-optimal if R_δ(i) ≤ (1 + ε) · OPT_{δ/2}.
     Note that OPT_δ ≤ OPT_{δ/2} ≤ OPT_0, and the gaps can be large!

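To make the capped-runtime quantities concrete, here is a minimal sketch (in Python with NumPy; not from the talk's codebase, and the function names, the runtime matrix, and the lognormal stand-in distribution are illustrative assumptions) of how R_δ(i) and OPT_δ could be estimated from sampled runtimes:

```python
import numpy as np

def capped_mean_runtime(runtimes, delta):
    """Empirical estimate of R_delta(i): the mean runtime when each run is
    capped at the (1 - delta)-quantile of i's runtime samples, i.e. the
    timeout with tail probability delta."""
    tau = np.quantile(runtimes, 1.0 - delta)
    return float(np.minimum(runtimes, tau).mean())

def opt_delta(runtime_matrix, delta):
    """OPT_delta = min_i R_delta(i), one row of runtime samples per configuration."""
    return min(capped_mean_runtime(row, delta) for row in runtime_matrix)

# Illustrative check of OPT_delta <= OPT_{delta/2} <= OPT_0 on heavy-tailed runtimes:
rng = np.random.default_rng(0)
runtimes = rng.lognormal(mean=0.0, sigma=2.0, size=(5, 10_000))
print(opt_delta(runtimes, 0.2), opt_delta(runtimes, 0.1), opt_delta(runtimes, 0.0))
```

Capping at a lower quantile can only shrink each configuration's mean, which is why the three printed values come out in increasing order.
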
  4. Previous work (before ICML ’19)

  5. Structured Procrastination (Kleinberg et al., 2017)
     Relaxed goal: find i with R_δ(i) ≤ (1 + ε) · OPT_0.
     Worst-case lower bound: the runtime must be at least Ω( n · OPT_0 / (ε² δ) ).
     With probability 1 − ζ, returns an (ε, δ)-optimal configuration in worst-case time
         O( (n · OPT_0 / (ε² δ)) · log( n · log(κ̄) / (ζ ε² δ) ) )
     ◮ κ̄: an absolute upper bound on runtimes.
     Can we remove κ̄? Can we improve the runtime when the problem is easier?

  6. LeapsAndBounds (Weisz et al., 2018)
     1. Guess a value θ for OPT, starting from a low value.
     2. Test whether R_δ(i) ≤ θ for some configuration i:
        ◮ for each i, run b = Õ(1/(δ ε²)) instances with instance-wise timeout τ = 4θ / (3δ), aborting i if its empirical average exceeds θ.
     3. Return the configuration with the smallest mean among the successful configurations; if no test succeeded, double θ and continue from Step 2.
     [Figure: average runtime budget and its use across different configurations and phases.]

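A minimal sketch of this phase structure (Python; not the authors' implementation). The callables run(i, j, timeout) and sample_instance(), and the constant absorbed into b = Õ(1/(δ ε²)), are assumptions for illustration:

```python
import math

def leaps_and_bounds(run, n_configs, sample_instance, eps, delta, theta0=1.0):
    """Sketch of the guess-and-test loop from the slide. `run(i, j, timeout)`
    returns the runtime of configuration i on instance j, truncated at
    `timeout`; `sample_instance()` draws an instance j ~ Gamma."""
    b = math.ceil(3.0 / (delta * eps ** 2))  # b = O~(1/(delta eps^2)); log factors elided
    theta = theta0
    while True:
        tau = 4.0 * theta / (3.0 * delta)    # instance-wise timeout for this phase
        means = {}
        for i in range(n_configs):
            total = 0.0
            for _ in range(b):
                total += run(i, sample_instance(), tau)
                if total > theta * b:        # running total exceeds the phase budget,
                    break                    # so the average must exceed theta: abandon i
            else:
                means[i] = total / b         # all b runs completed within budget
        if means:                            # some configuration passed the test
            return min(means, key=means.get) # smallest empirical mean wins
        theta *= 2.0                         # no success: double the guess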
  7. Why does this work?
     With high probability, for any configuration i:
     ◮ if the runs complete within θ average runtime:
        (i) τ is above the δ-quantile for configuration i (by Markov's inequality, P(X ≥ τ) ≤ E[X ∧ τ]/τ ≤ θ/τ = 3δ/4 < δ);
        (ii) the empirical mean R̄_i is ε-close to R^τ(i) = E[X(i, J) ∧ τ], J ∼ Γ;
     ◮ otherwise, R_δ(i) > θ, hence i can safely be abandoned for this phase.
     Thus, if R̄_i < θ for some configuration i, then for i* = argmin_i R̄_i, R_δ(i*) ≤ (1 + ε) · OPT_0 w.h.p.

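The Markov step in bullet (i) can be sanity-checked numerically: whenever the capped average runtime stays below θ, the timeout τ = 4θ/(3δ) must sit above the δ-quantile. A small simulation (Python; the exponential runtime distribution is an arbitrary stand-in, not data from the talk):

```python
import numpy as np

rng = np.random.default_rng(0)
theta, delta = 1.0, 0.2
tau = 4.0 * theta / (3.0 * delta)            # timeout used in the test

# Stand-in heavy-ish runtime distribution whose capped mean is below theta.
runtimes = rng.exponential(scale=0.9, size=100_000)

capped_mean = np.minimum(runtimes, tau).mean()   # estimates E[X ^ tau]
tail = (runtimes >= tau).mean()                  # estimates P(X >= tau)

# Markov: P(X >= tau) <= E[X ^ tau] / tau <= theta / tau = 3*delta/4 < delta.
assert tail <= capped_mean / tau <= 3.0 * delta / 4.0
print(f"tail={tail:.4f}  capped_mean/tau={capped_mean / tau:.4f}  3*delta/4={3 * delta / 4}")
```
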
  8. Guarantees
     Theorem. With high probability,
     (i) the algorithm finds an (ε, δ)-optimal configuration;
     (ii) the worst-case runtime is O( (n · OPT_0 / (ε² δ)) · log( n · log(OPT_0) / ζ ) ).
     Improvement: empirical Bernstein stopping. Stop testing a configuration i when the confidence intervals already indicate that (a) i is not optimal with the given timeout; or (b) i is already estimated with ε accuracy.
     Runtime:
         O( OPT_0 · Σ_{i=1}^{n} max{ σ²_{i,k} / (ε² R²_{τ_k}(i)), 1/(ε δ), 1/(ε R_{τ_k}(i)) } · ( log( n · log(OPT_0) / ζ ) + (1/δ) · log(1/δ) ) )
     Huge improvement if the variances are small: σ²_{i,k} / R²_{τ_k} ≪ 1/δ.

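For stopping rules (a) and (b), an empirical Bernstein bound gives a variance-sensitive confidence radius for the capped mean; the sketch below uses the generic form from Audibert et al. (2009), which may differ in constants from what the paper uses, and the helper name is illustrative:

```python
import math

def empirical_bernstein_radius(samples, zeta, sample_range):
    """Confidence radius for the mean of `samples`, which lie in an interval
    of width `sample_range` (here [0, tau], since runtimes are capped at tau).
    With probability >= 1 - zeta, |empirical mean - true mean| <= radius."""
    b = len(samples)
    mean = sum(samples) / b
    var = sum((x - mean) ** 2 for x in samples) / b   # empirical variance
    log_term = math.log(3.0 / zeta)
    return math.sqrt(2.0 * var * log_term / b) + 3.0 * sample_range * log_term / b
```

Testing a configuration then stops once its lower confidence bound exceeds the best upper bound (rule (a)) or the radius drops below ε times its empirical mean (rule (b)); when the variance σ²_{i,k} is small, the √(var/b) term shrinks quickly, which is where the improvement in the runtime bound above comes from.
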
  9. Experiments
     Configuring the minisat SAT solver (Sörensson and Eén, 2005).
     1K configurations, 20K nontrivial problem instances.
     Comparison with Structured Procrastination by Kleinberg et al. (2017).
     Code and data (83 CPU-years' worth, on year-2018 CPUs): https://github.com/deepmind/leaps-and-bounds
     [Figure: mean runtime below the δ-quantile (s), log scale, for the 1000 configurations sorted by that mean, for δ ∈ {0, 0.05, 0.1, 0.2, 0.3, 0.5}.]

  10. Results
      ε = 0.2, δ = 0.2, ζ = 0.1.
      Instead of doubling, θ := 1.25 · θ is used.
      Runs can be stopped and resumed (i.e., 'continue' running on an instance).
      [Figure: total time spent running each configuration (s), log scale, for LeapsAndBounds vs. Structured Procrastination, configurations sorted by mean runtime below the 0.2-quantile.]
