
Safe Grid Search with Optimal Complexity - Joseph Salmon (presentation slides)


  1. Safe Grid Search with Optimal Complexity
     Joseph Salmon, http://josephsalmon.eu, IMAG, Univ Montpellier, CNRS, Montpellier, France
     Joint work with: E. Ndiaye (RIKEN, Nagoya), T. Le (RIKEN, Tokyo), O. Fercoq (Institut Polytechnique de Paris), I. Takeuchi (Nagoya Institute of Technology)

  2. Simplest model: standard sparse regression
     • $y \in \mathbb{R}^n$: a signal
     • $X = [x_1, \dots, x_p] \in \mathbb{R}^{n \times p}$: dictionary of atoms/features
     Assumption: the signal is well approximated by a sparse combination $\beta^* \in \mathbb{R}^p$:
     $$y \approx X\beta^* = \sum_{j=1}^{p} \beta^*_j x_j$$
     Objective(s): find $\hat{\beta}$ such that
     • Estimation: $\hat{\beta} \approx \beta^*$
     • Prediction: $X\hat{\beta} \approx X\beta^*$
     • Support recovery: $\operatorname{supp}(\hat{\beta}) \approx \operatorname{supp}(\beta^*)$
     Constraints: large $p$, sparse $\beta^*$
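A minimal NumPy sketch of this generative model; the dimensions, sparsity level, and noise scale below are illustrative assumptions, not values from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, k = 50, 200, 5                    # illustrative sizes: large p, sparse beta*

X = rng.standard_normal((n, p))         # dictionary of atoms/features x_1, ..., x_p
beta_star = np.zeros(p)
support = rng.choice(p, size=k, replace=False)
beta_star[support] = rng.standard_normal(k)

y = X @ beta_star + 0.1 * rng.standard_normal(n)   # y ~ X beta*: noisy sparse combination
```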

  3. The $\ell_1$ penalty: Lasso and variants
     Vocabulary: the "Modern least squares", Candès et al. (2008)
     • Statistics: Lasso, Tibshirani (1996)
     • Signal processing variant: Basis Pursuit, Chen et al. (1998)
     $$\hat{\beta}^{(\lambda)} \in \arg\min_{\beta \in \mathbb{R}^p} \underbrace{\tfrac{1}{2}\|y - X\beta\|_2^2}_{\text{data fitting term}} + \underbrace{\lambda\|\beta\|_1}_{\text{sparsity-inducing penalty}}$$
     • Solutions are sparse (sparsity level controlled by $\lambda$)
     • Need to tune/choose $\lambda$ (standard is cross-validation)
     • Theoretical guarantees, Bickel et al. (2009)
     • Refinements: non-convex approaches, Adaptive Lasso, Zou (2006), scale invariance (scaled Lasso), Sun and Zhang (2012), etc.
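A hedged sketch of solving one such problem with scikit-learn; the diabetes data and the value of $\lambda$ are only illustrative. Note that scikit-learn's Lasso uses the objective $\frac{1}{2n}\|y - X\beta\|_2^2 + \alpha\|\beta\|_1$, so $\alpha = \lambda / n$ matches the slide's formulation:

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Lasso

X, y = load_diabetes(return_X_y=True)
n = X.shape[0]

# scikit-learn minimizes (1/(2n)) * ||y - X b||_2^2 + alpha * ||b||_1,
# so alpha = lam / n corresponds to (1/2) * ||y - X b||_2^2 + lam * ||b||_1.
lam = 50.0                              # illustrative value, to be tuned
beta_hat = Lasso(alpha=lam / n, fit_intercept=False).fit(X, y).coef_
print("non-zero coefficients:", np.count_nonzero(beta_hat), "out of", X.shape[1])
```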

  8. Well... many Lassos are needed
     $$\hat{\beta}^{(\lambda)} \in \arg\min_{\beta \in \mathbb{R}^p} \tfrac{1}{2}\|y - X\beta\|_2^2 + \lambda\|\beta\|_1$$
     In practice:
     Step 1: compute $T$ solutions on a grid, i.e., compute $\beta^{(\lambda_0)}, \dots, \beta^{(\lambda_{T-1})}$ approximating $\hat{\beta}^{(\lambda_0)}, \dots, \hat{\beta}^{(\lambda_{T-1})}$, for some $\lambda_0 > \dots > \lambda_{T-1}$
     Step 2: pick the "best" parameter
     Questions:
     • performance criterion: how to pick a "best" $\lambda$? (cross-validation and variants, SURE (Stein Unbiased Risk Estimation), etc.)
     • grid choice: how to design the grid itself?
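For Step 1, a sketch with scikit-learn's `lasso_path`, which solves the problems from the largest to the smallest regularization value with warm starts; its defaults (`n_alphas=100`, `eps=1e-3`) mirror the $T = 100$, $\delta = 3$ grid of the next slide, up to the $\alpha = \lambda/n$ rescaling. The dataset is only for illustration:

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import lasso_path

X, y = load_diabetes(return_X_y=True)

# Step 1: T approximate solutions beta^(lambda_0), ..., beta^(lambda_{T-1}) on a
# decreasing grid; lasso_path builds a geometric grid from eps = lambda_min/lambda_max
# and n_alphas = T, and reuses each solution to warm-start the next problem.
alphas, coefs, _ = lasso_path(X, y, eps=1e-3, n_alphas=100)
print(coefs.shape)   # (p, T): one coefficient vector per grid point
```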

  9. In practice: who does what?
     Standard grid (R-glmnet / Python-sklearn): geometric grid
     • $\lambda_0 = \lambda_{\max} := \|X^\top y\|_\infty = \max_{j=1,\dots,p} |\langle x_j, y\rangle|$ (critical value)
     • $\lambda_t = \lambda_{\max} \times 10^{-\delta t/(T-1)}$, with $T = 100$ and $\delta = 3$
     • $\lambda_{T-1} = \lambda_{\max}/10^{3} =: \lambda_{\min}$
     Parameter choice:
     • Python-sklearn: vanilla 5-fold cross-validation, keep the $\lambda$ with the smallest mean squared error (averaged over folds)
     • R-glmnet: vanilla 10-fold cross-validation, keep the largest $\lambda$ whose error is smaller than the minimum mean squared error (averaged over folds) + 1 standard deviation
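The standard geometric grid takes a few lines of NumPy (a sketch; the dataset only provides a concrete $X$ and $y$):

```python
import numpy as np
from sklearn.datasets import load_diabetes

X, y = load_diabetes(return_X_y=True)

T, delta = 100, 3
lambda_max = np.max(np.abs(X.T @ y))                       # ||X^T y||_inf, the critical value
lambdas = lambda_max * 10.0 ** (-delta * np.arange(T) / (T - 1))
# lambdas[0] == lambda_max and lambdas[-1] == lambda_max / 10**delta == lambda_min
```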

  10. Hold-out cross-validation
     From now on: hold-out cross-validation (one single split)
     Standard choice: 80% train ($n_{\text{train}}$), 20% test ($n_{\text{test}}$)
     • $X = X_{\text{train}} \cup X_{\text{test}}$
     • $y = y_{\text{train}} \cup y_{\text{test}}$
     • Evaluate the error on the test (validation) set:
     $$E_{\text{test}}(\hat{\beta}^{(\lambda)}) = L(y_{\text{test}}, X_{\text{test}}\hat{\beta}^{(\lambda)}) := \|y_{\text{test}} - X_{\text{test}}\hat{\beta}^{(\lambda)}\| \quad \text{or} \quad \|y_{\text{test}} - X_{\text{test}}\hat{\beta}^{(\lambda)}\|^2$$
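A sketch of this hold-out protocol with scikit-learn; the dataset, the random seed, and the value of $\lambda$ are illustrative assumptions:

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True)

# one single split: 80% train, 20% test (validation)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

lam = 50.0                                                # illustrative value
n_train = X_train.shape[0]
beta_hat = Lasso(alpha=lam / n_train, fit_intercept=False).fit(X_train, y_train).coef_
error = np.linalg.norm(y_test - X_test @ beta_hat) ** 2   # E_test(beta_hat^(lambda))
print(error)
```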

  11. Some practical examples
     • leukemia (1): $n = 72$, $p = 7129$ (gene expression); $y$: (binary) measure of disease
     • diabetes (2): $n = 442$, $p = 10$ (age, sex, body mass index, average blood pressure, S1, S2, S3, S4, S5, S6); $y$: a quantitative measure of disease progression one year after baseline
     (1) https://sklearn.org/modules/generated/sklearn.datasets.fetch_mldata.html
     (2) https://scikit-learn.org/stable/datasets/index.html#diabetes-dataset
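Loading the diabetes data is a one-liner in scikit-learn. The `fetch_mldata` helper referenced in footnote (1) has since been removed from scikit-learn, so the commented lines below are only a hedged suggestion assuming an OpenML copy of the leukemia data exists under that name:

```python
from sklearn.datasets import load_diabetes

# diabetes ships with scikit-learn
X, y = load_diabetes(return_X_y=True)
print(X.shape)   # (442, 10)

# hypothetical alternative for leukemia, assuming the OpenML mirror is available:
# from sklearn.datasets import fetch_openml
# leukemia = fetch_openml("leukemia", version=1, as_frame=False)
```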

  12. Example: Training / Testing (leukemia)
     [Figure: two panels over the grid $\lambda \in [\lambda_{\min}, \lambda_{\max}]$. Training panel: normalized objective $P_\lambda(\beta)/P_\lambda(0)$, with the exact curve $P_\lambda(\hat{\beta}^{(\lambda)})$, the exact shifted curve $P_\lambda(\hat{\beta}^{(\lambda)}) + \epsilon$, and the approximated curve $P_\lambda(\beta^{(\lambda)})$. Testing panel: validation error $\|y_{\text{test}} - X_{\text{test}}\hat{\beta}^{(\lambda)}\|_2 / \|y_{\text{test}}\|_2$ for the exact and approximated solutions.]

  14. Example: Training / Testing (diabetes)
     [Figure: same layout as the leukemia example, over $\lambda \in [\lambda_{\min}, \lambda_{\max}]$: training panel with $P_\lambda(\beta)/P_\lambda(0)$ for the exact, exact shifted ($+\epsilon$), and approximated solutions; testing panel with $\|y_{\text{test}} - X_{\text{test}}\hat{\beta}^{(\lambda)}\|_2 / \|y_{\text{test}}\|_2$ for the exact and approximated solutions.]

  16. Hyperparameter tuning
     • Learning task: $\hat{\beta}^{(\lambda)} \in \arg\min_{\beta \in \mathbb{R}^p} \underbrace{f(X_{\text{train}}\beta)}_{\text{e.g., } \frac{1}{2}\|X_{\text{train}}\beta - y_{\text{train}}\|^2} + \lambda \underbrace{\Omega(\beta)}_{\text{e.g., } \|\beta\|_1}$
     • Evaluation: $E_{\text{test}}(\hat{\beta}^{(\lambda)}) = L(y_{\text{test}}, X_{\text{test}}\hat{\beta}^{(\lambda)})$
     [Figure: validation curves at machine precision, $\|y_{\text{test}} - X_{\text{test}}\beta^{(\lambda)}\|_2$ as a function of the regularization hyperparameter $\lambda \in [\lambda_{\min}, \lambda_{\max}]$ (two panels).]
     How to choose the grid of hyperparameters?

  18. Hyperparameter tuning as bilevel optimization
     The "optimal" hyperparameter is given by
     $$\hat{\lambda} \in \arg\min_{\lambda \in [\lambda_{\min}, \lambda_{\max}]} E_{\text{test}}(\hat{\beta}^{(\lambda)}) = L(y_{\text{test}}, X_{\text{test}}\hat{\beta}^{(\lambda)})$$
     $$\text{s.t. } \hat{\beta}^{(\lambda)} \in \arg\min_{\beta \in \mathbb{R}^p} f(X_{\text{train}}\beta) + \lambda\,\Omega(\beta)$$
     Challenges:
     • non-smooth and non-convex objective function
     • costly to evaluate $E_{\text{test}}(\hat{\beta}^{(\lambda)})$ (e.g., dense/continuous grid)
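In practice the outer problem is usually approximated on a finite grid. Below is a sketch of that plain grid approximation of the bilevel problem (not the safe/adaptive strategy of the talk); the dataset, split, and grid settings are illustrative assumptions:

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import lasso_path
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# inner problem: (approximate) solutions on a grid of candidate hyperparameters
alphas, coefs, _ = lasso_path(X_tr, y_tr, eps=1e-3, n_alphas=100)

# outer problem, restricted to the grid: keep the value with the smallest test error
errors = np.linalg.norm(y_te[:, None] - X_te @ coefs, axis=0) ** 2
best = np.argmin(errors)
lambda_hat = alphas[best] * X_tr.shape[0]   # back to the slides' scaling (lambda = n * alpha)
print(lambda_hat, errors[best])
```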
