Safe Grid Search with Optimal Complexity
E. Ndiaye, Riken AIP
Joint work with: T. Le, O. Fercoq, J. Salmon, I. Takeuchi
1 / 7
Hyperparameter Tuning

Learning task: $\hat\beta^{(\lambda)} \in \arg\min_{\beta \in \mathbb{R}^p} f(X_{\mathrm{train}}\beta) + \lambda \Omega(\beta)$

Evaluation: $E_v(\hat\beta^{(\lambda)}) = \mathcal{L}(y_{\mathrm{test}}, X_{\mathrm{test}}\hat\beta^{(\lambda)})$

[Figure: two validation curves $\|y_{\mathrm{test}} - X_{\mathrm{test}}\hat\beta^{(\lambda)}\|^2$ at machine precision, plotted over the regularization hyperparameter $\lambda \in [\lambda_{\min}, \lambda_{\max}]$.]

How to approximate the best hyperparameter?
2 / 7
Hyperparameter Tuning

The optimal hyperparameter solves the bilevel problem
$$\arg\min_{\lambda \in [\lambda_{\min}, \lambda_{\max}]} E_v(\hat\beta^{(\lambda)}) = \mathcal{L}(y_{\mathrm{test}}, X_{\mathrm{test}}\hat\beta^{(\lambda)}) \quad \text{s.t.} \quad \hat\beta^{(\lambda)} \in \arg\min_{\beta \in \mathbb{R}^p} f(X_{\mathrm{train}}\beta) + \lambda \Omega(\beta).$$

Issues:
- The objective $\lambda \mapsto E_v(\hat\beta^{(\lambda)})$ is non-smooth and non-convex.
- It is often impractical to evaluate $E_v(\hat\beta^{(\lambda)})$ exactly.
3 / 7
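Since the validation objective is non-smooth and non-convex in $\lambda$, the standard workaround is to evaluate it on a finite grid. A minimal sketch with scikit-learn's Lasso as the training task (the synthetic data, grid size, and $\lambda$-range are illustrative assumptions, not from the slides):

```python
import numpy as np
from sklearn.linear_model import Lasso

# Synthetic data: sparse ground truth plus noise (illustrative only).
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 20))
beta_true = np.zeros(20)
beta_true[:3] = [1.0, -2.0, 1.5]
y = X @ beta_true + 0.1 * rng.standard_normal(100)
X_train, X_test = X[:70], X[70:]
y_train, y_test = y[:70], y[70:]

# lambda_max: (roughly) the smallest lambda whose Lasso solution is all-zero.
# sklearn's Lasso minimizes (1/2n)||y - Xb||^2 + alpha*||b||_1.
n = X_train.shape[0]
lam_max = np.max(np.abs(X_train.T @ (y_train - y_train.mean()))) / n
grid = np.geomspace(lam_max, lam_max * 1e-3, 30)

# Validation curve: E_v(beta_hat(lambda)) on the held-out set.
val_errors = [
    np.mean((y_test - Lasso(alpha=lam).fit(X_train, y_train).predict(X_test)) ** 2)
    for lam in grid
]
best_lam = grid[int(np.argmin(val_errors))]
```

The grid is geometric between $\lambda_{\max}$ and a small fraction of it, which is the usual convention for sparsity-inducing penalties; the question the talk addresses is how coarse such a grid can be while keeping a guarantee on the validation error.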
Tracking the curve of solutions

$$\hat\beta^{(\lambda)} \in \arg\min_{\beta \in \mathbb{R}^p} f(X\beta) + \lambda \Omega(\beta)$$

Exact path: for $(f, \Omega)$ = (piecewise quadratic, piecewise linear), the function $\lambda \mapsto \hat\beta^{(\lambda)}$ is piecewise linear (Lars¹ algorithm).

Drawbacks:
- Exponential² worst-case complexity for the Lasso: up to $(3^p + 1)/2$ linear segments.
- Numerical instabilities.
- Hard to generalize to other losses and regularizers.
- Cannot benefit from early stopping rules³.

¹ (Efron et al., 2004) ² (Mairal and Yu, 2012) ³ (Bousquet and Bottou, 2008)
4 / 7
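The piecewise-linear structure can be checked numerically with scikit-learn's `lars_path`, which returns exactly the kinks of the Lasso path; between two consecutive kinks the solution is the linear interpolation of the endpoint solutions (the synthetic data below is an illustrative assumption):

```python
import numpy as np
from sklearn.linear_model import Lasso, lars_path

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 10))
y = X[:, 0] - 2.0 * X[:, 1] + 0.05 * rng.standard_normal(50)

# alphas: lambda values at the kinks (decreasing); coefs[:, k] is the
# exact solution at alphas[k]; the path is linear between kinks.
alphas, active, coefs = lars_path(X, y, method="lasso")

# Check piecewise linearity on the first segment: the solution at the
# midpoint should match the average of the two endpoint solutions.
mid = 0.5 * (alphas[0] + alphas[1])
beta_mid = (
    Lasso(alpha=mid, fit_intercept=False, tol=1e-10, max_iter=100_000)
    .fit(X, y)
    .coef_
)
interp = 0.5 * (coefs[:, 0] + coefs[:, 1])
```

Note that `lars_path` uses the same $\lambda$-scaling as `Lasso` (the penalty parameter equals $\|X^\top r\|_\infty / n$ at the kinks), so the two solvers are directly comparable.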
Approximation of the solution path⁴

Training task: $\hat\beta^{(\lambda)} \in \arg\min_{\beta \in \mathbb{R}^p} f(X\beta) + \lambda \Omega(\beta) =: P_\lambda(\beta)$

Suboptimality gap: $P_\lambda(\beta^{(\lambda_t)}) - P_\lambda(\hat\beta^{(\lambda)}) \le Q_{t, \nabla f^*}\!\left(1 - \frac{\lambda}{\lambda_t}\right)$,

where $Q_{t, \nabla f^*}(\rho) :=$ optimization error at $\lambda_t$ + approximation error $(\lambda, \lambda_t)$.

[Figure: upper bound of the duality gap over $[\lambda_{\min}, \lambda_{\max}]$; a grid $\lambda_{\max} = \lambda_1 > \lambda_2 > \dots > \lambda_5 > \lambda_{\min}$ is chosen so that the bound stays below a target $\epsilon$.]

⁴ (Giesen et al., 2012)
5 / 7
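The idea can be sketched as a gap-driven grid: at each $\lambda_t$, run a solver only until the duality gap drops below the optimization-error budget, then move on. Below is a toy version for the Lasso with a fixed geometric grid and warm starts; the adaptive choice of the next grid point via the $Q$ bound is not reproduced here, and the solver, step size, and tolerances are illustrative assumptions:

```python
import numpy as np

def lasso_duality_gap(X, y, beta, lam):
    """Duality gap for P(b) = 0.5*||y - X b||^2 + lam*||b||_1."""
    r = y - X @ beta
    primal = 0.5 * r @ r + lam * np.sum(np.abs(beta))
    # Rescale the residual so theta is dual feasible: ||X^T theta||_inf <= lam.
    theta = r * min(1.0, lam / max(np.max(np.abs(X.T @ r)), 1e-12))
    dual = 0.5 * y @ y - 0.5 * np.sum((y - theta) ** 2)
    return primal - dual

def ista_until_gap(X, y, lam, beta0, eps, max_iter=10_000):
    """Proximal gradient (ISTA) with duality-gap early stopping."""
    L = np.linalg.norm(X, 2) ** 2  # Lipschitz constant of the smooth part
    beta = beta0.copy()
    for _ in range(max_iter):
        if lasso_duality_gap(X, y, beta, lam) <= eps:
            break
        z = beta - X.T @ (X @ beta - y) / L
        beta = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)  # soft-threshold
    return beta

rng = np.random.default_rng(0)
X = rng.standard_normal((60, 15))
y = X @ np.concatenate([[2.0, -1.0], np.zeros(13)]) + 0.1 * rng.standard_normal(60)

lam_max = np.max(np.abs(X.T @ y))  # above this, the solution is exactly 0
grid = np.geomspace(lam_max, lam_max / 100, 10)
eps = 1e-6
beta = np.zeros(15)
path = []
for lam in grid:
    beta = ista_until_gap(X, y, lam, beta, eps)  # warm start from previous lambda
    path.append(beta.copy())
```

The duality gap is the computable surrogate for the (unknown) suboptimality $P_\lambda(\beta) - P_\lambda(\hat\beta^{(\lambda)})$, which is what makes the early stopping certificate "safe".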
Bound the validation gap

$$\left| E_v(\hat\beta^{(\lambda)}) - E_v(\beta^{(\lambda_t)}) \right| \le \max_{\beta \in \mathcal{B}_\lambda} \mathcal{L}(X'\beta, X'\beta^{(\lambda_t)}),$$

where $\mathcal{B}_\lambda \ni \hat\beta^{(\lambda)}$ is a ball centered at $\beta^{(\lambda_t)}$ whose radius is given by the suboptimality gap on the training problem.

→ Approximate the validation path!

[Figure: validation curve $\|y' - X'\beta^{(\lambda)}\|^2$ at machine precision, bracketed within a tolerance $\epsilon_v$ between a high-precision approximation ($\delta_v / 10$) and a low-precision one ($\delta_v \times 10$).]
6 / 7
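For the squared validation loss, the maximum over the ball admits a simple closed-form upper bound: writing $\delta = \beta - \beta^{(\lambda_t)}$ with $\|\delta\|_2 \le r$ and expanding $\|y' - X'\beta\|^2$ gives $|E_v(\beta) - E_v(\beta^{(\lambda_t)})| \le 2\|X'\|_2\,\|y' - X'\beta^{(\lambda_t)}\|_2\, r + \|X'\|_2^2\, r^2$. A small numerical check of this bound (the radius and data are illustrative assumptions; in the talk the radius comes from the training suboptimality gap):

```python
import numpy as np

def val_gap_bound(Xv, yv, beta_t, r):
    """Bound on | ||yv - Xv b||^2 - ||yv - Xv beta_t||^2 | over ||b - beta_t||_2 <= r."""
    op = np.linalg.norm(Xv, 2)          # spectral norm of the validation design
    res = np.linalg.norm(yv - Xv @ beta_t)
    return 2.0 * op * res * r + (op * r) ** 2

rng = np.random.default_rng(0)
Xv = rng.standard_normal((30, 8))
yv = rng.standard_normal(30)
beta_t = rng.standard_normal(8)
r = 0.1                                  # stand-in for the gap-based radius

bound = val_gap_bound(Xv, yv, beta_t, r)

# Empirical check: random points in the ball never violate the bound.
base = np.sum((yv - Xv @ beta_t) ** 2)
worst = 0.0
for _ in range(1000):
    d = rng.standard_normal(8)
    d *= r * rng.random() / np.linalg.norm(d)  # random point with ||d|| <= r
    worst = max(worst, abs(np.sum((yv - Xv @ (beta_t + d)) ** 2) - base))
```

Since every quantity in the bound (operator norm, residual, radius) is computable from one approximate solution, the whole validation curve can be bracketed without ever solving the training problem to machine precision.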
Guarantee: the returned grid $\Lambda_{\mathrm{val}(\epsilon_v)}$ satisfies
$$\min_{\lambda_t \in \Lambda_{\mathrm{val}(\epsilon_v)}} E_v(\beta^{(\lambda_t)}) - \min_{\lambda \in [\lambda_{\min}, \lambda_{\max}]} E_v(\hat\beta^{(\lambda)}) \le \epsilon_v.$$

Code: https://github.com/EugeneNdiaye/safe_grid_search

Let's talk during the poster session ;-)
7 / 7