Iterative Convex Regularization


  1. Iterative Convex Regularization. Lorenzo Rosasco, Universita’ di Genova, Istituto Italiano di Tecnologia, Massachusetts Institute of Technology. Optimization and Statistical Learning Workshop, Les Houches, Montevideo, January 14. Ongoing work with S. Villa (IIT-MIT) and B.C. Vu (IIT-MIT).

  2. Early Stopping. Iterative Convex Regularization. Lorenzo Rosasco, Universita’ di Genova, Istituto Italiano di Tecnologia, Massachusetts Institute of Technology. Optimization and Statistical Learning Workshop, Les Houches, Montevideo, January 14. Ongoing work with S. Villa (IIT-MIT) and B.C. Vu (IIT-MIT).

  3. Plan. Optimization & Statistics/Estimation. • Part I: introduction to iterative regularization • Part II: iterative convex regularization: problem and results

  4. Linear Inverse Problems. Φ : H → G linear and bounded, Φw = y. Moore-Penrose solution: w† = arg min { R(w) : Φw = y }, with R strongly convex and lsc. Examples: *endless list here*
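As a minimal numerical illustration of the Moore-Penrose solution for the special case R(w) = ‖w‖² (the toy problem below is illustrative and not from the slides): for a consistent, underdetermined system Φw = y, the minimum-norm solution is returned by numpy's pseudoinverse.

```python
import numpy as np

rng = np.random.default_rng(0)

# Underdetermined linear system: Phi maps H = R^10 into G = R^5.
Phi = rng.standard_normal((5, 10))
w_true = rng.standard_normal(10)
y = Phi @ w_true                       # consistent data: y is in the range of Phi

# For R(w) = ||w||^2, the Moore-Penrose solution
#   w_dag = argmin { ||w||^2 : Phi w = y }
# is given by the pseudoinverse.
w_dag = np.linalg.pinv(Phi) @ y

print(np.allclose(Phi @ w_dag, y))                        # satisfies the constraint
print(np.linalg.norm(w_dag) <= np.linalg.norm(w_true))    # minimal norm among solutions
```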

  5. Data. Φw = y. Data Type I: ŷ with ‖y − ŷ‖ ≤ δ. Data Type II: Φ̂ : H → Ĝ and ŷ with ‖Φ*y − Φ̂*ŷ‖ ≤ δ and ‖Φ*Φ − Φ̂*Φ̂‖ ≤ η. • Data type I: deterministic/stochastic noise […] • Data type II: stochastic noise, statistical learning [R. et al. ’05]; also econometrics, discretized PDEs (?)

  6. Learning* as an Inverse Problem [De Vito et al. ’05]. Y_i = ⟨w†, X_i⟩ + N_i, i = 1, …, n. Can be shown to fit Data Type II with Φ̂*Φ̂ = (1/n) Σ_{i=1}^n X_i X_i^T, Φ*Φ = E[X X^T], Φ̂*ŷ = (1/n) Σ_{i=1}^n X_i Y_i, Φ*y = E[X Y], and δ, η ∼ 1/√n. Nonparametric extensions via RKHS theory: covariance operators become integral operators. *Random design regression
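The correspondence above is easy to mimic numerically. The sketch below (illustrative names and toy data, not from the talk) forms the empirical counterparts of Φ*Φ and Φ*y from n samples; their deviations from the population quantities play the role of η and δ and shrink roughly like 1/√n.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 10

w_dag = rng.standard_normal(d)
X = rng.standard_normal((n, d))                  # inputs X_i (here E[X X^T] = I)
Y = X @ w_dag + 0.1 * rng.standard_normal(n)     # Y_i = <w_dag, X_i> + N_i

# Empirical counterparts of the population operators:
#   Phi*Phi = E[X X^T]  ->  (1/n) sum_i X_i X_i^T
#   Phi*y   = E[X Y]    ->  (1/n) sum_i X_i Y_i
cov_hat = X.T @ X / n
xy_hat = X.T @ Y / n

# Both deviations shrink roughly like 1/sqrt(n): the (eta, delta) of data type II.
print(np.linalg.norm(cov_hat - np.eye(d), 2))    # eta-like term (E[X X^T] = I here)
print(np.linalg.norm(xy_hat - w_dag))            # delta-like term (E[X Y] = w_dag here)
```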

  7. Tikhonov Regularization. ŵ_λ = arg min_{w∈H} ‖Φ̂w − ŷ‖² + λ R(w), λ ≥ 0, and w_λ = arg min_{w∈H} ‖Φw − y‖² + λ R(w), with w† = arg min { R(w) : Φw = y }. The gap between ŵ_λ and w_λ is the variance, between w_λ and w† the bias. In practice one only computes ŵ_{t,λ}: new trade-offs (?), and what is the complexity of model selection?
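For the classical choice R(w) = ‖w‖² (ridge regression) the Tikhonov solution has the closed form (Φ̂*Φ̂ + λI)⁻¹Φ̂*ŷ. A hedged sketch, with an arbitrary toy problem and λ grid chosen only for illustration:

```python
import numpy as np

def tikhonov(Phi, y, lam):
    """Ridge solution (Phi^T Phi + lam I)^{-1} Phi^T y, i.e. R(w) = ||w||^2."""
    d = Phi.shape[1]
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(d), Phi.T @ y)

rng = np.random.default_rng(0)
Phi = rng.standard_normal((50, 20))
w_dag = rng.standard_normal(20)
y = Phi @ w_dag + 0.1 * rng.standard_normal(50)   # noisy data

# lambda trades bias for variance; model selection means solving the problem
# once per candidate lambda (one motivation for the "new trade-offs" question).
for lam in [1e-3, 1e-1, 1e1, 1e3]:
    print(lam, np.linalg.norm(tikhonov(Phi, y, lam) - w_dag))
```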

  8. From Tikhonov Regularization… to Landweber Regularization. For R(w) = ‖w‖²: w† = Φ†y ∼ (Φ*Φ + λI)⁻¹Φ*y ∼ Σ_{j=0}^t (I − Φ*Φ)^j Φ*y, and the truncated series is generated by the iteration w_{t+1} = w_t − Φ*(Φw_t − y); with noisy data, ŵ_{t+1} = ŵ_t − Φ̂*(Φ̂ŵ_t − ŷ).
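The identity between the iteration and the truncated series can be checked directly. The sketch below (an arbitrary toy operator, rescaled so that the unit-step iteration is stable; an assumption of this illustration, not stated in the talk) runs the Landweber/gradient iteration from w_0 = 0 and compares it with the corresponding truncated series.

```python
import numpy as np

rng = np.random.default_rng(0)
Phi = rng.standard_normal((30, 15))
Phi /= np.linalg.norm(Phi, 2)          # rescale so the unit-step iteration is stable
y = Phi @ rng.standard_normal(15)

# Landweber / gradient descent on ||Phi w - y||^2 with unit step and w_0 = 0:
#   w_{t+1} = w_t - Phi^T (Phi w_t - y)
t_max = 50
w = np.zeros(Phi.shape[1])
for _ in range(t_max):
    w = w - Phi.T @ (Phi @ w - y)

# The same iterate as the truncated series  sum_{j=0}^{t-1} (I - Phi^T Phi)^j Phi^T y
d = Phi.shape[1]
series, term = np.zeros(d), Phi.T @ y
for _ in range(t_max):
    series += term
    term = term - Phi.T @ (Phi @ term)   # multiply by (I - Phi^T Phi)

print(np.allclose(w, series))            # the iterate is a truncated spectral filter
```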

  9–12. Landweber Regularization, aka Gradient Descent. R(w) = ‖w‖². w† = Φ†y ∼ (Φ*Φ + λI)⁻¹Φ*y ∼ Σ_{j=0}^t (I − Φ*Φ)^j Φ*y. Iteration: w_{t+1} = w_t − Φ*(Φw_t − y); with noisy data, ŵ_{t+1} = ŵ_t − Φ̂*(Φ̂ŵ_t − ŷ). [Figure, repeated over slides 9–12: a one-dimensional regression fit (Y against X ∈ [0, 1]) evolving over the iterations.]

  13–14. Landweber Regularization, aka Gradient Descent: Semi-Convergence. R(w) = ‖w‖², iteration ŵ_{t+1} = ŵ_t − Φ̂*(Φ̂ŵ_t − ŷ), with ŵ† = Φ̂†ŷ and w† = Φ†y. [Figure: log-log plot against t of the empirical error ‖ŵ_t − ŵ†‖, which decreases monotonically, and the validation error ‖ŵ_t − w†‖, which first decreases and then increases: semi-convergence.]
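The semi-convergence behaviour sketched on slides 13–14 can be reproduced on a small synthetic problem. The example below (an ill-conditioned toy operator and noise level chosen for illustration, not the example from the talk) runs Landweber on noisy data and records ‖ŵ_t − w†‖; the minimum is typically attained at an early iteration, after which the error grows again.

```python
import numpy as np

rng = np.random.default_rng(0)
n = d = 40

# Build an ill-conditioned operator with rapidly decaying singular values.
U, _ = np.linalg.qr(rng.standard_normal((n, n)))
V, _ = np.linalg.qr(rng.standard_normal((d, d)))
s = 0.9 ** np.arange(d)
Phi = (U * s) @ V.T

w_dag = V[:, :5].sum(axis=1)                     # true solution, supported on the top modes
y = Phi @ w_dag + 0.01 * rng.standard_normal(n)  # data type I, noise level delta ~ 0.06

step = 1.0 / np.linalg.norm(Phi, 2) ** 2
w = np.zeros(d)
errors = []
for t in range(2000):
    w = w - step * Phi.T @ (Phi @ w - y)         # Landweber on the noisy data
    errors.append(np.linalg.norm(w - w_dag))

t_best = int(np.argmin(errors))
print("best stopping time:", t_best)
print("error at t_best vs final:", errors[t_best], errors[-1])
# Typically errors[-1] > errors[t_best]: iterating to convergence fits the noise.
```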

  15. Remarks I (R(w) = ‖w‖²). Data type I:
  • History: iteration + semi-convergence [Landweber ’50] … […Nemirovski ’86…]
  • Other iterative approaches, some with acceleration: nu-method/Chebyshev method [Brakhage ’87, Nemirovski, Polyak ’84], conjugate gradient [Nemirovski ’86, …] …
  • Deterministic noise [Engl et al. ’96], stochastic noise […, Buhlmann, Yu ’02 (L2 Boosting), Bissantz et al. ’07]
  • Extensions to noise in the operator [Nemirovski ’86, …]
  • Nonlinear problems [Kaltenbacher et al. ’08]
  • Banach spaces [Schuster et al. ’12]

  16. Remarks II (R(w) = ‖w‖²). Data type II:
  • Deterministic noise: Landweber and nu-method [De Vito et al. ’06]
  • Stochastic noise/learning: Landweber and nu-method [Ong, Canu ’04, R. et al. ’04, Yao et al. ’05, Bauer et al. ’06, Caponnetto, Yao ’07, Raskutti et al. ’13]
  • …also conjugate gradient [Blanchard, Krämer ’10]
  • …and incremental gradient, aka multiple-pass SGD [R. et al. ’14]
  • …and (convex) loss, subgradient method [Lin, R., Zhou ’15]
  • Works really well in practice [Huang et al. ’14, Perronnin et al. ’13]
  • The regularization “path” comes for free

  17. Remarks III. Take-home message: the number of computations/iterations controls stability/regularization. New trade-offs? ŵ_{t+1} = ŵ_t − Φ̂*(Φ̂ŵ_t − ŷ). [Figure: as on slides 13–14, log-log plot of the empirical error ‖ŵ_t − ŵ†‖ (with ŵ† = Φ̂†ŷ) and the validation error ‖ŵ_t − w†‖ against t, showing semi-convergence.]

  18. Can we derive iterative regularization for any (strongly) convex regularization?

  19. Plan. • Part I: introduction to iterative regularization • Part II: iterative convex regularization: problem and results

  20. ŵ_{t+1} = ŵ_t − Φ̂*(Φ̂ŵ_t − ŷ). How can I tell the iteration which regularization I want to use? w† = arg min { R(w) : Φw = y }

  21. Iterative Regularization and Early Stopping. w_t = A(w_0, …, w_{t−1}, Φ, y).
  Convergence (exact data): ‖w_t − w†‖ → 0 as t → ∞.
  Convergence (noisy data): there exists t† = t†(w†, δ, η) such that ‖ŵ_{t†} − w†‖ → 0 as (δ, η) → 0.
  Error bounds: there exists t† = t†(w†, δ, η) such that ‖ŵ_{t†} − w†‖ ≤ ε(w†, δ, η).
  • Adaptivity, e.g. via discrepancy or Lepskii principles
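One standard way to choose t† adaptively, mentioned in the last bullet, is Morozov's discrepancy principle: stop at the first t with ‖Φ̂ŵ_t − ŷ‖ ≤ τδ. The sketch below applies it to the Landweber iteration; the safety factor τ, the step size, and the toy problem are illustrative assumptions, not prescriptions from the talk.

```python
import numpy as np

def landweber_discrepancy(Phi, y, delta, tau=1.5, t_max=10_000):
    """Landweber iteration stopped by Morozov's discrepancy principle:
    return the first iterate with ||Phi w_t - y|| <= tau * delta (tau > 1)."""
    step = 1.0 / np.linalg.norm(Phi, 2) ** 2
    w = np.zeros(Phi.shape[1])
    for t in range(1, t_max + 1):
        residual = Phi @ w - y
        if np.linalg.norm(residual) <= tau * delta:
            return w, t
        w = w - step * Phi.T @ residual
    return w, t_max

# Usage on a toy problem where the noise level delta is known.
rng = np.random.default_rng(0)
Phi = rng.standard_normal((60, 30)) / np.sqrt(60)
w_dag = rng.standard_normal(30)
noise = rng.standard_normal(60)
delta = 0.05
y = Phi @ w_dag + delta * noise / np.linalg.norm(noise)

w_stop, t_stop = landweber_discrepancy(Phi, y, delta)
print("stopped at t =", t_stop, "error:", np.linalg.norm(w_stop - w_dag))
```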

  22. Dual Forward-Backward (DFB). w† = arg min { R(w) : Φw = y }, with R = F + (α/2)‖·‖², F convex and lsc, α > 0. Iteration (for all t ∈ N, with γ_t = α), starting from v_0:
  w_t = prox_{α⁻¹F}(−α⁻¹Φ*v_t)
  v_{t+1} = v_t + γ_t (Φw_t − y)
  • Analogous iteration for noisy data
  • Special case of dual forward-backward splitting [Combettes et al. ’10]…
  • …also a form of augmented Lagrangian method/ADMM [see Beck, Teboulle ’14]
  • …can also be shown to be equivalent to linearized Bregmanized operator splitting [Burger, Osher et al. …]
  • Reduces to the Landweber iteration when R is just the squared norm
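A sketch of the DFB iteration above for the particular choice F = ‖·‖₁, i.e. R(w) = ‖w‖₁ + (α/2)‖w‖², for which prox_{α⁻¹F} is coordinate-wise soft thresholding. The default dual step size below is a conservative choice made for this illustration (the slide states γ_t = α), and the sparse-recovery toy problem is likewise illustrative.

```python
import numpy as np

def soft_threshold(v, thresh):
    """prox of thresh * ||.||_1 (coordinate-wise soft thresholding)."""
    return np.sign(v) * np.maximum(np.abs(v) - thresh, 0.0)

def dual_forward_backward(Phi, y, alpha, n_iter, gamma=None):
    """DFB for  min ||w||_1 + (alpha/2) ||w||^2  s.t.  Phi w = y.
    Primal step: w_t = prox_{F/alpha}(-Phi^T v_t / alpha), with F = ||.||_1.
    Dual step:   v_{t+1} = v_t + gamma (Phi w_t - y).
    The default gamma is a conservative choice (an assumption of this sketch)."""
    if gamma is None:
        gamma = alpha / np.linalg.norm(Phi, 2) ** 2
    v = np.zeros(Phi.shape[0])                                 # v_0 = 0
    w = np.zeros(Phi.shape[1])
    for _ in range(n_iter):
        w = soft_threshold(-Phi.T @ v / alpha, 1.0 / alpha)    # forward-backward (prox) step
        v = v + gamma * (Phi @ w - y)                          # dual gradient step
    return w

# Usage: sparse recovery from an underdetermined, exactly consistent system.
rng = np.random.default_rng(0)
Phi = rng.standard_normal((40, 100)) / np.sqrt(40)
w_dag = np.zeros(100)
w_dag[:5] = 3.0
y = Phi @ w_dag

w_hat = dual_forward_backward(Phi, y, alpha=0.5, n_iter=20_000)
print(np.linalg.norm(w_hat - w_dag), int(np.sum(np.abs(w_hat) > 1e-2)))
```

Note that with F = 0 the primal step reduces to w_t = −α⁻¹Φ*v_t, and eliminating v gives w_{t+1} = w_t − (γ/α)Φ*(Φw_t − y), a Landweber step, consistent with the last bullet above.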

  23. Analysis for Data Type I [R., Villa, Vu et al. ’14]. Error decomposition: ‖ŵ_t − w†‖ ≤ ‖ŵ_t − w_t‖ + ‖w_t − w†‖, with ‖w_t − w†‖ ≤ ‖v†‖/(α√t).
  Theorem. If there exists v† ∈ G such that Φ*v† ∈ ∂R(w†), then the DFB sequence (w_t)_t with v_0 = 0 satisfies ‖w_t − w†‖ ≤ ‖v†‖/(α√t).
  Proof idea: (α/2)‖w_t − w†‖² ≤ D(v_t) − D(v†), where D is the dual objective.

  24. Analysis for Data Type I [R., Villa, Vu et al. ’14]. Error decomposition: ‖ŵ_t − w†‖ ≤ ‖ŵ_t − w_t‖ + ‖w_t − w†‖ ≤ cδt + ‖v†‖/(α√t).
  Theorem. Let (w_t)_t, (ŵ_t)_t be the DFB sequences with v̂_0 = v_0 = 0. Then ‖ŵ_t − w_t‖ ≤ 2tδ/‖Φ‖.

  25. Analysis for Data Type I [R., Villa, Vu et al. ’14]. Combining the two bounds, ‖ŵ_t − w†‖ ≤ ‖ŵ_t − w_t‖ + ‖w_t − w†‖ ≤ cδt + ‖v†‖/(α√t); balancing the two terms in t gives the early stopping rule t† = cδ^{−2/3}, for which ‖ŵ_{t†} − w†‖ ≤ cδ^{1/3}.

  26. Analysis for Data Type II [R., Villa, Vu et al. ’14]. Error decomposition: ‖ŵ_t − w†‖ ≤ ‖ŵ_t − w_t‖ + ‖w_t − w†‖ ≤ (δ + η)(1 + c)^t + ‖v†‖/(α√t). Choosing t† = c log(1/(δ + η)) gives ‖ŵ_{t†} − w†‖ ≤ c/√(log(1/(δ + η))).

  27. Remarks. Data type I:
  • General convex setting: only weak convergence [Burger, Osher et al. ~’09–’10], no stability results, no strong convergence.
  • Sparsity-based regularization [Osher et al. ’14]
  Data type II:
  • No previous results, either for convergence or for error bounds.
  • Directly gives results for statistical learning.
  • Acceleration is possible, but stability is harder to prove (e.g. via dual FISTA, Chambolle-Pock…)
  • Polynomial estimates of the variance under stronger conditions (satisfied in certain smooth cases, e.g. Landweber)
  • Connections to the regularization path, e.g. Lasso path/LARS results…
