Aaron Mishkin
Research Goal: reliable and easy-to-use optimizers for ML.
Challenges in Optimization for ML
Stochastic gradient methods are the most popular algorithms for fitting ML models,
SGD: w_{k+1} = w_k − η_k ∇ f̃(w_k),
where ∇ f̃ is a stochastic estimate of ∇ f. But practitioners face major challenges with
• Speed: the step-size decay schedule controls the convergence rate.
• Stability: hyper-parameters must be tuned carefully.
• Generalization: optimizers encode statistical trade-offs.
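For concreteness, here is a minimal NumPy sketch of the SGD update above on a toy least-squares problem; the problem setup, the 1/√k decay schedule, and all names (grad_fi, eta0, and so on) are illustrative assumptions rather than anything taken from the slides.

    import numpy as np

    rng = np.random.default_rng(0)
    n, d = 100, 20
    X = rng.standard_normal((n, d))
    y = X @ rng.standard_normal(d)                 # toy regression targets

    def grad_fi(w, i):
        # Gradient of the per-example loss f_i(w) = 0.5 * (x_i^T w - y_i)^2.
        return (X[i] @ w - y[i]) * X[i]

    eta0 = 1.0 / np.max(np.sum(X**2, axis=1))      # scale to the largest per-example smoothness
    w = np.zeros(d)
    for k in range(5000):
        i = rng.integers(n)                        # sample one component f_i uniformly
        eta_k = eta0 / np.sqrt(k + 1)              # a typical decaying step-size schedule
        w = w - eta_k * grad_fi(w, i)              # SGD: w_{k+1} = w_k - eta_k * grad f~(w_k)

With a decaying schedule like this, progress slows down over time, which is the Speed issue above; the interpolation setting on the next slide is what makes a constant step-size possible.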
Better Optimization via Better Models
Idea: exploit model properties for better optimization.
Consider minimizing f(w) = (1/n) Σ_{i=1}^{n} f_i(w). We say f satisfies interpolation if
∀ w, f(w*) ≤ f(w) ⟹ f_i(w*) ≤ f_i(w) for every component f_i.
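To make the definition concrete, the sketch below (an assumed example, not from the slides) builds an over-parameterized, noiseless least-squares problem; a minimizer w* of the average loss drives every per-example loss to zero and therefore minimizes each f_i, so interpolation holds.

    import numpy as np

    rng = np.random.default_rng(0)
    n, d = 50, 200                                 # over-parameterized: d > n
    X = rng.standard_normal((n, d))
    y = X @ rng.standard_normal(d)                 # noiseless targets, so the data can be fit exactly

    # A minimizer of f(w) = (1/n) sum_i 0.5 * (x_i^T w - y_i)^2 (the minimum-norm one).
    w_star = np.linalg.lstsq(X, y, rcond=None)[0]

    per_example_losses = 0.5 * (X @ w_star - y) ** 2
    print(per_example_losses.max())                # ~0: w* also minimizes every f_i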
First Steps: Constant Step-size SGD
Interpolation and smoothness imply a noise bound,
E‖∇f_i(w)‖² ≤ C (f(w) − f(w*)).
• SGD converges with a constant step-size [1, 5].
• SGD is as fast as gradient descent.
• SGD converges to the
  ◦ minimum L2-norm solution for linear regression [7].
  ◦ max-margin solution for logistic regression [4].
Takeaway: optimization speed and (some) statistical trade-offs.
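The following sketch (again an assumed toy setup) runs constant step-size SGD on an interpolating least-squares problem; started from zero, the iterates fit the data and end up near the minimum L2-norm solution, in line with the linear-regression bullet above [7].

    import numpy as np

    rng = np.random.default_rng(0)
    n, d = 50, 200                                 # over-parameterized and noiseless: interpolation holds
    X = rng.standard_normal((n, d))
    y = X @ rng.standard_normal(d)

    eta = 1.0 / np.max(np.sum(X**2, axis=1))       # constant step-size tied to the max per-example smoothness
    w = np.zeros(d)
    for _ in range(20000):
        i = rng.integers(n)
        w = w - eta * (X[i] @ w - y[i]) * X[i]     # constant step-size SGD step on f_i

    w_min_norm = np.linalg.lstsq(X, y, rcond=None)[0]
    print(np.max(np.abs(X @ w - y)))               # training residual is ~0
    print(np.linalg.norm(w - w_min_norm))          # close to the minimum L2-norm solution

Each update adds a multiple of a data point x_i, so starting from zero keeps the iterates in the row space of the data, which is why the limit is the minimum-norm interpolant.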
Current Work: Robust Parameter-free SGD
We can even pick η_k using a backtracking line-search [6]!
Armijo Condition: f_i(w_{k+1}) ≤ f_i(w_k) − c · η_k ‖∇f_i(w_k)‖².
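Below is a simplified sketch of a stochastic Armijo backtracking step in the spirit of [6]; the constants (c = 0.1, the halving factor, the optimistic doubling reset) and the toy problem are assumptions, and the method in [6] includes further safeguards not shown here.

    import numpy as np

    rng = np.random.default_rng(0)
    n, d = 50, 200
    X = rng.standard_normal((n, d))
    y = X @ rng.standard_normal(d)                 # interpolating least-squares problem

    def fi(w, i):
        return 0.5 * (X[i] @ w - y[i]) ** 2

    def grad_fi(w, i):
        return (X[i] @ w - y[i]) * X[i]

    def armijo_step(w, i, eta, c=0.1, beta=0.5, max_backtracks=50):
        # Backtrack until f_i(w - eta * g) <= f_i(w) - c * eta * ||g||^2 on the sampled example.
        g = grad_fi(w, i)
        g_norm2 = g @ g
        loss = fi(w, i)
        for _ in range(max_backtracks):
            if fi(w - eta * g, i) <= loss - c * eta * g_norm2:
                break
            eta = beta * eta                       # shrink the trial step-size
        return w - eta * g, eta

    w, eta = np.zeros(d), 1.0
    for k in range(2000):
        i = rng.integers(n)
        w, eta = armijo_step(w, i, 2.0 * eta)      # optimistic reset, then backtrack as needed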
Stochastic Line-Searches in Practice
[Figure: classification accuracy for ResNet-34 models trained on MNIST, CIFAR-10, and CIFAR-100.]
Questions.
Bonus: Robust Acceleration for SGD
[Figure: training loss vs. iterations on a synthetic matrix factorization problem, comparing Adam, SGD + Armijo, and Nesterov + Armijo.]
Stochastic acceleration is possible [3, 5], but
• it is unstable with the backtracking Armijo line-search; and
• the "acceleration" parameter must be fine-tuned.
Potential Solutions:
• more sophisticated line-searches (e.g., FISTA [2]).
• stochastic restarts for oscillations.
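As one hedged illustration of the "stochastic restarts" idea, the sketch below adds Nesterov-style momentum to SGD and resets the momentum whenever the sampled loss increases; the restart test, the fixed momentum parameter, and the toy problem are all assumptions rather than the method proposed on this slide.

    import numpy as np

    rng = np.random.default_rng(0)
    n, d = 50, 200
    X = rng.standard_normal((n, d))
    y = X @ rng.standard_normal(d)                 # interpolating least-squares problem

    def fi(w, i):
        return 0.5 * (X[i] @ w - y[i]) ** 2

    def grad_fi(w, i):
        return (X[i] @ w - y[i]) * X[i]

    eta = 1.0 / np.max(np.sum(X**2, axis=1))       # safe constant step-size, for simplicity
    w, v, momentum = np.zeros(d), np.zeros(d), 0.9
    for k in range(20000):
        i = rng.integers(n)
        v = momentum * v - eta * grad_fi(w + momentum * v, i)  # Nesterov-style look-ahead gradient
        w_next = w + v
        if fi(w_next, i) > fi(w, i):               # oscillation detected on the sampled example
            v = np.zeros(d)                        # restart: drop the momentum
            w_next = w - eta * grad_fi(w, i)       # fall back to a plain SGD step
        w = w_next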
References I
[1] Raef Bassily, Mikhail Belkin, and Siyuan Ma. On exponential convergence of SGD in non-convex over-parametrized learning. arXiv preprint arXiv:1811.02564, 2018.
[2] Amir Beck and Marc Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183–202, 2009.
[3] Chaoyue Liu and Mikhail Belkin. Accelerating SGD with momentum for over-parameterized learning. In ICLR, 2020.
[4] Mor Shpigel Nacson, Nathan Srebro, and Daniel Soudry. Stochastic gradient descent on separable data: Exact convergence with a fixed learning rate. arXiv preprint arXiv:1806.01796, 2018.
References II
[5] Sharan Vaswani, Francis Bach, and Mark Schmidt. Fast and faster convergence of SGD for over-parameterized models and an accelerated perceptron. In AISTATS, pages 1195–1204, 2019.
[6] Sharan Vaswani, Aaron Mishkin, Issam H. Laradji, Mark Schmidt, Gauthier Gidel, and Simon Lacoste-Julien. Painless stochastic gradient: Interpolation, line-search, and convergence rates. In NeurIPS, pages 3727–3740, 2019.
[7] Ashia C. Wilson, Rebecca Roelofs, Mitchell Stern, Nati Srebro, and Benjamin Recht. The marginal value of adaptive gradient methods in machine learning. In NeurIPS, pages 4148–4158, 2017.