

1. Global convergence of gradient descent for non-convex learning problems. Francis Bach, INRIA - École Normale Supérieure, Paris, France. Joint work with Lénaïc Chizat. Institut Henri Poincaré, April 5, 2019

2. Machine learning: scientific context
• Proliferation of digital data
  – Personal data
  – Industry
  – Scientific: from bioinformatics to the humanities
• Need for automated processing of massive data
• Series of “hypes”: Big data → Data science → Machine Learning → Deep Learning → Artificial Intelligence
• Healthy interactions between theory, applications, and hype?

3. Recent progress in perception (vision, audio, text)
[Images: image captioning example (“person ride dog”), machine translation from translate.google.fr, and a figure from Peyré et al. (2017)]
• (1) Massive data (2) Computing power (3) Methodological and scientific progress
• “Intelligence” = models + algorithms + data + computing power

4. Parametric supervised machine learning
• Data: n observations (x_i, y_i) ∈ X × Y, i = 1, …, n
• Prediction function h(x, θ) ∈ ℝ parameterized by θ ∈ ℝ^d
• Advertising: n > 10^9
  – Φ(x) ∈ {0, 1}^d, d > 10^9
  – Navigation history + ad
• Linear predictions
  – h(x, θ) = θ^⊤ Φ(x)
[Figure: six labeled data points x_1, …, x_6 with y_1 = y_2 = y_3 = 1 and y_4 = y_5 = y_6 = −1]
• Neural networks (n, d > 10^6): h(x, θ) = θ_m^⊤ σ(θ_{m−1}^⊤ σ(··· θ_2^⊤ σ(θ_1^⊤ x)))
[Diagram: network with parameters θ_1, θ_2, θ_3 mapping input x to output y]
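
To make the two parameterizations above concrete, here is a minimal NumPy sketch of a linear predictor h(x, θ) = θ^⊤Φ(x) and of a multilayer predictor h(x, θ) = θ_m^⊤σ(···σ(θ_1^⊤x)). The identity feature map, the ReLU choice for σ, and the layer sizes are illustrative assumptions, not part of the talk.

```python
import numpy as np

def h_linear(x, theta, phi=lambda x: x):
    """Linear prediction h(x, theta) = theta^T Phi(x); identity feature map by default."""
    return phi(x) @ theta

def h_mlp(x, thetas, sigma=lambda z: np.maximum(z, 0.0)):
    """Multilayer prediction h(x, theta) = theta_m^T sigma(... sigma(theta_1^T x)).

    `thetas` = [theta_1, ..., theta_m]; sigma (ReLU here) is applied after
    every layer except the last one.
    """
    z = x
    for theta in thetas[:-1]:
        z = sigma(theta.T @ z)
    return thetas[-1].T @ z

# Toy usage with a 5-dimensional input and two hidden layers of width 8
rng = np.random.default_rng(0)
x = rng.standard_normal(5)
print(h_linear(x, rng.standard_normal(5)))
print(h_mlp(x, [rng.standard_normal((5, 8)),
                rng.standard_normal((8, 8)),
                rng.standard_normal(8)]))
```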

5. Parametric supervised machine learning
• (Regularized) empirical risk minimization:
  min_{θ ∈ ℝ^d} (1/n) Σ_{i=1}^n ℓ(y_i, h(x_i, θ)) + λ Ω(θ) = (1/n) Σ_{i=1}^n f_i(θ)
  (data-fitting term + regularizer)
• Actual goal: minimize the test error E_{p(x,y)} ℓ(y, h(x, θ))
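
Since the rest of the talk studies gradient methods on this objective, here is a minimal sketch of regularized empirical risk minimization by plain (full) gradient descent, instantiated with a linear predictor, the logistic loss, and Ω(θ) = ½‖θ‖². The loss, step size, regularization strength, and toy data are illustrative assumptions, not the talk's setting.

```python
import numpy as np

def objective(theta, X, y, lam):
    """(1/n) sum_i loss(y_i, theta^T x_i) + lam * Omega(theta), with logistic loss and Omega = 0.5 ||theta||^2."""
    margins = y * (X @ theta)
    return np.mean(np.log1p(np.exp(-margins))) + 0.5 * lam * theta @ theta

def gradient(theta, X, y, lam):
    margins = y * (X @ theta)
    # d/dm log(1 + exp(-m)) = -1 / (1 + exp(m)), chained with m = y * x^T theta
    coeffs = -y / (1.0 + np.exp(margins))
    return X.T @ coeffs / len(y) + lam * theta

def gradient_descent(X, y, lam=0.1, step=1.0, iters=200):
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        theta -= step * gradient(theta, X, y, lam)
    return theta

# Toy data with labels in {-1, +1}
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))
y = np.sign(X @ rng.standard_normal(5))
theta_hat = gradient_descent(X, y)
print(objective(theta_hat, X, y, lam=0.1))
```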

6. Convex optimization problems
• Convexity in machine learning
  – Convex loss and linear predictions h(x, θ) = θ^⊤ Φ(x)
• (Approximately) matching theory and practice
  – Fruitful discussions between theoreticians and practitioners
  – Quantitative theoretical analysis suggests practical improvements
• Golden years of convexity in machine learning (1995 to 201*)
  – Support vector machines and kernel methods
  – Inference in graphical models
  – Sparsity / low-rank models with first-order methods
  – Convex relaxation of unsupervised learning problems
  – Optimal transport
  – Stochastic methods for large-scale learning and online learning

7. Exponentially convergent SGD for smooth finite sums
• Finite sums: min_{θ ∈ ℝ^d} (1/n) Σ_{i=1}^n f_i(θ) = (1/n) Σ_{i=1}^n ℓ(y_i, h(x_i, θ)) + λ Ω(θ)
• Non-accelerated algorithms (with similar properties)
  – SAG (Le Roux, Schmidt, and Bach, 2012)
  – SDCA (Shalev-Shwartz and Zhang, 2013)
  – SVRG (Johnson and Zhang, 2013; Zhang et al., 2013)
  – MISO (Mairal, 2015), Finito (Defazio et al., 2014a)
  – SAGA (Defazio, Bach, and Lacoste-Julien, 2014b), etc.
  – Update with stored gradients y_i^{t−1} (a code sketch follows after this slide):
    θ_t = θ_{t−1} − γ [ ∇f_{i(t)}(θ_{t−1}) + (1/n) Σ_{i=1}^n y_i^{t−1} − y_{i(t)}^{t−1} ]
• Accelerated algorithms
  – Shalev-Shwartz and Zhang (2014); Nitanda (2014)
  – Lin et al. (2015b); Defazio (2016), etc.
  – Catalyst (Lin, Mairal, and Harchaoui, 2015a)
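
A minimal sketch of the stored-gradient update displayed above, in its SAGA-style variant (the table entry for the sampled index is refreshed after each step), applied to the same ridge-regularized logistic objective as before. The step size, number of passes, and toy data are illustrative assumptions. Each iteration touches a single f_i yet uses a variance-reduced direction, which is what underlies the (n + κ) log(1/ε) rate on the next slide.

```python
import numpy as np

def saga(grad_i, n, d, step, iters, rng):
    """theta_t = theta_{t-1} - step * [ grad_i(theta_{t-1}) + mean(table) - table[i] ]."""
    theta = np.zeros(d)
    table = np.zeros((n, d))      # stored gradients y_i^{t-1}
    table_mean = np.zeros(d)      # (1/n) sum_i y_i^{t-1}, maintained incrementally
    for _ in range(iters):
        i = rng.integers(n)
        g = grad_i(i, theta)
        theta -= step * (g + table_mean - table[i])
        table_mean += (g - table[i]) / n
        table[i] = g
    return theta

# f_i(theta) = log(1 + exp(-y_i x_i^T theta)) + 0.5 * lam * ||theta||^2
rng = np.random.default_rng(0)
n, d, lam = 200, 5, 0.1
X = rng.standard_normal((n, d))
y = np.sign(X @ rng.standard_normal(d))

def grad_i(i, theta):
    margin = y[i] * (X[i] @ theta)
    return -y[i] * X[i] / (1.0 + np.exp(margin)) + lam * theta

theta_hat = saga(grad_i, n, d, step=0.2, iters=20 * n, rng=rng)
```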

8. Exponentially convergent SGD for finite sums
• Running time to reach precision ε (with κ = condition number):
  – Stochastic gradient descent: d × κ × 1/ε
  – Gradient descent: d × nκ × log(1/ε)
  – Accelerated gradient descent: d × n√κ × log(1/ε)
  – SAG(A), SVRG, SDCA, MISO: d × (n + κ) × log(1/ε)
  – Accelerated versions: d × (n + √(nκ)) × log(1/ε)
• NB: slightly different (smaller) notion of condition number for batch methods
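
To make the table concrete, a small back-of-the-envelope comparison of these costs, dropping the common factor d and all constants, for illustrative values n = 10^6, κ = 10^4, ε = 10^-6 (these numbers are not from the talk).

```python
import math

# Rough iteration-complexity estimates from the table above, ignoring the
# common dimension factor d and all constants; n, kappa, eps are illustrative.
n, kappa, eps = 1e6, 1e4, 1e-6
log_term = math.log(1.0 / eps)

costs = {
    "SGD":                    kappa / eps,
    "Gradient descent":       n * kappa * log_term,
    "Accelerated GD":         n * math.sqrt(kappa) * log_term,
    "SAG(A)/SVRG/SDCA/MISO":  (n + kappa) * log_term,
    "Accelerated versions":   (n + math.sqrt(n * kappa)) * log_term,
}
for name, cost in costs.items():
    print(f"{name:>24s}: {cost:.2e}")
```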
