Global convergence of gradient descent for non-convex learning problems
Francis Bach
INRIA - École Normale Supérieure, Paris, France
Joint work with Lénaïc Chizat
Institut Henri Poincaré - April 5, 2019
Machine learning: scientific context

• Proliferation of digital data
  – Personal data
  – Industry
  – Scientific: from bioinformatics to the humanities
• Need for automated processing of massive data
• Series of "hypes": Big data → Data science → Machine Learning → Deep Learning → Artificial Intelligence
• Healthy interactions between theory, applications, and hype?
Recent progress in perception (vision, audio, text)

[Figures: "person ride dog" example; screenshot from translate.google.fr; image from Peyré et al. (2017)]

(1) Massive data
(2) Computing power
(3) Methodological and scientific progress

"Intelligence" = models + algorithms + data + computing power
Parametric supervised machine learning

• Data: n observations (x_i, y_i) ∈ X × Y, i = 1, ..., n
• Prediction function h(x, θ) ∈ R parameterized by θ ∈ R^d
• Advertising: n > 10^9
  – Φ(x) ∈ {0, 1}^d, d > 10^9
  – Navigation history + ad
• Linear predictions: h(x, θ) = θ⊤Φ(x)
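As a toy illustration (not from the slides), a linear prediction of this form can be written in a few lines; the binary feature map phi and the dimension d = 8 below are made-up stand-ins for Φ and the d > 10^9 of the advertising example.

import numpy as np

def phi(x):
    # Hypothetical binary feature map Phi(x) in {0, 1}^d (e.g., indicators of navigation history).
    d = 8
    features = np.zeros(d)
    features[np.asarray(x) % d] = 1.0
    return features

def h_linear(x, theta):
    # Linear prediction h(x, theta) = theta^T Phi(x).
    return theta @ phi(x)

theta = np.random.default_rng(0).standard_normal(8)
print(h_linear([2, 5], theta))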
[Figure: six example images x_1, ..., x_6 with labels y_1 = y_2 = y_3 = 1 and y_4 = y_5 = y_6 = −1]

• Neural networks (n, d > 10^6):
  h(x, θ) = θ_m⊤ σ(θ_{m−1}⊤ σ(··· θ_2⊤ σ(θ_1⊤ x) ···))

[Figure: network diagram with weight matrices θ_1, θ_2, θ_3 mapping input x to output y]
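Below is a minimal sketch of this multi-layer prediction function, assuming a ReLU activation for σ and small illustrative layer sizes (neither is specified on the slide):

import numpy as np

def sigma(z):
    # Activation function; ReLU is one common choice (the slides leave sigma generic).
    return np.maximum(z, 0.0)

def h_network(x, thetas):
    # h(x, theta) = theta_m^T sigma(theta_{m-1}^T sigma(... theta_1^T x ...))
    z = x
    for theta in thetas[:-1]:
        z = sigma(theta.T @ z)
    return thetas[-1].T @ z

# Illustrative dimensions only: a three-layer network as in the slide's figure.
rng = np.random.default_rng(0)
thetas = [rng.standard_normal((4, 6)), rng.standard_normal((6, 5)), rng.standard_normal((5, 1))]
x = rng.standard_normal(4)
print(h_network(x, thetas))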
• (Regularized) empirical risk minimization:
  min_{θ ∈ R^d} (1/n) Σ_{i=1}^n ℓ(y_i, h(x_i, θ)) + λ Ω(θ) = (1/n) Σ_{i=1}^n f_i(θ)
  (data-fitting term + regularizer)
• Actual goal: minimize the test error E_{p(x,y)} ℓ(y, h(x, θ))
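A short sketch of this objective, assuming for illustration the logistic loss for ℓ, linear predictions for h, and Ω(θ) = ‖θ‖²/2 (none of these choices are imposed by the slide):

import numpy as np

def logistic_loss(y, score):
    # One possible convex loss l(y, h); the slides keep l generic.
    return np.log1p(np.exp(-y * score))

def empirical_risk(theta, X, y, lam):
    # (1/n) sum_i l(y_i, h(x_i, theta)) + lambda * Omega(theta),
    # here with linear predictions h(x, theta) = x^T theta and Omega(theta) = ||theta||^2 / 2.
    scores = X @ theta
    return np.mean(logistic_loss(y, scores)) + lam * 0.5 * np.dot(theta, theta)

rng = np.random.default_rng(1)
X = rng.standard_normal((100, 5))
y = np.sign(rng.standard_normal(100))
print(empirical_risk(np.zeros(5), X, y, lam=0.1))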
Convex optimization problems

• Convexity in machine learning
  – Convex loss and linear predictions h(x, θ) = θ⊤Φ(x)
• (Approximately) matching theory and practice
  – Fruitful discussions between theoreticians and practitioners
  – Quantitative theoretical analysis suggests practical improvements
• Golden years of convexity in machine learning (1995 to 201*)
  – Support vector machines and kernel methods
  – Inference in graphical models
  – Sparsity / low-rank models with first-order methods
  – Convex relaxations of unsupervised learning problems
  – Optimal transport
  – Stochastic methods for large-scale learning and online learning
Exponentially convergent SGD for smooth finite sums

• Finite sums: min_{θ ∈ R^d} (1/n) Σ_{i=1}^n f_i(θ) = (1/n) Σ_{i=1}^n ℓ(y_i, h(x_i, θ)) + λ Ω(θ)
• Non-accelerated algorithms (with similar properties)
  – SAG (Le Roux, Schmidt, and Bach, 2012)
  – SDCA (Shalev-Shwartz and Zhang, 2013)
  – SVRG (Johnson and Zhang, 2013; Zhang et al., 2013)
  – MISO (Mairal, 2015), Finito (Defazio et al., 2014a)
  – SAGA (Defazio, Bach, and Lacoste-Julien, 2014b), etc.
  with updates of the form θ_t = θ_{t−1} − γ [ ∇f_{i(t)}(θ_{t−1}) + (1/n) Σ_{i=1}^n y_i^{t−1} − y_{i(t)}^{t−1} ] (see the sketch below)
• Accelerated algorithms
  – Shalev-Shwartz and Zhang (2014); Nitanda (2014)
  – Lin et al. (2015b); Defazio (2016), etc.
  – Catalyst (Lin, Mairal, and Harchaoui, 2015a)
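A minimal sketch of the SAGA-style update written above, maintaining a table of stored gradients y_i and their running average; the least-squares gradients, step size γ, and problem sizes in the usage lines are illustrative choices, not from the slides:

import numpy as np

def saga(grad_fi, theta0, n, gamma, iters, seed=0):
    # SAGA-style update: theta_t = theta_{t-1} - gamma * ( grad_f_i(theta_{t-1})
    #                     - y_i^{t-1} + (1/n) * sum_j y_j^{t-1} ),
    # where y_i stores the last gradient computed for example i (initialized at 0 here).
    rng = np.random.default_rng(seed)
    theta = theta0.copy()
    y = np.zeros((n, theta0.size))   # table of stored per-example gradients
    y_mean = y.mean(axis=0)          # running average (1/n) * sum_i y_i
    for _ in range(iters):
        i = rng.integers(n)
        g = grad_fi(i, theta)
        theta = theta - gamma * (g - y[i] + y_mean)
        y_mean = y_mean + (g - y[i]) / n
        y[i] = g
    return theta

# Usage on a small ridge-regression example:
# f_i(theta) = 0.5 * (x_i^T theta - y_i)^2 + 0.5 * lam * ||theta||^2.
rng = np.random.default_rng(1)
X, targets = rng.standard_normal((200, 10)), rng.standard_normal(200)
lam = 0.1
grad_fi = lambda i, th: X[i] * (X[i] @ th - targets[i]) + lam * th
theta_hat = saga(grad_fi, np.zeros(10), n=200, gamma=0.01, iters=5000)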
Exponentially convergent SGD for finite sums

• Running-time to reach precision ε (with κ = condition number)

  Stochastic gradient descent      d × κ × (1/ε)
  Gradient descent                 d × nκ × log(1/ε)
  Accelerated gradient descent     d × n√κ × log(1/ε)
  SAG(A), SVRG, SDCA, MISO         d × (n + κ) × log(1/ε)
  Accelerated versions             d × (n + √(nκ)) × log(1/ε)

  NB: slightly different (smaller) notion of condition number for batch methods
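To make the comparison concrete with illustrative numbers (not from the slides): for n = 10^6 and κ = 10^4, each log(1/ε) factor costs on the order of d × nκ = d × 10^10 operations for gradient descent and d × n√κ = d × 10^8 for its accelerated variant, whereas SAG(A), SVRG, SDCA, and MISO need d × (n + κ) ≈ d × 10^6, roughly the cost of a single pass over the data; the accelerated versions need d × (n + √(nκ)) ≈ 1.1 × d × 10^6.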