Stochastic gradient descent on Riemannian manifolds
Silvère Bonnabel
Centre de Robotique - Mathématiques et systèmes, Mines ParisTech ("Ecole des Mines de Paris"), PSL Research University
silvere.bonnabel@mines-paristech
Journées du GDR ISIS, Paris, 20 November 2014
Introduction
• We proposed a stochastic gradient algorithm on a specific manifold for matrix regression in: Regression on fixed-rank positive semidefinite matrices: a Riemannian approach, Meyer, Bonnabel and Sepulchre, Journal of Machine Learning Research, 2011.
• It competed with the then state of the art for low-rank Mahalanobis distance and kernel learning.
• Convergence was left as an open question at the time.
• The material of today's presentation is the paper Stochastic gradient descent on Riemannian manifolds, IEEE Trans. on Automatic Control, 2013.
• Bottou and Bousquet have recently popularized SGD in machine learning, as randomly picking the data is a way to handle ever-increasing datasets.
Outline
1 Stochastic gradient descent
• Introduction and examples
• Standard convergence analysis (due to L. Bottou)
2 Stochastic gradient descent on Riemannian manifolds
• Introduction
• Results
3 Examples
Classical example
Linear regression: consider the linear model y = xᵀw + ν, where x, w ∈ Rᵈ, y ∈ R, and ν ∈ R is a noise.
• examples: z = (x, y)
• loss (prediction error): Q(z, w) = (y − ŷ)² = (y − xᵀw)²
• we cannot minimize the expected risk C(w) = ∫ Q(z, w) dP(z)
• we minimize the empirical risk instead: Ĉ_n(w) = (1/n) Σ_{i=1}^n Q(z_i, w).
Gradient descent
Batch gradient descent: process all examples together
w_{t+1} = w_t − γ_t (1/n) Σ_{i=1}^n ∇_w Q(z_i, w_t)
Stochastic gradient descent: process examples one by one
w_{t+1} = w_t − γ_t ∇_w Q(z_t, w_t)
for some random example z_t = (x_t, y_t).
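A minimal sketch of the two updates above on simulated linear-regression data (the data, step sizes, and iteration counts are illustrative assumptions, not from the slides):

```python
import numpy as np

# Toy linear model y = x^T w + noise, loss Q(z, w) = (y - x^T w)^2.
rng = np.random.default_rng(0)
d, n = 3, 500
w_true = rng.standard_normal(d)
X = rng.standard_normal((n, d))
y = X @ w_true + 0.01 * rng.standard_normal(n)

def grad_q(x, yi, w):
    # gradient of (y - x^T w)^2 with respect to w
    return -2.0 * (yi - x @ w) * x

# Batch gradient descent: average the gradient over all n examples per step.
w_batch = np.zeros(d)
for t in range(200):
    g = sum(grad_q(X[i], y[i], w_batch) for i in range(n)) / n
    w_batch -= 0.1 * g

# Stochastic gradient descent: one randomly picked example per step,
# with a decreasing step size gamma_t (schedule chosen for illustration).
w_sgd = np.zeros(d)
for t in range(1, 5001):
    i = rng.integers(n)
    w_sgd -= (0.05 / t**0.55) * grad_q(X[i], y[i], w_sgd)

print(np.linalg.norm(w_batch - w_true), np.linalg.norm(w_sgd - w_true))
```

Both estimates approach w_true; SGD touches one example per step, so each iteration is n times cheaper.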
Gradient descent
⇒ SGD is the well-known identification algorithm for Wiener-ARMAX systems:
y_t = Σ_{i=1}^n a_i y_{t−i} + Σ_{i=1}^m b_i u_{t−i} + v_t = ψ_tᵀ w + v_t
with loss Q(y_t, w_t) = (y_t − ψ_tᵀ w_t)².
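A hedged sketch of this recursive identification on the simplest case, a first-order ARX model y_t = a·y_{t−1} + b·u_{t−1} + v_t (the system coefficients, noise level, and step schedule are assumptions for illustration; the factor 2 of the gradient is absorbed into γ_t):

```python
import numpy as np

# Recursive (LMS-style) identification of w = (a, b) from a stream of
# input/output pairs, via the SGD update on Q = (y_t - psi_t^T w)^2.
rng = np.random.default_rng(1)
a_true, b_true = 0.5, 1.0
y_prev = 0.0
w = np.zeros(2)
for t in range(1, 20001):
    u = rng.standard_normal()
    y = a_true * y_prev + b_true * u + 0.01 * rng.standard_normal()
    psi = np.array([y_prev, u])        # regressor psi_t
    gamma = 0.5 / t**0.6               # decreasing step size
    w += gamma * (y - psi @ w) * psi   # w <- w + gamma_t (y_t - psi_t^T w) psi_t
    y_prev = y
print(w)  # approaches (a_true, b_true) = (0.5, 1.0)
```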
Stochastic versus online
Stochastic: examples drawn randomly from a finite set
• SGD minimizes the empirical risk
Online: examples drawn with unknown dP(z)
• SGD minimizes the expected risk (+ tracking property)
Stochastic approximation: approximate a sum by a stream of single elements
Stochastic versus batch
SGD can converge very slowly: for a long sequence, ∇_w Q(z_t, w_t) may be a very bad approximation of
∇_w Ĉ_n(w_t) = (1/n) Σ_{i=1}^n ∇_w Q(z_i, w_t)
SGD can converge very fast when there is redundancy
• extreme case: z_1 = z_2 = · · ·
Some examples
Least mean squares:
• Loss: Q(z, w) = (y − ŷ)² = (y − xᵀw)²
• Update: w_{t+1} = w_t − γ_t ∇_w Q(z_t, w_t) = w_t + 2 γ_t (y_t − ŷ_t) x_t
Robbins-Monro algorithm (1951): C smooth with a unique minimum ⇒ the algorithm converges in L²
k-means: MacQueen (1967)
• Procedure: pick z_t, assign it to the nearest centroid wᵏ
• Update: wᵏ_{t+1} = wᵏ_t + γ_t (z_t − wᵏ_t)
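A minimal sketch of MacQueen's online k-means on two synthetic clusters (the cluster centers, initialization, and step rule γ_t = 1/count are illustrative assumptions): draw a point, assign it to the nearest centroid, and move that centroid toward the point.

```python
import numpy as np

# Two well-separated clusters around (0,0) and (5,5); centroids start nearby.
rng = np.random.default_rng(2)
centers_true = np.array([[0.0, 0.0], [5.0, 5.0]])
w = np.array([[1.0, 1.0], [4.0, 4.0]])             # initial centroids
counts = np.zeros(2)
for t in range(5000):
    z = centers_true[rng.integers(2)] + 0.3 * rng.standard_normal(2)
    k = np.argmin(np.linalg.norm(w - z, axis=1))   # nearest centroid wins
    counts[k] += 1
    w[k] += (z - w[k]) / counts[k]                 # gamma_t = 1/count -> running mean
print(w)
```

With γ_t = 1/count each centroid is exactly the running mean of the points assigned to it.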
Some examples
Ballistics example (old), an early instance of adaptive control:
• optimize the trajectory of a projectile in fluctuating wind
• apply successive gradient corrections to the launching angle
• with γ_t → 0 the angle stabilizes to an optimal value
Another example: mean
Computing a mean: total loss (1/n) Σ_i ‖z_i − w‖²
Minimum: w − (1/n) Σ_i z_i = 0, i.e. w is the mean of the points z_i
Stochastic gradient: w_{t+1} = w_t − γ_t (w_t − z_i), where z_i is randomly picked
(what if ‖·‖ is replaced with some more exotic distance?)
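A quick sketch of this iteration (the data are illustrative): with the step choice γ_t = 1/(t+1), the stochastic gradient w_{t+1} = w_t − γ_t (w_t − z_t) reproduces the running mean of the points exactly.

```python
import numpy as np

rng = np.random.default_rng(3)
z = rng.standard_normal((1000, 2)) + np.array([2.0, -1.0])
w = z[0].copy()
for t, zt in enumerate(z[1:], start=1):
    gamma = 1.0 / (t + 1)
    w -= gamma * (w - zt)          # pull w toward the sampled point
print(w, z.mean(axis=0))           # identical up to rounding
```

Indeed, w after t points is ((t−1)·w + z_t)/t, the arithmetic mean of the points seen so far.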
Outline
1 Stochastic gradient descent
• Introduction and examples
• Standard convergence analysis (due to L. Bottou)
2 Stochastic gradient descent on Riemannian manifolds
• Introduction
• Results
3 Examples
Notation
Expected cost: C(w) := E_z[Q(z, w)] = ∫ Q(z, w) dP(z)
Approximated gradient under the event z, denoted H(z, w):
E_z[H(z, w)] = ∇ ∫ Q(z, w) dP(z) = ∇C(w)
Stochastic gradient update: w_{t+1} ← w_t − γ_t H(z_t, w_t)
Convergence results
Convex case: known as the Robbins-Monro algorithm. Convergence to the global minimum of C(w) in mean, and almost surely.
Nonconvex case: C(w) is generally not convex. We are interested in proving
• almost sure convergence
• a.s. convergence of C(w_t)
• ... to a local minimum
• ∇C(w_t) → 0 a.s.
Provable under a set of reasonable assumptions.
Assumptions
Step sizes: the steps must decrease. Classically
Σ γ_t² < ∞ and Σ γ_t = +∞
The sequence γ_t = t^(−α) provides examples for 1/2 < α ≤ 1.
Cost regularity: the averaged cost C(w) is 3 times differentiable (relaxable).
Sketch of the proof:
1 confinement: w_t remains a.s. in a compact set.
2 convergence: ∇C(w_t) → 0 a.s.
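A numerical illustration (not from the slides) of the two step-size conditions for γ_t = t^(−α): the squared sum must converge (true for α > 1/2) while the plain sum must diverge (true for α ≤ 1).

```python
import numpy as np

t = np.arange(1, 10**6 + 1, dtype=float)
sums = {}
for alpha in (0.6, 1.0):
    gamma = t**(-alpha)
    sums[alpha] = (gamma.sum(), (gamma**2).sum())
    print(alpha, sums[alpha])
# For alpha = 0.6 the partial sum of gamma_t keeps growing like t^0.4,
# while the partial sum of gamma_t^2 stays bounded by zeta(1.2) ≈ 5.6;
# for alpha = 1 the plain sum still diverges, but only like log t.
```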
Confinement
Main difficulties:
1 Only an approximation of the cost is available.
2 We are in discrete time.
Approximation: the noise can generate unbounded trajectories with small but nonzero probability.
Discrete time: even without noise there are difficulties, as there is no line search.
So? Confinement to a compact set holds under a set of assumptions: see L. Bottou, Online Algorithms and Stochastic Approximations, 1998.
Convergence (simplified)
Confinement:
• All trajectories can be assumed to remain in a compact set
• All continuous functions of w_t are bounded
Convergence: letting h_t = C(w_t) > 0, a second order Taylor expansion gives
h_{t+1} − h_t ≤ −2 γ_t ⟨H(z_t, w_t), ∇C(w_t)⟩ + γ_t² ‖H(z_t, w_t)‖² K_1
with K_1 an upper bound on ∇²C, and ‖H(z_t, w_t)‖² < A.
Convergence (in a nutshell)
We have just proved
h_{t+1} − h_t ≤ −2 γ_t ⟨H(z_t, w_t), ∇C(w_t)⟩ + γ_t² A K_1
Conditioning with respect to F_t = {z_0, · · · , z_{t−1}, w_0, · · · , w_t} and letting
g_t := h_t + A K_1 Σ_{s=t}^∞ γ_s² ≥ 0
we have
E[g_{t+1} − g_t | F_t] ≤ −2 γ_t ‖∇C(w_t)‖² ≤ 0.
Thus g_t is a supermartingale, so it converges a.s., and
0 ≤ Σ_t 2 γ_t ‖∇C(w_t)‖² < ∞.
As Σ γ_t = ∞, ∇C(w_t) converges a.s. to 0.
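An empirical check of this conclusion on a toy nonconvex cost (the cost, noise model, and step schedule are assumptions for illustration): run SGD with Robbins-Monro steps on C(w) = (‖w‖² − 1)², feeding it the exact gradient plus zero-mean noise as H(z_t, w_t), and watch ‖∇C(w_t)‖ decay toward 0.

```python
import numpy as np

rng = np.random.default_rng(4)

def grad_C(w):
    # gradient of C(w) = (|w|^2 - 1)^2, whose minima form the unit circle
    return 4.0 * (w @ w - 1.0) * w

w = np.array([2.0, 0.5])
for t in range(1, 100001):
    H = grad_C(w) + rng.standard_normal(2)   # unbiased noisy gradient
    w -= (0.1 / t**0.7) * H                  # gamma_t = 0.1 t^(-0.7)
print(np.linalg.norm(grad_C(w)))             # decays toward 0
```

The iterate settles on the circle of minima even though C is nonconvex, matching the ∇C(w_t) → 0 guarantee (which promises criticality, not a global minimum).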
Outline
1 Stochastic gradient descent
• Introduction and examples
• Standard convergence analysis
2 Stochastic gradient descent on Riemannian manifolds
• Introduction
• Results
3 Examples
Connected Riemannian manifold
Riemannian manifold: a space with local coordinates around any point
Tangent space: the linear space of velocities at a point of curves through that point
Riemannian metric: a scalar product ⟨u, v⟩_g on each tangent space
Riemannian manifolds
A Riemannian manifold carries the structure of a metric space whose distance function is the arclength of a minimizing path between two points.
Length of a curve c(t) ∈ M:
L = ∫_a^b √⟨ċ(t), ċ(t)⟩_g dt = ∫_a^b ‖ċ(t)‖ dt
Geodesic: curve of minimal length joining sufficiently close x and y.
Exponential map: exp_x(v) is the point z ∈ M situated on the geodesic with initial position-velocity (x, v), at distance ‖v‖ from x.
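A concrete sketch on the unit sphere S² (a standard example, not one from the slides): there the exponential map has the closed form exp_x(v) = cos(‖v‖) x + sin(‖v‖) v/‖v‖ for a tangent vector v ⊥ x, and the geodesic distance is the great-circle arc length d(x, y) = arccos(x·y).

```python
import numpy as np

def exp_sphere(x, v):
    # exponential map on the unit sphere; v must be tangent at x (v ⊥ x)
    nv = np.linalg.norm(v)
    if nv < 1e-12:
        return x
    return np.cos(nv) * x + np.sin(nv) * (v / nv)

x = np.array([1.0, 0.0, 0.0])
v = np.array([0.0, np.pi / 2, 0.0])        # tangent at x, length pi/2
y = exp_sphere(x, v)                       # a quarter turn: (0, 1, 0)
print(np.arccos(np.clip(x @ y, -1, 1)))    # geodesic distance = |v| = pi/2
```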
Consider f : M → R twice differentiable.
Riemannian gradient: the tangent vector at x satisfying
d/dt|_{t=0} f(exp_x(tv)) = ⟨v, ∇f(x)⟩_g
Riemannian Hessian: based on the Taylor expansion
f(exp_x(tv)) = f(x) + t ⟨v, ∇f(x)⟩_g + (1/2) t² vᵀ [Hess f(x)] v + O(t³)
Second order Taylor bound:
f(exp_x(tv)) − f(x) ≤ t ⟨v, ∇f(x)⟩_g + (t²/2) ‖v‖²_g k
where k is a bound on the Hessian along the geodesic.
Riemannian SGD on M
Riemannian approximated gradient: E_z[H(z_t, w_t)] = ∇C(w_t), a tangent vector!
Stochastic gradient descent on M: update
w_{t+1} ← exp_{w_t}(−γ_t H(z_t, w_t))
The exponential map guarantees that w_{t+1} remains on M!
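A toy sketch of this update on the sphere (an illustrative assumption, not the paper's fixed-rank matrix example): compute the Riemannian (Karcher) mean of noisy directions. For Q(z, w) = d(w, z)²/2 the Riemannian gradient is −log_w(z), so the step reads w ← exp_w(γ_t log_w(z)), and every iterate stays on S² by construction.

```python
import numpy as np

def exp_s(x, v):
    # exponential map on the unit sphere (v tangent at x)
    nv = np.linalg.norm(v)
    return x if nv < 1e-12 else np.cos(nv) * x + np.sin(nv) * v / nv

def log_s(x, y):
    # inverse of the exponential map: tangent vector at x pointing to y
    c = np.clip(x @ y, -1.0, 1.0)
    u = y - c * x                      # project y onto the tangent space at x
    nu = np.linalg.norm(u)
    return np.zeros_like(x) if nu < 1e-12 else np.arccos(c) * u / nu

rng = np.random.default_rng(5)
mu = np.array([0.0, 0.0, 1.0])         # true mean direction
w = np.array([1.0, 0.0, 0.0])          # start far from it
for t in range(1, 3001):
    z = mu + 0.2 * rng.standard_normal(3)
    z /= np.linalg.norm(z)             # noisy sample on the sphere
    w = exp_s(w, (1.0 / t) * log_s(w, z))   # Riemannian SGD step, stays on S^2
print(w)   # close to mu = (0, 0, 1)
```

Note the contrast with a Euclidean update w − γ_t(w − z), which would leave the sphere and require an ad hoc projection.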
Outline
1 Stochastic gradient descent
• Introduction and examples
• Standard convergence analysis
2 Stochastic gradient descent on Riemannian manifolds
• Introduction
• Results
3 Examples