  1. Parallel Optimization in Machine Learning Fabian Pedregosa December 19, 2017 Huawei Paris Research Center

  2. About me • Engineer (2010-2012), Inria Saclay (scikit-learn kickstart). • PhD (2012-2015), Inria Saclay. • Postdoc (2015-2016), Dauphine–ENS–Inria Paris. • Postdoc (2017-present), UC Berkeley - ETH Zurich (Marie-Curie fellowship, European Commission) Hacker at heart ... trapped in a researcher’s body. 1/32

  5. Motivation Computer ad in 1993 vs. computer ad in 2006. What has changed? The 2006 ad no longer mentions processor speed; the primary advertised feature is the number of cores. 2/32

  9. 40 years of CPU trends • The speed of CPUs has stagnated since 2005. • Multi-core architectures are here to stay. Parallel algorithms are needed to take advantage of modern CPUs. 3/32

  10. Parallel optimization Parallel algorithms can be divided into two large categories: synchronous and asynchronous. Image credits: (Peng et al. 2016). Synchronous methods: easy to implement (well-developed software packages) and well understood, but the speedup is limited by synchronization costs. Asynchronous methods: faster, with typically larger speedups, but not well understood (large gap between theory and practice) and without mature software solutions. 4/32

  11. Outline Synchronous methods • Synchronous (stochastic) gradient descent. Asynchronous methods • Asynchronous stochastic gradient descent (Hogwild) (Niu et al. 2011) • Asynchronous variance-reduced stochastic methods (Leblond, P., and Lacoste-Julien 2017), (Pedregosa, Leblond, and Lacoste-Julien 2017). • Analysis of asynchronous methods. • Code and implementation aspects. Leaving out many parallel synchronous methods: ADMM (Glowinski and Marroco 1975), CoCoA (Jaggi et al. 2014), DANE (Shamir, Srebro, and Zhang 2014), to name a few. 5/32

  12. Outline Most of the following is joint work with Rémi Leblond and Simon Lacoste-Julien Rémi Leblond Simon Lacoste–Julien 6/32

  13. Synchronous algorithms

  14. Optimization for machine learning A large part of the problems in machine learning can be framed as optimization problems of the form minimize_x f(x) := (1/n) ∑_{i=1}^n f_i(x). Gradient descent (Cauchy 1847): descend along the steepest direction −∇f(x): x⁺ = x − γ∇f(x). Stochastic gradient descent (SGD) (Robbins and Monro 1951): select a random index i and descend along −∇f_i(x): x⁺ = x − γ∇f_i(x). Image source: Francis Bach. 7/32
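As a concrete illustration of the two update rules above, here is a minimal NumPy sketch on a made-up least-squares problem with f_i(x) = ½(aᵢᵀx − bᵢ)²; the data, step size γ and iteration count are illustrative assumptions, not part of the slides.

    import numpy as np

    rng = np.random.default_rng(0)
    n, d = 100, 10
    A = rng.standard_normal((n, d))   # rows a_i (made-up data)
    b = rng.standard_normal(n)
    gamma = 0.01                      # step size

    def full_grad(x):
        # gradient of f(x) = (1/n) sum_i 0.5 * (a_i^T x - b_i)^2
        return A.T @ (A @ x - b) / n

    def partial_grad(x, i):
        # gradient of the single term f_i(x)
        return A[i] * (A[i] @ x - b[i])

    x_gd, x_sgd = np.zeros(d), np.zeros(d)
    for t in range(1000):
        x_gd = x_gd - gamma * full_grad(x_gd)            # gradient descent step
        i = rng.integers(n)                              # uniform random index
        x_sgd = x_sgd - gamma * partial_grad(x_sgd, i)   # SGD step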

  15. Parallel synchronous gradient descent Computation of the gradient is distributed among k workers. • Workers can be different computers, CPUs or GPUs. • Popular frameworks: Spark, TensorFlow, PyTorch, Hadoop. 8/32

  17. Parallel synchronous gradient descent 1. Choose n_1, …, n_k that sum to n. 2. Distribute the computation of ∇f(x) among k nodes: ∇f(x) = (1/n) ∑_{i=1}^n ∇f_i(x) = (1/n) ( ∑_{i=1}^{n_1} ∇f_i(x) + … + ∑_{i=n−n_k+1}^{n} ∇f_i(x) ), where the first sum is done by worker 1 and the last by worker k. 3. Perform the gradient descent update on a master node: x⁺ = x − γ∇f(x). + Trivial parallelization, same analysis as gradient descent. − Synchronization step every iteration (step 3). 9/32
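A possible sketch of this scheme, with a Python process pool standing in for the k nodes: each worker sums the gradients over its own block of samples, and the master averages the partial sums and takes the step. The least-squares gradients, equal-sized blocks and step size are illustrative assumptions.

    import numpy as np
    from multiprocessing import Pool

    n, d, k = 1000, 20, 4
    rng = np.random.default_rng(0)
    A = rng.standard_normal((n, d))   # made-up data
    b = rng.standard_normal(n)
    gamma = 0.5

    def block_grad(args):
        # sum of grad f_i(x) over one worker's block of indices [lo, hi)
        lo, hi, x = args
        return A[lo:hi].T @ (A[lo:hi] @ x - b[lo:hi])

    if __name__ == "__main__":
        x = np.zeros(d)
        bounds = np.linspace(0, n, k + 1, dtype=int)   # block boundaries (equal blocks here)
        with Pool(processes=k) as pool:
            for _ in range(50):
                jobs = [(bounds[j], bounds[j + 1], x) for j in range(k)]
                partial_sums = pool.map(block_grad, jobs)   # done by workers 1, ..., k
                grad = sum(partial_sums) / n                # reassemble the full gradient
                x = x - gamma * grad                        # master node performs the update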

  19. Parallel synchronous SGD Can also be extended to stochastic gradient descent. 1. Select k samples i_1, …, i_k uniformly at random. 2. Compute ∇f_{i_t} in parallel on worker t. 3. Perform the (mini-batch) stochastic gradient descent update x⁺ = x − γ (1/k) ∑_{t=1}^k ∇f_{i_t}(x). + Trivial parallelization, same analysis as (mini-batch) stochastic gradient descent. + This is the kind of parallelization implemented in deep learning libraries (TensorFlow, PyTorch, Theano, etc.). − Synchronization step every iteration (step 3). 10/32
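A corresponding sketch of the stochastic variant, with a thread pool standing in for the k workers; as before, the least-squares objective, data and step size are made up for illustration.

    import numpy as np
    from concurrent.futures import ThreadPoolExecutor

    n, d, k = 1000, 20, 8
    rng = np.random.default_rng(0)
    A = rng.standard_normal((n, d))   # made-up data
    b = rng.standard_normal(n)
    gamma = 0.01
    x = np.zeros(d)

    def partial_grad(i):
        # grad f_i(x) for f_i(x) = 0.5 * (a_i^T x - b_i)^2
        return A[i] * (A[i] @ x - b[i])

    with ThreadPoolExecutor(max_workers=k) as pool:
        for _ in range(500):
            samples = rng.integers(n, size=k)              # i_1, ..., i_k drawn uniformly
            grads = list(pool.map(partial_grad, samples))  # worker t computes grad f_{i_t}(x)
            x = x - gamma * sum(grads) / k                 # synchronized mini-batch update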

  20. Asynchronous algorithms

  22. Asynchronous SGD Synchronization is the bottleneck. What if we just ignore it? Hogwild (Niu et al. 2011): each core runs SGD in parallel, without synchronization, and updates the same vector of coefficients. In theory: convergence under very strong assumptions. In practice: it just works. 11/32

  23. Hogwild in more detail Each core follows the same procedure: 1. Read the current iterate x̂ from shared memory. 2. Sample i ∈ {1, …, n} uniformly at random. 3. Compute the partial gradient ∇f_i(x̂). 4. Write the SGD update to shared memory: x = x − γ∇f_i(x̂). 12/32
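A minimal Hogwild-style sketch of these four steps, using Python threads that share a single NumPy coefficient vector and update it without any lock. This is only a toy illustration on made-up least-squares data: in CPython the GIL serializes most of the work, so real Hogwild implementations use native threads (C/C++/Cython).

    import numpy as np
    from threading import Thread

    n, d = 1000, 20
    rng = np.random.default_rng(0)
    A = rng.standard_normal((n, d))   # made-up data
    b = rng.standard_normal(n)
    gamma = 0.001
    x_shared = np.zeros(d)            # vector of coefficients in shared memory

    def hogwild_worker(x, n_updates, seed):
        local_rng = np.random.default_rng(seed)
        for _ in range(n_updates):
            x_hat = x.copy()                      # 1. read the shared iterate (no lock)
            i = local_rng.integers(n)             # 2. sample i uniformly at random
            g = A[i] * (A[i] @ x_hat - b[i])      # 3. partial gradient grad f_i(x_hat)
            x -= gamma * g                        # 4. write the update back, still no lock

    threads = [Thread(target=hogwild_worker, args=(x_shared, 2500, s)) for s in range(4)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()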

  24. Hogwild is fast Hogwild can be very fast. But it's still SGD... • With a constant step size, it bounces around the optimum. • With a decreasing step size, convergence is slow. • There are better alternatives (Emilie already mentioned some). 13/32

  25. Looking for excitement? ... analyze asynchronous methods!

  26. Analysis of asynchronous methods Simple things become counter-intuitive, e.g., how to name the iterates? The iterates will change depending on the speed of the processors. 14/32

  27. Naming scheme in Hogwild Simple, intuitive and wrong: each time a core has finished writing to shared memory, increment the iteration counter, i.e., x̂_t ⇔ (t+1)-th successful update to shared memory. The values of x̂_t and i_t are not determined until the iteration has finished ⇒ x̂_t and i_t are not necessarily independent. 15/32

  28. Unbiased gradient estimate SGD-like algorithms crucially rely on the unbiasedness property E_i[∇f_i(x)] = ∇f(x). For synchronous algorithms, this follows from the uniform sampling of i: E_i[∇f_i(x)] = ∑_{i=1}^n Prob(selecting i) ∇f_i(x), which under uniform sampling equals (1/n) ∑_{i=1}^n ∇f_i(x) = ∇f(x). 16/32
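A quick numerical check of this identity on made-up least-squares data: averaging ∇f_i(x) over many uniform draws of i approaches ∇f(x).

    import numpy as np

    rng = np.random.default_rng(0)
    n, d = 50, 5
    A = rng.standard_normal((n, d))   # made-up data
    b = rng.standard_normal(n)
    x = rng.standard_normal(d)

    full_grad = A.T @ (A @ x - b) / n                        # grad f(x)
    idx = rng.integers(n, size=200_000)                      # uniform draws of i
    stoch_grads = A[idx] * (A[idx] @ x - b[idx])[:, None]    # grad f_i(x) for each draw
    print(np.linalg.norm(stoch_grads.mean(axis=0) - full_grad))  # small; shrinks as the sample grows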

  30. A problematic example Illustration: a problem with two samples and two cores, f = ½(f₁ + f₂), where computing ∇f₁ is much more expensive than ∇f₂. Start at x₀. Because of the random sampling there are 4 possible scenarios: 1. Core 1 selects f₁, Core 2 selects f₁: x₁ = x₀ − γ∇f₁(x₀). 2. Core 1 selects f₁, Core 2 selects f₂: x₁ = x₀ − γ∇f₂(x₀). 3. Core 1 selects f₂, Core 2 selects f₁: x₁ = x₀ − γ∇f₂(x₀). 4. Core 1 selects f₂, Core 2 selects f₂: x₁ = x₀ − γ∇f₂(x₀). So we have E[∇f_i(x₀)] = ¼∇f₁(x₀) + ¾∇f₂(x₀) ≠ ½∇f₁(x₀) + ½∇f₂(x₀) = ∇f(x₀). This labeling scheme is incompatible with the unbiasedness assumption used in the proofs. 17/32
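A small simulation of this example (the gradients and the "compute time" rule are made up: we simply encode that ∇f₂ always finishes before ∇f₁). Drawing the two cores' choices at random and recording which gradient produces the first write reproduces the biased expectation ¼∇f₁ + ¾∇f₂ rather than the full gradient ½(∇f₁ + ∇f₂).

    import numpy as np

    rng = np.random.default_rng(0)
    g1 = np.array([1.0, 0.0])   # stand-in for grad f_1(x_0)
    g2 = np.array([0.0, 1.0])   # stand-in for grad f_2(x_0)

    first_grads = []
    for _ in range(100_000):
        picks = rng.integers(1, 3, size=2)   # each core picks f_1 or f_2 uniformly
        # f_2 always finishes first, so the first write uses grad f_1 only if both cores picked f_1
        first_grads.append(g1 if (picks == 1).all() else g2)

    print(np.mean(first_grads, axis=0))   # ~ [0.25, 0.75] = 1/4 grad f_1 + 3/4 grad f_2
    print((g1 + g2) / 2)                  #   [0.5,  0.5 ] = grad f(x_0): the first update is biased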
