  1. Parallel Optimization in Machine Learning Fabian Pedregosa December 19, 2017 Huawei Paris Research Center

  2. About me • Engineer (2010-2012), Inria Saclay (scikit-learn kickstart). • PhD (2012-2015), Inria Saclay. • Postdoc (2015-2016), Dauphine–ENS–Inria Paris. • Postdoc (2017-present), UC Berkeley - ETH Zurich (Marie-Curie fellowship, European Commission) Hacker at heart ... trapped in a researcher’s body. 1/32

  5. Motivation Computer ad in 1993 vs. computer ad in 2006. What has changed? The 2006 ad no longer mentions processor speed; the primary advertised feature is the number of cores. 2/32

  9. 40 years of CPU trends • The speed of CPUs has stagnated since 2005. • Multi-core architectures are here to stay. Parallel algorithms are needed to take advantage of modern CPUs. 3/32

  10. Parallel optimization Parallel algorithms can be divided into two large categories: synchronous and asynchronous. Image credits: (Peng et al. 2016). Synchronous methods: easy to implement (well-developed software packages) and well understood, but the speedup is limited by synchronization costs. Asynchronous methods: faster, with typically larger speedups, but not well understood (large gap between theory and practice) and without mature software solutions. 4/32

  11. Outline Synchronous methods • Synchronous (stochastic) gradient descent. Asynchronous methods • Asynchronous stochastic gradient descent (Hogwild) (Niu et al. 2011) • Asynchronous variance-reduced stochastic methods (Leblond, P., and Lacoste-Julien 2017), (Pedregosa, Leblond, and Lacoste-Julien 2017). • Analysis of asynchronous methods. • Code and implementation aspects. Leaving out many parallel synchronous methods: ADMM (Glowinski and Marroco 1975), CoCoA (Jaggi et al. 2014), DANE (Shamir, Srebro, and Zhang 2014), to name a few. 5/32

  12. Outline Most of the following is joint work with Rémi Leblond and Simon Lacoste-Julien Rémi Leblond Simon Lacoste–Julien 6/32

  13. Synchronous algorithms

  14. Optimization for machine learning A large part of the problems in machine learning can be framed as optimization problems of the form minimize_x f(x) := (1/n) ∑_{i=1}^n f_i(x). Gradient descent (Cauchy 1847): descend along the steepest direction −∇f(x): x⁺ = x − γ∇f(x). Stochastic gradient descent (SGD) (Robbins and Monro 1951): select a random index i and descend along −∇f_i(x): x⁺ = x − γ∇f_i(x). Image source: Francis Bach. 7/32
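As a concrete illustration of the two update rules above, here is a minimal NumPy sketch on a made-up least-squares problem with f_i(x) = ½(aᵢᵀx − bᵢ)²; the data, step size γ and iteration count are illustrative assumptions, not part of the slides.

    import numpy as np

    rng = np.random.default_rng(0)
    n, d = 100, 10
    A = rng.standard_normal((n, d))   # rows a_i (made-up data)
    b = rng.standard_normal(n)
    gamma = 0.01                      # step size

    def full_grad(x):
        # gradient of f(x) = (1/n) sum_i 0.5 * (a_i^T x - b_i)^2
        return A.T @ (A @ x - b) / n

    def partial_grad(x, i):
        # gradient of the single term f_i(x)
        return A[i] * (A[i] @ x - b[i])

    x_gd, x_sgd = np.zeros(d), np.zeros(d)
    for t in range(1000):
        x_gd = x_gd - gamma * full_grad(x_gd)            # gradient descent step
        i = rng.integers(n)                              # uniform random index
        x_sgd = x_sgd - gamma * partial_grad(x_sgd, i)   # SGD step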

  15. Parallel synchronous gradient descent Computation of the gradient is distributed among k workers. • Workers can be different computers, CPUs or GPUs. • Popular frameworks: Spark, TensorFlow, PyTorch, Hadoop. 8/32

  17. Parallel synchronous gradient descent 1. Choose n_1, …, n_k that sum to n. 2. Distribute the computation of ∇f(x) among k nodes: ∇f(x) = (1/n) ∑_{i=1}^n ∇f_i(x) = (1/n) ( ∑_{i=1}^{n_1} ∇f_i(x) + … + ∑_{i=n−n_k+1}^{n} ∇f_i(x) ), where the first sum is done by worker 1 and the last by worker k. 3. Perform the gradient descent update on a master node: x⁺ = x − γ∇f(x). + Trivial parallelization, same analysis as gradient descent. − Synchronization step every iteration (step 3). 9/32
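A possible sketch of this scheme, with a Python process pool standing in for the k nodes: each worker sums the gradients over its own block of samples, and the master averages the partial sums and takes the step. The least-squares gradients, equal-sized blocks and step size are illustrative assumptions.

    import numpy as np
    from multiprocessing import Pool

    n, d, k = 1000, 20, 4
    rng = np.random.default_rng(0)
    A = rng.standard_normal((n, d))   # made-up data
    b = rng.standard_normal(n)
    gamma = 0.5

    def block_grad(args):
        # sum of grad f_i(x) over one worker's block of indices [lo, hi)
        lo, hi, x = args
        return A[lo:hi].T @ (A[lo:hi] @ x - b[lo:hi])

    if __name__ == "__main__":
        x = np.zeros(d)
        bounds = np.linspace(0, n, k + 1, dtype=int)   # block boundaries (equal blocks here)
        with Pool(processes=k) as pool:
            for _ in range(50):
                jobs = [(bounds[j], bounds[j + 1], x) for j in range(k)]
                partial_sums = pool.map(block_grad, jobs)   # done by workers 1, ..., k
                grad = sum(partial_sums) / n                # reassemble the full gradient
                x = x - gamma * grad                        # master node performs the update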

  19. Parallel synchronous SGD Can also be extended to stochastic gradient descent. 1. Select k samples i_1, …, i_k uniformly at random. 2. Compute ∇f_{i_t} in parallel on worker t. 3. Perform the (mini-batch) stochastic gradient descent update x⁺ = x − γ (1/k) ∑_{t=1}^k ∇f_{i_t}(x). + Trivial parallelization, same analysis as (mini-batch) stochastic gradient descent. + This is the kind of parallelization implemented in deep learning libraries (TensorFlow, PyTorch, Theano, etc.). − Synchronization step every iteration (step 3). 10/32
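A corresponding sketch of the stochastic variant, with a thread pool standing in for the k workers; as before, the least-squares objective, data and step size are made up for illustration.

    import numpy as np
    from concurrent.futures import ThreadPoolExecutor

    n, d, k = 1000, 20, 8
    rng = np.random.default_rng(0)
    A = rng.standard_normal((n, d))   # made-up data
    b = rng.standard_normal(n)
    gamma = 0.01
    x = np.zeros(d)

    def partial_grad(i):
        # grad f_i(x) for f_i(x) = 0.5 * (a_i^T x - b_i)^2
        return A[i] * (A[i] @ x - b[i])

    with ThreadPoolExecutor(max_workers=k) as pool:
        for _ in range(500):
            samples = rng.integers(n, size=k)              # i_1, ..., i_k drawn uniformly
            grads = list(pool.map(partial_grad, samples))  # worker t computes grad f_{i_t}(x)
            x = x - gamma * sum(grads) / k                 # synchronized mini-batch update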

  20. Asynchronous algorithms

  22. Asynchronous SGD Synchronization is the bottleneck. What if we just ignore it? Hogwild (Niu et al. 2011): each core runs SGD in parallel, without synchronization, and updates the same vector of coefficients. In theory: convergence under very strong assumptions. In practice: it just works. 11/32

  23. Hogwild in more detail Each core follows the same procedure: 1. Read the current iterate x̂ from shared memory. 2. Sample i ∈ {1, …, n} uniformly at random. 3. Compute the partial gradient ∇f_i(x̂). 4. Write the SGD update to shared memory: x = x − γ∇f_i(x̂). 12/32
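A minimal Hogwild-style sketch of these four steps, using Python threads that share a single NumPy coefficient vector and update it without any lock. This is only a toy illustration on made-up least-squares data: in CPython the GIL serializes most of the work, so real Hogwild implementations use native threads (C/C++/Cython).

    import numpy as np
    from threading import Thread

    n, d = 1000, 20
    rng = np.random.default_rng(0)
    A = rng.standard_normal((n, d))   # made-up data
    b = rng.standard_normal(n)
    gamma = 0.001
    x_shared = np.zeros(d)            # vector of coefficients in shared memory

    def hogwild_worker(x, n_updates, seed):
        local_rng = np.random.default_rng(seed)
        for _ in range(n_updates):
            x_hat = x.copy()                      # 1. read the shared iterate (no lock)
            i = local_rng.integers(n)             # 2. sample i uniformly at random
            g = A[i] * (A[i] @ x_hat - b[i])      # 3. partial gradient grad f_i(x_hat)
            x -= gamma * g                        # 4. write the update back, still no lock

    threads = [Thread(target=hogwild_worker, args=(x_shared, 2500, s)) for s in range(4)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()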

  24. Hogwild is fast Hogwild can be very fast. But it's still SGD... • With a constant step size, it bounces around the optimum. • With a decreasing step size, convergence is slow. • There are better alternatives (Emilie already mentioned some). 13/32

  25. Looking for excitement? ... analyze asynchronous methods!

  26. Analysis of asynchronous methods Simple things become counter-intuitive, e.g., how to name the iterates? The iterates will change depending on the speed of the processors. 14/32

  27. Naming scheme in Hogwild Simple, intuitive and wrong: each time a core has finished writing to shared memory, increment the iteration counter, i.e., x̂_t ⇔ (t+1)-th successful update to shared memory. The values of x̂_t and i_t are not determined until the iteration has finished ⇒ x̂_t and i_t are not necessarily independent. 15/32

  28. Unbiased gradient estimate SGD-like algorithms crucially rely on the unbiasedness property E_i[∇f_i(x)] = ∇f(x). For synchronous algorithms, this follows from the uniform sampling of i: E_i[∇f_i(x)] = ∑_{i=1}^n Prob(selecting i) ∇f_i(x), which under uniform sampling equals (1/n) ∑_{i=1}^n ∇f_i(x) = ∇f(x). 16/32
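A quick numerical check of this identity on made-up least-squares data: averaging ∇f_i(x) over many uniform draws of i approaches ∇f(x).

    import numpy as np

    rng = np.random.default_rng(0)
    n, d = 50, 5
    A = rng.standard_normal((n, d))   # made-up data
    b = rng.standard_normal(n)
    x = rng.standard_normal(d)

    full_grad = A.T @ (A @ x - b) / n                        # grad f(x)
    idx = rng.integers(n, size=200_000)                      # uniform draws of i
    stoch_grads = A[idx] * (A[idx] @ x - b[idx])[:, None]    # grad f_i(x) for each draw
    print(np.linalg.norm(stoch_grads.mean(axis=0) - full_grad))  # small; shrinks as the sample grows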

  30. A problematic example Illustration: a problem with two samples and two cores, f = ½(f₁ + f₂), where computing ∇f₁ is much more expensive than ∇f₂. Start at x₀. Because of the random sampling there are 4 possible scenarios: 1. Core 1 selects f₁, Core 2 selects f₁: x₁ = x₀ − γ∇f₁(x₀). 2. Core 1 selects f₁, Core 2 selects f₂: x₁ = x₀ − γ∇f₂(x₀). 3. Core 1 selects f₂, Core 2 selects f₁: x₁ = x₀ − γ∇f₂(x₀). 4. Core 1 selects f₂, Core 2 selects f₂: x₁ = x₀ − γ∇f₂(x₀). So we have E[∇f_i(x₀)] = ¼∇f₁(x₀) + ¾∇f₂(x₀) ≠ ½∇f₁(x₀) + ½∇f₂(x₀) = ∇f(x₀). This labeling scheme is incompatible with the unbiasedness assumption used in the proofs. 17/32
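A small simulation of this example (the gradients and the "compute time" rule are made up: we simply encode that ∇f₂ always finishes before ∇f₁). Drawing the two cores' choices at random and recording which gradient produces the first write reproduces the biased expectation ¼∇f₁ + ¾∇f₂ rather than the full gradient ½(∇f₁ + ∇f₂).

    import numpy as np

    rng = np.random.default_rng(0)
    g1 = np.array([1.0, 0.0])   # stand-in for grad f_1(x_0)
    g2 = np.array([0.0, 1.0])   # stand-in for grad f_2(x_0)

    first_grads = []
    for _ in range(100_000):
        picks = rng.integers(1, 3, size=2)   # each core picks f_1 or f_2 uniformly
        # f_2 always finishes first, so the first write uses grad f_1 only if both cores picked f_1
        first_grads.append(g1 if (picks == 1).all() else g2)

    print(np.mean(first_grads, axis=0))   # ~ [0.25, 0.75] = 1/4 grad f_1 + 3/4 grad f_2
    print((g1 + g2) / 2)                  #   [0.5,  0.5 ] = grad f(x_0): the first update is biased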
