Towards More Efficient Distributed Machine Learning
Jialei Wang, University of Chicago
ISE, NCSU, 12/13/2017
The empirical success of machine learning
[Figure: three drivers — Big Data, Advanced Modeling, Massive Computing]
My research
[Mind map of research topics around Efficient ML: Distributed (Variance reduction, Dual Alternating, Minibatch Prox, Primal-dual methods, Opt&Sketch, Sparsity, Sketching) and Applications (Portfolio Opt, Confidence-Weighted, Online, Collaborative ranking, Budget OGD, Cloud removal, Cost-sensitive)]
This talk
[Same research mind map, highlighting the topics covered in this talk]
Motivation for Distributed Learning
Data Size
§ Data cannot be stored or processed on a single machine.
§ Use distributed computing to handle big data sets.
§ Example: click-through rate prediction.
Motivation for Distributed Learning
Data Collection
§ Data are naturally distributed on different machines.
§ Use distributed computing to learn from decentralized data.
§ Example: Google's federated learning problem.
Challenges in Distributed Learning
Efficiency in multiple dimensions
§ Sample: sample complexity matches the centralized solution.
§ Computation: floating-point operations.
§ Communication: bandwidth (number of bits transmitted) + latency (rounds of communication).
§ Memory, etc.
Typical cost hierarchy: latency ≫ bandwidth ≫ FLOPS.
Learning as Optimization
Stochastic Optimization Problems
$$\min_{w \in \Omega} F(w) := \mathbb{E}_{z \sim \mathcal{D}}[\ell(w, z)].$$
[Figures: a simple regression fit; a feed-forward neural network with input, hidden, and output layers]
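To make this formulation concrete, here is a tiny numerical illustration (not from the slides): the population objective F(w) is estimated by an empirical average of the loss over samples drawn from D. The squared loss and the synthetic data-generating distribution below are assumptions made only for this example.

```python
import numpy as np

# Hypothetical data-generating distribution D: y = <x, w_true> + noise.
rng = np.random.default_rng(0)
d, n = 10, 5000
w_true = rng.normal(size=d)

def sample_z(n):
    """Draw n i.i.d. samples z = (x, y) from D."""
    x = rng.normal(size=(n, d))
    y = x @ w_true + 0.1 * rng.normal(size=n)
    return x, y

def empirical_risk(w, x, y):
    """Average squared loss (1/n) sum_j l(w, z_j), an estimate of F(w)."""
    return 0.5 * np.mean((x @ w - y) ** 2)

x, y = sample_z(n)
print(empirical_risk(w_true, x, y))        # close to the noise level
print(empirical_risk(np.zeros(d), x, y))   # much larger
```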
Distributed Optimization for Learning
Reduction from (Distributed) Learning to Optimization
§ $m$ machines; each machine collects $n$ data instances $\{z_{ij}\}_{j=1}^n$.
§ Global objective: $\min_w f(w) := \frac{1}{m}\sum_{i=1}^m \left( \frac{1}{n}\sum_{j=1}^n \ell(w, z_{ij}) \right)$.
§ Distributed consensus: $f_i(w) := \frac{1}{n}\sum_{j=1}^n \ell(w, z_{ij})$, $f(w) := \frac{1}{m}\sum_{i=1}^m f_i(w)$.
[Figure: five machines holding local objectives $f_1(w), \ldots, f_5(w)$]
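A small sanity-check sketch of this decomposition (assuming squared loss and equal-sized random splits, which are illustrative choices): the global objective is the average of the local objectives and coincides with the objective computed on the pooled data.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n, d = 5, 1000, 10          # m machines, n examples per machine (illustrative sizes)
w = rng.normal(size=d)

# Each machine i holds its own data {z_ij}_{j=1}^n = (X_i, y_i).
machines = [(rng.normal(size=(n, d)), rng.normal(size=n)) for _ in range(m)]

def f_i(w, Xy):
    """Local objective on one machine: (1/n) sum_j l(w, z_ij) with squared loss."""
    X, y = Xy
    return 0.5 * np.mean((X @ w - y) ** 2)

# Global objective f(w) = (1/m) sum_i f_i(w): average of local objectives.
f = np.mean([f_i(w, Xy) for Xy in machines])

# Same value as pooling all mn examples on one machine.
X_all = np.vstack([X for X, _ in machines])
y_all = np.concatenate([y for _, y in machines])
assert np.isclose(f, 0.5 * np.mean((X_all @ w - y_all) ** 2))
```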
Distributed Optimization for Learning
What's special about Machine Learning?
§ Learning cares about the population objective $F(w) = \mathbb{E}_{z \sim \mathcal{D}}[\ell(w, z)]$.
§ Stochastic nature of the data: the local objectives $f_i(w)$ are related.
[Figure: each machine $i$ holds a sample $\{z_{ij}\}_{j=1}^n \sim \mathcal{D}$ drawn i.i.d. from the same distribution, so the local empirical objectives $\frac{1}{n}\sum_j \ell(w, z_{ij})$ are closely related]
How can we exploit the similarity/relatedness between machines when designing distributed learning algorithms?
This talk: two specific problems
1. How to efficiently learn sparse linear predictors in a distributed environment?
2. How to parallelize stochastic gradient descent (SGD)?
Efficient Distributed Learning with Sparsity
International Conference on Machine Learning (ICML), 2017.
Joint work with Mladen Kolar, Nathan Srebro, and Tong Zhang.
High-level Overview
Problem: Efficient Distributed Sparse Learning with Optimal Statistical Accuracy.
Sparse Learning in High Dimension
§ On a single machine, use classical methods such as the Lasso.
§ Statistical accuracy versus computation.
Distributed Learning with Big Data
§ Data are distributed on multiple machines.
§ Statistical accuracy versus computation and communication (together: efficiency).
High-dimensional Sparse Model
Number of variables ($p$) is often very large.
[Example figure: predicting a phenotype value (e.g. 2.5) from genotype sequences such as ...GTGCATCTGACTCCTGAGGAGTAG... and ...CACGTAGACTGAGGACTCCTCATC...]
Sparsity
§ Only a few variables are predictive.
§ $w^* = \arg\min_w \mathbb{E}_{x, y \sim \mathcal{D}}[\ell(y, \langle x, w \rangle)]$.
§ $S := \mathrm{support}(w^*) = \{ j \in [p] \mid w^*_j \neq 0 \}$ and $s = |S| \ll p$.
ℓ1 regularization (Tibshirani, 1996; Chen et al., 1998)
§ Statistical accuracy: good statistical properties.
§ Computational efficiency: convex surrogate of ℓ0.
Sparse Regression
Statistical Model
§ $y = \langle x, w^* \rangle + \text{noise}$.
Centralized ℓ1 regularization
$$\hat{w}_{\text{cent}} = \arg\min_w \frac{1}{mn} \sum_{i=1}^m \sum_{j=1}^n \ell(y_{ij}, \langle x_{ij}, w \rangle) + \lambda \|w\|_1.$$
(Optimal) statistical accuracy: $\|\hat{w}_{\text{cent}} - w^*\|_2 = O\!\left( \sqrt{\frac{s \log p}{mn}} \right)$.
Efficient method achieving optimal statistical accuracy?
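As a minimal sketch of the centralized estimator above (assuming squared loss and synthetic Gaussian data, neither of which is fixed by the slide), one can fit the pooled lasso with scikit-learn and check that the error is small and the support is roughly recovered; the λ scale below is the usual √(log p / (mn)) choice.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(2)
m, n, p, s = 10, 200, 500, 5                 # mn samples, p features, s-sparse w*
w_star = np.zeros(p)
w_star[:s] = 1.0

X = rng.normal(size=(m * n, p))
y = X @ w_star + 0.5 * rng.normal(size=m * n)

# Centralized lasso on the pooled data; lambda ~ sqrt(log p / (mn)).
lam = np.sqrt(np.log(p) / (m * n))
w_cent = Lasso(alpha=lam, fit_intercept=False).fit(X, y).coef_

print(np.linalg.norm(w_cent - w_star))       # should scale like sqrt(s log p / (mn))
print(np.nonzero(w_cent)[0][:10])            # support roughly recovers {0, ..., s-1}
```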
This work
A communication- and computation-efficient approach.

Regime $n \gtrsim m s^2 \log p$ (to achieve optimal statistical accuracy):

    Approach      Communication      Computation
    Centralize    n · p              T_lasso(mn, p)
    Avg-Debias    p                  p · T_lasso(n, p)
    This work     p                  2 · T_lasso(n, p)

Regime $m s^2 \log p \gtrsim n \gtrsim s^2 \log p$:

    Approach      Communication      Computation
    Centralize    n · p              T_lasso(mn, p)
    Avg-Debias    ✗                  ✗
    This work     log m · p          log m · T_lasso(n, p)

T_lasso(n, p): runtime for solving a lasso problem of size n × p.
The Proposed Approach
Step 0: Local ℓ1-Regularized Problem
Solve $\hat{w}_1 = \arg\min_w f_1(w) + \lambda_1 \|w\|_1$ (machine 1's local objective).

Steps 1, ..., t: Shifted ℓ1-Regularized Problem
Communicate $\hat{w}_t$ and the local gradients.
[Figure: each machine $i$ evaluates its local gradient $\nabla f_i(\hat{w}_t)$ at the current iterate]
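The slides only sketch the communication pattern (broadcast the current iterate, gather local gradients). Below is a minimal single-process simulation of one natural reading of this scheme, assuming squared loss: at each step, machine 1 re-solves an ℓ1-regularized problem whose smooth part is shifted by the difference between its local gradient and the averaged global gradient, so its local curvature stands in for the global one while only one vector per machine is communicated per round. The exact shifted objective, the λ_t schedule, and all problem sizes are illustrative assumptions, not taken verbatim from the slides.

```python
import numpy as np

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def solve_shifted_lasso(X, y, shift, lam, w0, iters=500):
    """ISTA for: min_w  0.5/n * ||X w - y||^2  -  <shift, w>  +  lam * ||w||_1."""
    n = X.shape[0]
    L = np.linalg.norm(X, 2) ** 2 / n      # Lipschitz constant of the smooth part
    w = w0.copy()
    for _ in range(iters):
        grad = X.T @ (X @ w - y) / n - shift
        w = soft_threshold(w - grad / L, lam / L)
    return w

def local_grad(w, X, y):
    """Gradient of the local objective f_i(w) = 0.5/n * ||X w - y||^2."""
    return X.T @ (X @ w - y) / X.shape[0]

# Synthetic split data: m machines with n examples each (sizes are illustrative).
rng = np.random.default_rng(3)
m, n, p, s = 5, 200, 300, 5
w_star = np.zeros(p)
w_star[:s] = 1.0
data = []
for _ in range(m):
    X = rng.normal(size=(n, p))
    data.append((X, X @ w_star + 0.5 * rng.normal(size=n)))

# Step 0: machine 1 solves a purely local l1-regularized problem.
X1, y1 = data[0]
lam = np.sqrt(np.log(p) / n)
w_hat = solve_shifted_lasso(X1, y1, np.zeros(p), lam, np.zeros(p))

# Steps 1..t: broadcast w_hat, gather local gradients (one communication round
# per step), and re-solve on machine 1 with a gradient-correction shift.
for t in range(3):
    grads = [local_grad(w_hat, X, y) for X, y in data]          # communicated
    shift = local_grad(w_hat, X1, y1) - np.mean(grads, axis=0)  # local minus global gradient
    lam_t = lam / (2 ** (t + 1))                                # illustrative schedule
    w_hat = solve_shifted_lasso(X1, y1, shift, lam_t, w_hat)
    print(t, np.linalg.norm(w_hat - w_star))
```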