
Manifold Identification for Ultimately Communication-Efficient Distributed Optimization - PowerPoint PPT Presentation



  1. Manifold Identification for Ultimately Communication-Efficient Distributed Optimization
     Yu-Sheng Li
     Joint work with Wei-Lin Chiang (NTU) and Ching-pei Lee (NUS)

  2. Outline
     ◮ Overview
     ◮ Empirical Risk Minimization
     ◮ The Proposed Algorithm
     ◮ Experiments

  3-4. Distributed Machine Learning
     (Latency Numbers Every Programmer Should Know ¹)
       Read 1 MB sequentially from memory        3 µs
       Read 1 MB sequentially from network      22 µs
       Read 1 MB sequentially from disk (SSD)   49 µs
       Round trip in the same datacenter       500 µs
     ◮ Inter-machine communication may be more time-consuming than local computation within a machine
       (a back-of-the-envelope calculation follows below)
     ◮ Comm. cost = (# comm. rounds) × (bytes communicated per round)
     ¹ Originally by Jeff Dean in 2010, updated by Colin Scott at https://colin-scott.github.io/personal_website/research/interactive_latency.html
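To put the per-round payload in perspective, here is a back-of-the-envelope calculation; it assumes 8-byte float64 entries and borrows d from the news20 example later in the deck, so the number is only illustrative:

```python
# Rough per-round communication for synchronizing one dense d-dimensional vector,
# assuming 8-byte (float64) entries; d is the news20 dimensionality quoted later
# in this deck.  Illustrative only.
d = 1_355_191                      # number of features in news20
bytes_per_round = d * 8            # one dense float64 vector
print(f"~{bytes_per_round / 1e6:.1f} MB per machine per round")   # ~10.8 MB
```

Shrinking either factor of the product, the number of rounds or the bytes per round, reduces the total cost; the two contributions listed later each target one factor.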

  5-7. Sparsity-inducing Regularization
     ◮ To avoid overfitting and to force some desired structure on the solution, a sparsity-inducing regularizer is usually introduced
     ◮ Example: ℓ2- vs. ℓ1-regularized logistic regression on news20 (a reproduction sketch follows after the table)

       Regularizer      Relative reg. strength   Nonzeros in solution   Test accuracy
       ℓ2-regularized   2^0                      1,355,191 (100%)       99.7449%
                        2^10                     1,355,191 (100%)       97.0044%
       ℓ1-regularized   2^0                         67,071 (4.95%)      99.7499%
                        2^2                         42,020 (3.10%)      99.7499%
                        2^4                         14,524 (1.07%)      99.7449%
                        2^6                          5,432 (0.40%)      99.6749%
                        2^8                          1,472 (0.11%)      97.3495%
                        2^10                           546 (0.04%)      92.8936%
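A comparison of this kind can be approximated with off-the-shelf tools. A minimal sketch using scikit-learn, where the synthetic dataset and the C grid are stand-ins for news20 and the 2^k regularization sweep in the table above:

```python
# Sketch: compare l2- vs l1-regularized logistic regression in terms of
# sparsity and accuracy.  Synthetic data stands in for news20; C is the
# inverse regularization strength (smaller C = stronger regularization).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=500,
                           n_informative=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for penalty in ("l2", "l1"):
    for C in (1.0, 0.1, 0.01):
        clf = LogisticRegression(penalty=penalty, C=C, solver="liblinear")
        clf.fit(X_tr, y_tr)
        nnz = np.count_nonzero(clf.coef_)
        print(f"{penalty}, C={C}: {nnz}/{X.shape[1]} nonzeros, "
              f"test acc = {clf.score(X_te, y_te):.4f}")
```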

  8-10. Our contributions
     Recall: Comm. cost = (# comm. rounds) × (bytes communicated per round)
     ◮ Focusing on a small subproblem ⇒ fewer bytes to communicate
     ◮ Acceleration by smooth optimization in the correct manifold ⇒ fewer rounds of communication

  11. Results (ours: MADPQN)
     [Figure: convergence plots on news20, epsilon, and webspam comparing MADPQN with OWLQN, L-COMM, and DPLBFGS.
      y-axis: relative distance to the optimal value (log-scaled);
      x-axis: communication cost in d bytes (upper row) and training time in seconds (lower row).]

  12-13. Outline
     ◮ Overview
     ◮ Empirical Risk Minimization
     ◮ The Proposed Algorithm
     ◮ Experiments

  14-15. Distributed Empirical Risk Minimization (ERM)
     ◮ Train a model by minimizing a function that measures the performance on the training data:
         \arg\min_{w \in \mathbb{R}^d} f(w) := \sum_{k=1}^{K} f_k(w)
     ◮ There are K machines, and f_k is exclusively available on machine k (a toy sketch of this setup follows below)
     ◮ Synchronize w or ∇f(w) by communication: the communication cost per iteration is O(d)
     ◮ How to reduce the O(d) cost?
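A toy, single-process simulation of this setup, with the K machines represented as data shards held in one list; the two sums below stand in for the allreduce rounds that would each cost O(d) bytes in a real cluster (the random shards and the logistic loss are illustrative choices, not the deck's exact setting):

```python
# Toy simulation of distributed ERM: K "machines" each hold a data shard and
# contribute f_k(w) and its gradient; the Python sums stand in for allreduce rounds.
import numpy as np

rng = np.random.default_rng(0)
K, n_per_machine, d = 4, 250, 100
shards = [(rng.standard_normal((n_per_machine, d)),
           rng.choice([-1.0, 1.0], size=n_per_machine)) for _ in range(K)]

def f_k(w, X, y):                  # local logistic loss on one machine
    return np.sum(np.logaddexp(0.0, -y * (X @ w)))

def grad_f_k(w, X, y):             # local gradient on one machine
    p = 1.0 / (1.0 + np.exp(y * (X @ w)))
    return X.T @ (-y * p)

w = np.zeros(d)
f_val = sum(f_k(w, X, y) for X, y in shards)       # would be one allreduce
grad = sum(grad_f_k(w, X, y) for X, y in shards)   # would be one allreduce of O(d) bytes
print(f_val, np.linalg.norm(grad))
```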

  16-18. Sparsity-inducing Regularizer
     ◮ If w is sparse throughout the training process, we only need to synchronize a shorter vector
     ◮ Regularized ERM:
         \min_{w} f(w) + R(w)
     ◮ An ideal regularization term for forcing sparsity is the ℓ0 norm: ‖w‖₀ = number of nonzeros in w
     ◮ But this norm is not continuous and hence hard to optimize
     ◮ A good surrogate is the ℓ1 norm ‖w‖₁ = Σ_{i=1}^{d} |w_i|
     ◮ Our algorithm works for other partly smooth R, e.g. group-LASSO

  19. The Regularized Problem
     ◮ Now the problem becomes
         \min_{w} f(w) + \|w\|_1,
       which is harder to minimize than f(w) alone since ‖w‖₁ is not differentiable
     ◮ As the gradient may not even exist, gradient descent or Newton's method cannot be applied directly

  20-21. Proximal Quasi-Newton
     ◮ Proximal gradient is a simple algorithm that solves
         \min_{w'} \nabla f(w)^\top (w' - w) + \frac{1}{2\alpha} \|w' - w\|_2^2 + \|w'\|_1,
       where α is the step size for the current iteration (a soft-thresholding sketch follows below)
     ◮ Each calculation of ∇f requires one round of communication
     ◮ To reduce the amount of communication, we include some second-order information: reducing iterations ⇒ reducing rounds of communication
     ◮ Replace the term \|w' - w\|_2^2 / (2\alpha) with (w' - w)^\top H (w' - w) / 2 for some H ≈ ∇²f(w)
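For the ℓ1 regularizer, the proximal-gradient subproblem above has a closed-form solution given by soft thresholding. A minimal sketch, with the step size alpha assumed to come from elsewhere (e.g. a line search):

```python
# One proximal-gradient step for f(w) + ||w||_1: the subproblem
#   min_{w'} grad^T (w' - w) + ||w' - w||_2^2 / (2*alpha) + ||w'||_1
# is solved in closed form by soft-thresholding w - alpha*grad at level alpha.
import numpy as np

def soft_threshold(z, tau):
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

def prox_grad_step(w, grad, alpha):
    return soft_threshold(w - alpha * grad, alpha)
```

With the quasi-Newton quadratic (w' − w)ᵀH(w' − w)/2 in place of the isotropic term, the subproblem no longer has this closed form and must be solved approximately by an inner solver; the sketch covers only the plain proximal-gradient case.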

  22. Outline
     ◮ Overview
     ◮ Empirical Risk Minimization
     ◮ The Proposed Algorithm
     ◮ Experiments

  23-25. Utilizing Sparsity
     ◮ Even if we only update the nonzero entries of w, if we still compute the whole gradient ∇f(w), the communication cost remains O(d)
     ◮ Guess: if w_i = 0 at some iteration and it is likely to stay 0 at the next iteration, it remains 0 at the final solution
     ◮ Then we only solve the subproblem with respect to the coordinates that are likely to be nonzero
     ◮ A progressive shrinking approach: once we guess w_i = 0, we remove that coordinate from the problem in future iterations (a sketch follows below)
     ◮ So the number of nonzeros in w (i.e. ‖w‖₀) gradually decreases
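A rough sketch of the progressive-shrinking bookkeeping; the plain w_i = 0 test below is a simplification, not necessarily the exact rule the algorithm uses to decide which coordinates are likely to stay zero:

```python
# Progressive shrinking sketch: keep a working set of coordinates believed to be
# nonzero, and only compute/communicate gradient entries on that set
# (O(|working_set|) instead of O(d) bytes per round).
import numpy as np

def shrink_working_set(w, working_set):
    # Drop coordinates that have reached zero; they stay frozen at 0 afterwards.
    return working_set[np.abs(w[working_set]) > 0]

d = 10
w = np.array([0.0, 0.3, 0.0, -1.2, 0.0, 0.0, 0.7, 0.0, 0.0, 0.05])
working_set = np.arange(d)                     # start with all coordinates
working_set = shrink_working_set(w, working_set)
print(working_set)                             # [1 3 6 9]: only these are updated and communicated
```

When a stopping criterion on the shrunken problem is met, the double-check described on the next slides restarts with working_set = np.arange(d), so a wrong guess cannot block convergence.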

  26-27. Convergence Issue
     ◮ What if our guess was wrong at some iteration?
     ◮ We need to double-check: when some stopping criterion is met, we restart with all coordinates
     ◮ Training is terminated only when the model can hardly be improved using all coordinates

  28-30. More Acceleration by Smooth Optimization
     ◮ |w_i| becomes twice-differentiable when w_i ≠ 0
     ◮ If the coordinates where w_i ≠ 0 are fixed, the proximal approach is not needed anymore
     ◮ The problem can then be transformed into a smooth one for faster convergence (a sketch follows below)
     ◮ When the nonzero pattern (manifold) does not change for some iterations, it is likely to be the final pattern
     ◮ Example with d = 5: {1, 2, 3, 4, 5}
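A sketch of that transformation under explicit assumptions: the support and signs are taken as already identified, a toy quadratic stands in for the ERM loss f, and SciPy's L-BFGS is only a stand-in smooth solver (the deck's algorithm uses its own quasi-Newton updates rather than SciPy):

```python
# Once the nonzero pattern (support) and signs of w are fixed, ||w||_1 restricted
# to that manifold equals signs^T w_S, so f(w) + ||w||_1 becomes smooth in the
# free coordinates w_S and a generic smooth solver can take over.
import numpy as np
from scipy.optimize import minimize

d = 5
A = np.diag([1.0, 2.0, 3.0, 4.0, 5.0])      # toy quadratic f(w) = 0.5 w^T A w - b^T w
b = np.array([2.0, -2.0, 0.0, 2.0, 0.0])
f = lambda w: 0.5 * w @ A @ w - b @ w
grad_f = lambda w: A @ w - b

support = np.array([0, 1, 3])                # identified manifold: w nonzero only here
signs = np.array([1.0, -1.0, 1.0])           # with these fixed signs

def restricted(wS):
    w = np.zeros(d)
    w[support] = wS
    # On the manifold, ||w||_1 = signs^T w_S, which is linear and hence smooth.
    return f(w) + signs @ wS, grad_f(w)[support] + signs

res = minimize(restricted, x0=np.array([0.5, -0.5, 0.5]), jac=True, method="L-BFGS-B")
print(res.x)                                 # optimal free coordinates, here [1.0, -0.5, 0.25]
```

In the deck's method this smooth phase is paired with the double-check/restart mechanism from the previous slides, so a wrongly identified pattern does not compromise convergence.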
