High-Performance Distributed Machine Learning in Heterogeneous Compute Environments (Celestine Dünner, Martin Jaggi) - PowerPoint PPT Presentation

  1. High-Performance Distributed Machine Learning in Heterogeneous Compute Environments. Celestine Dünner, Martin Jaggi (EPFL); Thomas Parnell, Kubilay Atasu, Manolis Sifalakis, Dimitrios Sarigiannis, Haris Pozidis (IBM Research)

  2. Motivation of Our Work: Distributed Training of Large-Scale Linear Models. Goals: fast training, interpretable models, and training on large-scale datasets.

  3. Motivation of Our Work: Distributed Training of Large-Scale Linear Models. You choose an algorithm, an implementation, and an infrastructure. Question: how do the infrastructure and the implementation impact the performance of the algorithm?

  4. Motivation of Our Work: Distributed Training of Large-Scale Linear Models. Question: how can algorithms be optimized and implemented to achieve optimal performance on a given system?

  5. Algorithmic Challenge of Distributed Learning: solve $\min_{\mathbf{x}} \, g(B^\top \mathbf{x}) + h(\mathbf{x})$ with the data distributed across workers.

  6. Algorithmic Challenge of Distributed Learning: each worker keeps a local model while jointly solving $\min_{\mathbf{x}} \, g(B^\top \mathbf{x}) + h(\mathbf{x})$. Trade-off: the more frequently you exchange information, the faster your model converges, but communication over the network can be very expensive.

  7. The CoCoA Framework [Smith, Jaggi, Takáč, Ma, Forte, Hofmann, Jordan, 2013-2015]: to solve $\min_{\mathbf{x}} \, g(B^\top \mathbf{x}) + h(\mathbf{x})$, each worker runs H steps of a local solver on its data partition between communication rounds. H is a tunable hyper-parameter; the optimal value H* depends on the system and on the implementation/framework.
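To make the structure of a CoCoA round concrete, below is a minimal single-process NumPy sketch for the ridge-regression instance of the objective (see slide 13). It simulates K workers sequentially, each running H local coordinate-descent steps against a stale copy of the shared vector v = Bᵀx before the deltas are averaged. All function and variable names are illustrative; this is not the authors' implementation.

```python
# Minimal single-process sketch of the CoCoA outer loop (assumption: ridge
# regression, i.e. g(B^T x) = 1/(2n)||B^T x - z||^2 and h(x) = lam*||x||^2).
# Worker k owns a block of coordinates of x; it runs H coordinate-descent
# steps against a stale local copy of v = B^T x, and only the resulting
# deltas are "communicated" and averaged. Illustrative names throughout.
import numpy as np

def local_solver(B_k, z, x_k, v, H, lam, n, rng):
    """H randomized coordinate-descent steps on one worker's coordinate block."""
    x_loc, v_loc = x_k.copy(), v.copy()          # stale local copies
    for _ in range(H):
        j = rng.integers(x_loc.size)
        b_j = B_k[j]                             # the corresponding column of B
        grad = b_j @ (v_loc - z) / n + 2 * lam * x_loc[j]
        step = grad / (b_j @ b_j / n + 2 * lam)  # exact coordinate-wise minimizer
        x_loc[j] -= step
        v_loc -= step * b_j                      # keep the local copy consistent
    return x_loc - x_k, v_loc - v                # local deltas to communicate

def cocoa(B, z, parts, H=100, lam=1.0, outer_iters=50, seed=0):
    d, n = B.shape
    rng = np.random.default_rng(seed)
    x, v = np.zeros(d), np.zeros(n)
    K = len(parts)                               # number of (simulated) workers
    for _ in range(outer_iters):
        deltas = [local_solver(B[idx], z, x[idx], v, H, lam, n, rng)
                  for idx in parts]              # in a real deployment: one task per worker
        for idx, (dx, dv) in zip(parts, deltas):
            x[idx] += dx / K                     # conservative averaging of the updates
            v += dv / K
    return x
```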

  8. Implementation: Frameworks for Distributed Computing. MPI is a high-performance computing framework: it requires advanced system knowledge and C++, but delivers good performance. Apache Spark* is an open-source cloud computing framework: easy to use, with powerful APIs (Python, Scala, Java, R), but with poorly understood overheads. The two are designed for different purposes and therefore have different characteristics. (* http://spark.apache.org/ ; Apache Spark, Spark and the Spark logo are trademarks of the Apache Software Foundation (ASF). Other product or service names may be trademarks or service marks of IBM or other companies.)

  9. Different Implementations of CoCoA (webspam dataset, 8 Spark workers): (A) Spark reference implementation*, (B) pySpark implementation, (C) MPI implementation; offloading the local solver to C++ gives (A*) Spark+C and (B*) pySpark+C. (A*), (B*) and (C) execute identical C++ code. Benchmark: 100 iterations of CoCoA with H fixed. (* https://github.com/gingsmith/cocoa )
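For variants (B)/(B*), a hedged pySpark sketch of one outer round is shown below. `load_rows`, `num_examples` and `local_solver_native` are hypothetical placeholders; `local_solver_native` stands for either a pure-Python local solver (variant B) or a C++ routine wrapped as a Python extension (variant B*). The pySpark calls themselves (parallelize, broadcast, mapPartitions, reduce) are standard, but this is not the reference implementation linked above.

```python
# Hedged pySpark sketch of one CoCoA round (variants (B)/(B*)). `load_rows`,
# `num_examples` and `local_solver_native` are hypothetical placeholders;
# `local_solver_native(B_k, v, H)` runs the local solver on one partition
# and returns that partition's delta of the shared vector v.
import numpy as np
from pyspark import SparkContext

sc = SparkContext(appName="cocoa-sketch")
rdd = sc.parallelize(load_rows(), numSlices=8).cache()  # 8 partitions ~ 8 workers
K = rdd.getNumPartitions()
v = np.zeros(num_examples)                              # shared vector v = B^T x

for _ in range(100):                                    # 100 outer iterations, H fixed
    v_b = sc.broadcast(v)                               # ship the shared vector to workers

    def solve_partition(rows):
        B_k = np.array(list(rows))                      # this partition's data block
        yield local_solver_native(B_k, v_b.value, 1000) # H = 1000 steps (illustrative)

    dv = rdd.mapPartitions(solve_partition).reduce(lambda a, b: a + b)
    v += dv / K                                         # conservative averaging
    v_b.unpersist()
```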

  10. Communication-Computation Trade-off (webspam dataset, 8 Spark workers): up to a 6x difference in runtime is observed. Understanding the characteristics of the framework and correctly adapting the algorithm can make a difference of orders of magnitude in performance!

  11. To the algorithm designer: strive to design flexible algorithms that can be adapted to system characteristics. To the user: be aware that machine learning algorithms need to be tuned to achieve good performance. [C. Dünner, T. Parnell, K. Atasu, M. Sifalakis, H. Pozidis, "Understanding and Optimizing the Performance of Distributed Machine Learning Applications on Apache Spark", IEEE International Conference on Big Data, Boston, 2017]

  12. Which local solver should we use for the H steps on each worker?

  13. Stochastic Primal-Dual Coordinate Descent Methods: good convergence properties, but the updates are inherently sequential, so they cannot leverage the full power of modern CPUs or GPUs; this motivates asynchronous implementations.
      Primal: $\min_{\mathbf{x}} \, g(B^\top \mathbf{x}) + \sum_j h_j(x_j)$ with $g$ smooth; dual: $\min_{\boldsymbol{\beta}} \, g^*(\boldsymbol{\beta}) + \sum_j h_j^*(-B_{:j}^\top \boldsymbol{\beta})$ with $g^*$ strongly convex.
      Examples: $\ell_2$-regularized SVM, $\ell_2$-regularized logistic regression, ridge regression $\min_{\mathbf{x}} \frac{1}{2n}\|B^\top\mathbf{x}-\mathbf{z}\|_2^2 + \mu\|\mathbf{x}\|_2^2$, and Lasso $\min_{\mathbf{x}} \frac{1}{2n}\|B^\top\mathbf{x}-\mathbf{z}\|_2^2 + \mu\|\mathbf{x}\|_1$.
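As a reference point for the asynchronous variants that follow, here is a hedged sketch of plain sequential stochastic coordinate descent for the Lasso objective above; every update reads and writes the single shared vector v = Bᵀx, which is what makes the method hard to parallelize. Names are illustrative.

```python
# Hedged sketch of sequential stochastic coordinate descent (SCD) for the
# Lasso objective: min_x 1/(2n)||B^T x - z||^2 + mu*||x||_1.
# Every update reads and writes the shared vector v = B^T x sequentially.
import numpy as np

def soft_threshold(u, t):
    return np.sign(u) * np.maximum(np.abs(u) - t, 0.0)

def lasso_scd(B, z, mu, epochs=10, seed=0):
    d, n = B.shape
    rng = np.random.default_rng(seed)
    x, v = np.zeros(d), np.zeros(n)            # model and shared vector v = B^T x
    col_sq = (B ** 2).sum(axis=1) / n          # precomputed ||b_j||^2 / n
    for _ in range(epochs):
        for j in rng.permutation(d):           # one stochastic pass over coordinates
            if col_sq[j] == 0:
                continue
            b_j = B[j]
            c_j = b_j @ (v - x[j] * b_j - z) / n      # partial residual correlation
            x_new = soft_threshold(-c_j, mu) / col_sq[j]
            v += (x_new - x[j]) * b_j          # sequential read-modify-write of v
            x[j] = x_new
    return x
```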

  14. Asynchronous Stochastic Algorithms for $\min_{\mathbf{x}} \, g(B^\top \mathbf{x}) + \sum_j h_j(x_j)$: parallelize over cores, with every core updating a dedicated subset of coordinates. Problem: write collisions on the shared vector. Remedies: recompute the shared vector [Liu et al., "AsySCD", JMLR'15], memory locking [K. Tran et al., "Scaling up SDCA", SIGKDD'15], or live with undefined behavior [C.-J. Hsieh et al., "PASSCoDe", ICML'15].

  15. Asynchronous Stochastic Algorithms, continued: results for ridge regression on the webspam dataset.
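The pattern behind these asynchronous schemes can be sketched as follows, in the spirit of the "live with undefined behavior" option: several threads run the same coordinate update against one shared vector without locks. This is a structural illustration only (CPython's GIL limits true parallelism, and the cited methods are implemented natively); all names are illustrative.

```python
# Structural sketch of the asynchronous pattern: each thread owns a dedicated
# subset of coordinates, but all threads write the one shared vector v without
# locks ("live with undefined behavior"). Ridge objective as on this slide;
# illustrative only, not the cited implementations.
import threading
import numpy as np

def worker(B, z, x, v, coords, lam, n, steps, seed):
    rng = np.random.default_rng(seed)
    for _ in range(steps):
        j = rng.choice(coords)                   # a coordinate owned by this core
        b_j = B[j]
        grad = b_j @ (v - z) / n + 2 * lam * x[j]
        step = grad / (b_j @ b_j / n + 2 * lam)
        x[j] -= step
        v -= step * b_j                          # unsynchronized write: collisions possible

def async_ridge(B, z, lam=1.0, n_threads=4, steps=10_000):
    d, n = B.shape
    x, v = np.zeros(d), np.zeros(n)
    blocks = np.array_split(np.arange(d), n_threads)   # dedicated coordinate subsets
    threads = [threading.Thread(target=worker,
                                args=(B, z, x, v, blk, lam, n, steps, s))
               for s, blk in enumerate(blocks)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return x
```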

  16. Two-Level Parallelism of GPUs. First level: a GPU consists of streaming multiprocessors (SMs); thread blocks get assigned to the multiprocessors and are executed asynchronously. Second level: each thread block consists of up to 1024 threads; threads are grouped into warps of 32 threads, which are executed as SIMD operations. (Diagram: thread block i mapped onto SM1-SM8, with main memory and per-SM shared memory.)

  17. GPU Acceleration: TPA-SCD, Twice Parallel Asynchronous Stochastic Coordinate Descent (webspam dataset; GPU: GeForce GTX 1080Ti; CPU: 8-core Intel Xeon E5).
      1. Thread blocks are executed in parallel, each asynchronously updating one coordinate.
      2. The update computation within a thread block is interleaved to ensure memory locality within a warp, and local memory is used to accumulate partial sums.
      3. The atomic-add functionality of modern GPUs is used to update the shared vector.
      Observed speedups of roughly 8x-10x for SVM and ridge regression.
      [T. Parnell, C. Dünner, K. Atasu, M. Sifalakis, H. Pozidis, "Large-Scale Stochastic Learning Using GPUs", IEEE International Parallel and Distributed Processing Symposium (IPDPS), Lake Buena Vista, FL, 2017]
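A rough sketch of these three ingredients using Numba's CUDA support is given below, for dense data and the ridge objective: each thread block owns one coordinate, its threads compute the inner product with interleaved accesses and a shared-memory reduction, and the shared vector is updated with atomic adds. The published TPA-SCD is a native CUDA implementation for sparse data, so this is an illustration of the pattern under those assumptions, not the algorithm as published; all names are illustrative.

```python
# Sketch of a TPA-SCD-style kernel in Numba CUDA for dense ridge regression
# (an assumption; the published implementation is native CUDA on sparse data).
# Each thread block updates one coordinate: its threads compute b_j.(v - z)
# with interleaved accesses and a shared-memory reduction, one thread forms
# the step, and the shared vector v is updated with atomic adds.
import numpy as np
from numba import cuda, float64

TPB = 256                                         # threads per block (power of two)

@cuda.jit
def tpa_scd_ridge_kernel(B, z, x, v, col_sq, perm, lam, inv_n):
    j = perm[cuda.blockIdx.x]                     # this block's coordinate
    tid = cuda.threadIdx.x
    n = v.shape[0]
    partial = cuda.shared.array(shape=TPB, dtype=float64)
    step_sh = cuda.shared.array(shape=1, dtype=float64)

    acc = 0.0                                     # interleaved partial inner product
    for i in range(tid, n, TPB):
        acc += B[j, i] * (v[i] - z[i])
    partial[tid] = acc
    cuda.syncthreads()

    s = TPB // 2                                  # shared-memory tree reduction
    while s > 0:
        if tid < s:
            partial[tid] += partial[tid + s]
        cuda.syncthreads()
        s //= 2

    if tid == 0:                                  # one thread computes the update
        grad = partial[0] * inv_n + 2.0 * lam * x[j]
        step_sh[0] = grad / (col_sq[j] * inv_n + 2.0 * lam)
        x[j] -= step_sh[0]
    cuda.syncthreads()

    step = step_sh[0]
    for i in range(tid, n, TPB):                  # atomic adds on the shared vector
        cuda.atomic.add(v, i, -step * B[j, i])

def tpa_scd_ridge(B, z, lam=1.0, epochs=5):
    d, n = B.shape
    d_B, d_z = cuda.to_device(B), cuda.to_device(z)
    d_x, d_v = cuda.to_device(np.zeros(d)), cuda.to_device(np.zeros(n))
    d_cs = cuda.to_device((B ** 2).sum(axis=1))   # precomputed ||b_j||^2
    for _ in range(epochs):
        perm = cuda.to_device(np.random.permutation(d))
        tpa_scd_ridge_kernel[d, TPB](d_B, d_z, d_x, d_v, d_cs, perm, lam, 1.0 / n)
        cuda.synchronize()
    return d_x.copy_to_host()
```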

  18. Which local solver should we use? It depends on the available hardware.

  19. Heterogeneous System: a multi-core CPU paired with a GPU accelerator.

  21. DuHL: Duality-gap based Heterogeneous Learning [NIPS'17], a scheme to efficiently use limited-memory accelerators for linear learning. Idea: the GPU should work on the part of the data it can learn the most from. The contribution of an individual data column to the duality gap is indicative of its potential to improve the model, so we select the coordinates j with the largest duality gap. (Illustration: Lasso on the epsilon dataset.)

  22. DuHL, continued: computing the duality gap $\mathrm{Gap}(\mathbf{x}) = \sum_k \big( x_k \langle \mathbf{y}_k, \mathbf{w} \rangle + h_k(x_k) + h_k^*(-\mathbf{y}_k^\top \mathbf{w}) \big)$ is expensive, so we introduce a gap memory and parallelize the workload between CPU and GPU: the GPU runs the algorithm on a subset of the data while the CPU computes the importance values.

  23. The DuHL Algorithm: interplay of the GPU solver and the CPU gap computation. [C. Dünner, T. Parnell, M. Jaggi, "Efficient Use of Limited-Memory Accelerators for Linear Learning on Heterogeneous Systems", NIPS, Long Beach, CA, 2017]
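Below is a hedged, CPU-only sketch of the DuHL selection loop for the Lasso objective of slide 13. The gap memory holds stale per-coordinate duality-gap values, the set S of the m largest-gap columns plays the role of the data kept in the limited accelerator memory, and the accelerator phase is sketched here as plain coordinate descent on S; the handling of the unbounded L1 conjugate is simplified relative to the paper, and all names are illustrative.

```python
# CPU-only sketch of the DuHL selection loop for Lasso (illustrative names;
# the treatment of the unbounded L1 conjugate is simplified vs. the paper).
import numpy as np

def soft_threshold(u, t):
    return np.sign(u) * np.maximum(np.abs(u) - t, 0.0)

def duhl_lasso(B, z, mu, m, outer_iters=20, inner_steps=1000, seed=0):
    d, n = B.shape
    rng = np.random.default_rng(seed)
    x, v = np.zeros(d), np.zeros(n)              # model and shared vector v = B^T x
    gaps = np.full(d, np.inf)                    # gap memory, optimistic initialization
    col_sq = (B ** 2).sum(axis=1) / n
    for _ in range(outer_iters):
        S = np.argsort(gaps)[-m:]                # the m columns with the largest stored gap
        # --- accelerator phase (sketched on the CPU): coordinate descent on S ---
        for _ in range(inner_steps):
            j = rng.choice(S)
            if col_sq[j] == 0:
                continue
            b_j = B[j]
            c_j = b_j @ (v - x[j] * b_j - z) / n
            x_new = soft_threshold(-c_j, mu) / col_sq[j]
            v += (x_new - x[j]) * b_j
            x[j] = x_new
        # --- CPU phase: refresh the gap memory (importance values) ---
        w = (v - z) / n                          # gradient of the smooth part
        u = B @ w                                # all inner products b_j . w
        gaps = mu * np.abs(x) + x * u            # per-coordinate Fenchel-Young gap
        gaps[np.abs(u) > mu] = np.inf            # L1 conjugate unbounded outside [-mu, mu]
    return x
```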

  24. DuHL: Performance Results (ImageNet dataset, 30 GB; GPU: NVIDIA Quadro M4000, 8 GB; CPU: 8-core Intel Xeon x86, 64 GB). Fig 1: superior convergence properties of DuHL over existing schemes (Lasso suboptimality vs. iterations). Fig 2: I/O efficiency of DuHL (number of swaps vs. iterations). Reduced I/O cost and faster convergence accumulate to a 10x speedup.

  25. DuHL: Performance Results, continued (same setup): Lasso suboptimality and SVM duality gap vs. time [s]. Reduced I/O cost and faster convergence accumulate to a 10x speedup.

  26. Combining It All: a library for ultra-fast machine learning. Goal: remove training as a bottleneck, i.e. enable seamless retraining of models, agile development, training on large-scale datasets, and high-quality insights. Ingredients: exploit the primal-dual structure of ML problems to minimize communication, offer GPU acceleration, implement DuHL for efficient utilization of limited-memory accelerators, and improve the memory management of Spark.
