
Dynamic Parameter Allocation in Parameter Servers

  1. Dynamic Parameter Allocation in Parameter Servers
     Alexander Renz-Wieland (1), Rainer Gemulla (2), Steffen Zeuch (1,3), Volker Markl (1,3)
     VLDB 2020
     (1) TU Berlin, (2) University of Mannheim, (3) DFKI

  2. Takeaways
     ◮ Key challenge in distributed Machine Learning (ML): communication overhead
     ◮ Parameter Servers (PSs)
       ◮ Intuitive
       ◮ Limited support for common techniques to reduce this overhead
     ◮ How to improve support? Dynamic parameter allocation
     ◮ Is this support beneficial? Up to two orders of magnitude faster

  3. Background: Distributed Machine Learning
     ◮ Distributed training is a necessity for large-scale ML tasks
     ◮ Parameter management is a key concern
     ◮ Parameter servers (PSs) are widely used: workers access shared parameters via pull() and push() (see the sketch below)
     [Figure: logical view (workers sharing one set of parameters) vs. physical view (parameters partitioned across nodes, accessed with push() and pull())]
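
A minimal sketch of the classic pull/push interface may help make this concrete. The ToyParameterServer class below is an illustrative, single-process stand-in and not the PS-Lite or Lapse API; in a real PS, the key space is partitioned across server nodes and pull/push requests travel over the network.

```cpp
#include <cstddef>
#include <iostream>
#include <unordered_map>
#include <vector>

using Key = long;
using Value = std::vector<double>;  // one dense vector per parameter key

class ToyParameterServer {
 public:
  // pull(): read the current values of the requested parameter keys.
  std::vector<Value> pull(const std::vector<Key>& keys) {
    std::vector<Value> result;
    for (Key k : keys) result.push_back(store_[k]);
    return result;
  }

  // push(): apply additive updates (e.g., scaled negative gradients).
  void push(const std::vector<Key>& keys, const std::vector<Value>& updates) {
    for (std::size_t i = 0; i < keys.size(); ++i) {
      Value& param = store_[keys[i]];
      if (param.empty()) param.assign(updates[i].size(), 0.0);
      for (std::size_t d = 0; d < updates[i].size(); ++d)
        param[d] += updates[i][d];
    }
  }

 private:
  std::unordered_map<Key, Value> store_;  // a real PS partitions this across nodes
};

int main() {
  ToyParameterServer ps;
  std::vector<Key> keys = {7, 42};
  auto params = ps.pull(keys);                 // a worker pulls the parameters its batch touches,
  ps.push(keys, {{0.1, -0.2}, {0.05, 0.0}});   // computes gradients locally, and pushes updates back
  params = ps.pull(keys);
  std::cout << "key 42, dim 0: " << params[1][0] << "\n";  // prints 0.05
  return 0;
}
```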

  4. Problem: Communication Overhead
     Training knowledge graph embeddings (RESCAL, dimension 100):
     ◮ Communication overhead can limit scalability
     ◮ Performance can fall behind a single node
     [Plot: epoch run time in minutes vs. parallelism (1x4, 2x4, 4x4, 8x4 nodes x threads); Classic PS (PS-Lite), labeled epoch times between 1.5h and 4.5h]

  5. Problem: Communication Overhead
     Training knowledge graph embeddings (RESCAL, dimension 100):
     ◮ Communication overhead can limit scalability
     ◮ Performance can fall behind a single node
     [Plot as on slide 4, adding the Classic PS with fast local access (labeled epoch time 1.2h)]

  6. Problem: Communication Overhead
     Training knowledge graph embeddings (RESCAL, dimension 100):
     ◮ Communication overhead can limit scalability
     ◮ Performance can fall behind a single node
     [Plot as on slide 4, adding the Classic PS with fast local access (1.2h) and the Dynamic Allocation PS (Lapse) incl. fast local access (labeled epoch times 0.6h, 0.4h, 0.2h)]

  7. How to reduce communication overhead?
     ◮ Common techniques to reduce overhead: data clustering, parameter blocking, latency hiding
     ◮ Key is to avoid remote parameter accesses
     ◮ Do PSs support these techniques?
       ◮ These techniques require local access at different nodes over time
       ◮ But PSs allocate parameters statically
     [Figure: data clustering, parameter blocking, and latency hiding, each depicted as workers and the parameters/data they access]
     A latency-hiding sketch follows below.
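
As one example of these techniques, the sketch below illustrates latency hiding: the worker pulls the parameters for the next data batch while it is still computing on the current one, so the remote round-trip overlaps with computation. PullSync, ProcessBatch, and TrainWithLatencyHiding are hypothetical stand-ins for illustration, not part of any real PS API.

```cpp
#include <chrono>
#include <cstddef>
#include <future>
#include <iostream>
#include <thread>
#include <vector>

using Key = long;
using Params = std::vector<double>;

// Stand-in for a remote pull; in a real PS this is a network round-trip.
Params PullSync(const std::vector<Key>& keys) {
  std::this_thread::sleep_for(std::chrono::milliseconds(10));  // simulated latency
  return Params(keys.size(), 0.0);
}

// Stand-in for the local gradient computation (and push) for one batch.
void ProcessBatch(const std::vector<Key>& keys, const Params& params) {
  std::cout << "processed batch with " << keys.size() << " keys and "
            << params.size() << " parameter values\n";
}

// Latency hiding: overlap the pull for batch b+1 with the computation on batch b.
void TrainWithLatencyHiding(const std::vector<std::vector<Key>>& batches) {
  if (batches.empty()) return;
  std::future<Params> pending =
      std::async(std::launch::async, PullSync, batches[0]);
  for (std::size_t b = 0; b < batches.size(); ++b) {
    Params params = pending.get();  // parameters for the current batch
    if (b + 1 < batches.size()) {
      // Kick off the pull for the next batch; it runs while we compute.
      pending = std::async(std::launch::async, PullSync, batches[b + 1]);
    }
    ProcessBatch(batches[b], params);
  }
}

int main() {
  TrainWithLatencyHiding({{1, 2}, {3, 4}, {5, 6}});
  return 0;
}
```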

  9. Dynamic Parameter Allocation
     ◮ What if the PS could allocate parameters dynamically?
     ◮ A Localize(parameters) primitive relocates parameters to the requesting node
     ◮ Would provide support for
       ◮ Data clustering ✓
       ◮ Parameter blocking ✓
       ◮ Latency hiding ✓
     ◮ We call this dynamic parameter allocation (see the sketch below)
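
The sketch below illustrates the intent of a Localize(parameters) operation: a worker asks the PS to move a set of parameters to its own node before a phase of intensive access, so that subsequent pulls and pushes are fast local operations. The ToyDynamicPS class and its method signatures are assumptions made for illustration; the actual Lapse API is documented at https://github.com/alexrenz/lapse-ps.

```cpp
#include <iostream>
#include <unordered_map>
#include <vector>

using Key = long;
using Value = double;
using NodeId = int;

class ToyDynamicPS {
 public:
  explicit ToyDynamicPS(NodeId my_node) : my_node_(my_node) {}

  // Relocate the given parameters to this node (dynamic allocation).
  void Localize(const std::vector<Key>& keys) {
    for (Key k : keys) owner_[k] = my_node_;
  }

  // A pull is cheap when the parameter is currently allocated locally.
  Value Pull(Key k) {
    bool local = owner_.count(k) != 0 && owner_[k] == my_node_;
    std::cout << "pull(" << k << "): " << (local ? "local" : "remote") << " access\n";
    return values_[k];
  }

  void Push(Key k, Value update) { values_[k] += update; }

 private:
  NodeId my_node_;
  std::unordered_map<Key, NodeId> owner_;   // current location of each parameter
  std::unordered_map<Key, Value> values_;   // parameter values (single process here)
};

int main() {
  ToyDynamicPS ps(/*my_node=*/1);
  ps.Pull(42);         // remote: the parameter is not allocated on this node yet
  ps.Localize({42});   // relocate the parameter to this node
  ps.Pull(42);         // now a fast local access
  ps.Push(42, 0.5);
  return 0;
}
```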

  10. The Lapse Parameter Server
      ◮ Features
        ◮ Dynamic allocation
        ◮ Location transparency
        ◮ Retains sequential consistency
      [Table: per-key consistency guarantees for synchronous operations (serializability, sequential, causal, PRAM, eventual, stale), comparing Classic PSs and Lapse]
      ◮ Many system challenges (see paper)
        ◮ Manage parameter locations
        ◮ Route parameter accesses to the current location (a simple routing sketch follows below)
        ◮ Relocate parameters
        ◮ Handle reads and writes during relocations
        ◮ All while maintaining sequential consistency
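
To illustrate what location management can look like, the sketch below uses one simple scheme: each key has a fixed home node (e.g., by hash partitioning) that records the key's current owner, and a relocation updates that record. This is an assumption made for illustration, not necessarily how Lapse implements location management or its relocation protocol; the paper describes how such metadata stays consistent while parameters move.

```cpp
#include <iostream>
#include <unordered_map>

using Key = long;
using NodeId = int;

// One simple location-management scheme (illustrative assumption):
// the home node of a key is fixed, but the current owner can change.
class LocationManager {
 public:
  explicit LocationManager(int num_nodes) : num_nodes_(num_nodes) {}

  // Fixed home node, e.g., via hash partitioning of the key space.
  NodeId HomeNode(Key k) const { return static_cast<NodeId>(k % num_nodes_); }

  // Current owner; defaults to the home node until the key is relocated.
  NodeId CurrentOwner(Key k) const {
    auto it = owner_.find(k);
    return it != owner_.end() ? it->second : HomeNode(k);
  }

  // Record a relocation, e.g., after node `new_owner` called Localize({k}).
  void Relocate(Key k, NodeId new_owner) { owner_[k] = new_owner; }

 private:
  int num_nodes_;
  std::unordered_map<Key, NodeId> owner_;  // stands in for per-home-node metadata
};

int main() {
  LocationManager loc(/*num_nodes=*/4);
  Key k = 42;
  std::cout << "home of key 42: node " << loc.HomeNode(k) << "\n";      // node 2
  std::cout << "owner before:   node " << loc.CurrentOwner(k) << "\n";  // node 2
  loc.Relocate(k, /*new_owner=*/1);
  std::cout << "owner after:    node " << loc.CurrentOwner(k) << "\n";  // node 1
  return 0;
}
```

In this illustrative scheme, an access to key k is routed by first asking k's home node for the current owner (or by having the home node forward the request), which is what keeps accesses location transparent for the application.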

  11. Experimental study
      Tasks: matrix factorization, knowledge graph embeddings, word vectors
      Cluster: 1–8 nodes, each with 4 worker threads, 10 Gbit Ethernet
      1. Performance of Classic PSs
         ◮ 2–8 nodes barely outperformed 1 node in all tested tasks
      2. Effect of dynamic parameter allocation
         ◮ 4–203x faster than a Classic PS, up to linear speed-ups
      3. Comparison to bounded staleness PSs
         ◮ 2–28x faster and more scalable
      4. Comparison to manual management
         ◮ Competitive to a specialized low-level implementation
      5. Ablation study
         ◮ Combining fast local access and dynamic allocation is key
      [Plot, as on slide 6: epoch run time vs. parallelism for the Classic PS (PS-Lite), the Classic PS with fast local access, and the Dynamic Allocation PS (Lapse) incl. fast local access]

  12. Comparison to Bounded Staleness PS
      ◮ Matrix factorization (matrix with 1 billion entries, rank 100)
      ◮ Parameter blocking
      [Plot: epoch run time in minutes (0–40) vs. parallelism (1x4, 2x4, 4x4, 8x4 nodes x threads) for the bounded staleness PS (Petuum) with client-side synchronization, Petuum with server-side synchronization, and the Dynamic Allocation PS (Lapse); speed-up annotations 0.6x, 2.9x, and 8.4x, with notes on single-node overhead and non-linear scaling]

  14. Experimental study
      Tasks: matrix factorization, knowledge graph embeddings, word vectors
      Cluster: 1–8 nodes, each with 4 worker threads, 10 Gbit Ethernet
      1. Performance of Classic PSs
         ◮ 2–8 nodes barely outperformed 1 node in all tested tasks
      2. Effect of dynamic parameter allocation
         ◮ 4–203x faster than a Classic PS, up to linear speed-ups
      3. Comparison to bounded staleness PSs
         ◮ 2–28x faster and more scalable
      4. Comparison to manual management
         ◮ Competitive to a specialized low-level implementation
      5. Ablation study
         ◮ Combining fast local access and dynamic allocation is key

  15. Dynamic Parameter Allocation in Parameter Servers
      ◮ Key challenge in distributed Machine Learning (ML): communication overhead
      ◮ Parameter Servers (PSs)
        ◮ Intuitive
        ◮ Limited support for common techniques to reduce this overhead
      ◮ How to improve support? Dynamic parameter allocation
      ◮ Is this support beneficial? Up to two orders of magnitude faster
      ◮ Lapse is open source: https://github.com/alexrenz/lapse-ps
