Speculative Load Balancing



  1. SPECULATIVE LOAD BALANCING – Hassan Eslami, William D. Gropp, Department of Computer Science, University of Illinois at Urbana-Champaign

  2. Continuous Dynamic Load Balancing
     • Irregular parallel applications
       • Irregular and unpredictable structure
       • Nested or recursive parallelism
       • Dynamic generation of units of computation
     • Available parallelism heavily depends on input data
     • Such applications require continuous dynamic load balancing
     • Examples: optimization and search problems, N-body problems

  3. Dynamic Load Balancing Model
     TaskPool.initialize(initial tasks)
     While (t = TaskPool.get())
         t.execute()
     • In execute(), one may call TaskPool.put()
     • Idle time occurs in TaskPool.get()
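
     The same model as a minimal C sketch; the task_pool_* and task_execute
     functions are hypothetical placeholders for whatever task-pool
     implementation backs the model, not an API from the talk:

        /* Minimal sketch of the task-pool model above; all functions are
         * hypothetical placeholders. */
        #include <stddef.h>

        typedef struct task task_t;

        void    task_pool_init(task_t **initial, size_t n); /* seed the pool */
        task_t *task_pool_get(void);   /* blocks while empty: the idle time  */
        void    task_pool_put(task_t *t);  /* callable from task_execute()   */
        void    task_execute(task_t *t);   /* may generate and put new tasks */

        void run(task_t **initial, size_t n)
        {
            task_pool_init(initial, n);
            task_t *t;
            while ((t = task_pool_get()) != NULL)  /* NULL: no work remains */
                task_execute(t);
        }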

  4–12. How to Eliminate Idle Time? – Prefetching
     (animated figure: task timelines for Thread 1 and Thread 2 illustrating
      prefetching of work)
     Limitations noted on the final frame:
     • Unpredictable workload
     • Data dependence and limited parallelism

  13–16. How to Eliminate Idle Time? – Speculation
     (animated figure: task timelines for Thread 1 and Thread 2 illustrating
      speculative execution, including an arbitration request and a failed
      speculation)

  17–19. Work Sharing Algorithm
     (animated figure: Thread 0–3 send work requests to a central manager,
      which answers each request with work)
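
     A hedged sketch, in C with MPI, of what the worker side of this exchange
     could look like: an idle worker sends a work request to the manager
     (assumed here to be rank 0) and waits for a chunk in return. The tags,
     the chunk_t layout, and the empty-chunk termination convention are
     illustrative assumptions, not the authors' implementation:

        /* Idle worker asking the manager for a chunk (work sharing).
         * Tags, chunk_t and the termination convention are assumptions. */
        #include <mpi.h>

        #define MANAGER          0
        #define TAG_WORK_REQUEST 1
        #define TAG_WORK_CHUNK   2
        #define MAX_CHUNK        64

        typedef struct {
            int  count;             /* 0 means: no work left, terminate */
            long nodes[MAX_CHUNK];  /* opaque work descriptors          */
        } chunk_t;

        /* Returns 1 if a non-empty chunk was received, 0 on termination. */
        int request_work(chunk_t *chunk)
        {
            MPI_Send(NULL, 0, MPI_BYTE, MANAGER, TAG_WORK_REQUEST,
                     MPI_COMM_WORLD);
            MPI_Recv(chunk, (int)sizeof *chunk, MPI_BYTE, MANAGER,
                     TAG_WORK_CHUNK, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            return chunk->count > 0;
        }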

  20–29. Speculative Work Sharing Algorithm
     (animated figure: Thread 0–3 send work requests to the manager, then a
      single worker thread and the manager thread exchange the following)
     • The worker sends arbitration requests for the speculative tasks it
       holds (A, B, C, D, E)
     • The manager responds Success for A and for B; the worker commits them
     • The manager responds Fail for C; the worker rolls back C, D, and E and
       deletes them
     • The worker then falls back to an ordinary work request
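
     A hedged sketch of the worker-side logic implied by this exchange: the
     worker asks the manager to arbitrate each speculative task it holds,
     works on the tasks while waiting, commits those that win arbitration, and
     rolls back and discards everything from the first failure onward before
     issuing an ordinary work request. All function names and the queue
     representation are illustrative assumptions, not the authors' code:

        /* Worker-side arbitration of speculative tasks; every name here is
         * an illustrative placeholder. */
        typedef struct spec_task spec_task_t;

        enum arb_result { ARB_SUCCESS, ARB_FAIL };

        void            send_arbitration_request(const spec_task_t *t);
        enum arb_result wait_arbitration_response(const spec_task_t *t);
        void            execute_speculatively(spec_task_t *t);
        void            commit(spec_task_t *t);        /* keep the results */
        void            roll_back_and_delete(spec_task_t *t);
        void            send_work_request(void);       /* normal fallback  */

        void process_speculative_queue(spec_task_t **queue, int n)
        {
            /* Ask the manager whether each held speculative task is ours. */
            for (int i = 0; i < n; i++)
                send_arbitration_request(queue[i]);

            for (int i = 0; i < n; i++) {
                execute_speculatively(queue[i]); /* overlap with arbitration */
                if (wait_arbitration_response(queue[i]) == ARB_SUCCESS) {
                    commit(queue[i]);
                } else {
                    /* A failure invalidates this task and all later ones. */
                    for (int j = i; j < n; j++)
                        roll_back_and_delete(queue[j]);
                    send_work_request();
                    return;
                }
            }
        }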

  30. Unbalanced Tree Search (UTS)
     • Counting nodes in a randomly generated tree
     • Tree generation is based on a separable cryptographic random number
       generator (sketched below):
       childCount = f(nodeId)
       childId = SHA1(nodeId, childIndex)
     • Different types of trees:
       • Binomial (probability q, number of children m)
       • Geometric (depth limit d, branching factor drawn from a geometric
         distribution with mean b)
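
     A hedged sketch of the child-generation rule just described: a child's
     identifier is the SHA-1 digest of its parent's identifier concatenated
     with the child index, so any worker can regenerate the same subtree
     deterministically. Here sha1() stands in for a real SHA-1 routine, and
     node_t and num_children() are illustrative:

        /* Deterministic UTS child generation: childId = SHA1(nodeId, i). */
        #include <stdint.h>
        #include <string.h>

        #define DIGEST_LEN 20                   /* SHA-1 digest size, bytes */

        typedef struct { uint8_t id[DIGEST_LEN]; } node_t;

        void sha1(const uint8_t *msg, size_t len, uint8_t out[DIGEST_LEN]);
        int  num_children(const node_t *n);     /* childCount = f(nodeId):
                                                   binomial/geometric rule  */

        void make_child(const node_t *parent, int child_index, node_t *child)
        {
            uint8_t buf[DIGEST_LEN + sizeof(int)];
            memcpy(buf, parent->id, DIGEST_LEN);
            memcpy(buf + DIGEST_LEN, &child_index, sizeof(int));
            sha1(buf, sizeof buf, child->id);
        }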

  31. Work Sharing in UTS
     • A node in the tree is a unit of work
     • A chunk is a set of nodes and the minimum transferable unit
     • The release interval is the frequency with which a worker releases a
       chunk to the manager:
       if (HasSurplusWork() and
           NodesProcessed % release_interval == 0)
       {
           ReleaseWork()
       }
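
     A hedged sketch of where the release check sits in a worker's main loop:
     after each processed node, a worker holding surplus work hands a chunk
     back to the manager once per release interval. get_chunk(),
     have_local_nodes(), process_one_node(), has_surplus_work() and
     release_work() are illustrative placeholders, not the UTS source:

        /* Worker loop combining chunk fetching with the release check. */
        #include <stdbool.h>

        extern long release_interval;  /* tuning parameter from these slides */

        bool get_chunk(void);          /* ask the manager; false when done   */
        bool have_local_nodes(void);
        void process_one_node(void);   /* expand a node, enqueue children    */
        bool has_surplus_work(void);
        void release_work(void);       /* return one chunk to the manager    */

        void worker_loop(void)
        {
            long nodes_processed = 0;
            while (get_chunk()) {
                while (have_local_nodes()) {
                    process_one_node();
                    nodes_processed++;
                    /* Periodically hand surplus nodes back so idle workers
                     * can be fed. */
                    if (has_surplus_work() &&
                        nodes_processed % release_interval == 0)
                        release_work();
                }
            }
        }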

  32. Experimental Setup and Inputs
     • Illinois Campus Cluster
       • Cluster of HP ProLiant servers
       • 2 Intel X5650 2.66 GHz 6-core processors per node
       • High-speed InfiniBand cluster interconnect
     • Tree sizes (in 10^9 nodes):

                  Binomial   Geometric
       Small      0.111      0.102
       Medium     2.79       1.64
       Large      10.6       4.23

  33–37. Tuning of Original Algorithm – Small Input (on 4 nodes, 12 cores each)
     (animated figure: impact of release interval on execution time for the
      geometric tree; x-axis: chunk size, 1–100; y-axis: execution time,
      0–40 s; one curve per release interval from 4 up to 65536)

  38. Original vs. Speculative Algorithm – Small Input (on 4 nodes, 12 cores each)
     (figure: the same release-interval/chunk-size plots shown side by side for
      the original and the speculative algorithm, geometric tree, 0–40 s)

  39. Tuning of Original Algorithm – Medium Input (on 4 nodes, 12 cores each)
     (figure: impact of release interval on execution time for the geometric
      tree; x-axis: chunk size; y-axis: execution time, 10–50 s)
     • Optimal values: (128, 12)
     • Some results for the large input on 8 nodes, times in seconds:

                     (128, 8)   (128, 12)
       Original      50.385     26.681
       Speculative   18.902     18.886

  40. Scalability Study – Geometric Tree
     (figure: execution time, 0–180 s, versus number of MPI ranks, 10–1000,
      for the original and the speculative algorithms)

  41. Scalability Study – Binomial Tree
     (figure: execution time, 0–70 s, versus number of MPI ranks, 10–1000,
      for the original and the speculative algorithms)

  42. Conclusion
     • Speculation
       • Is a lightweight technique for load-balancing algorithms
       • Is a potential solution for eliminating idle time
       • Reduces the sensitivity of a load-balancing algorithm to its parameters
       • Helps to reduce tuning effort
       • Exhibits higher scalability

  43. BACKUP SLIDES


  45. Design Guidelines
     • The time it takes to process a speculative task is far less than the
       time it takes to get the response to an arbitration
     • A worker may need multiple speculative tasks at a time
     • Low-overhead algorithm for getting a speculative task
     • Minimal speculative task transfer (i.e. minimizing speculative task
       destruction)
     • The quality of a speculative task decreases over time
     • The more actual tasks a worker has, the fewer speculative tasks it
       should carry
     • The quality of a speculative task increases the deeper it sits in its
       owner's actual queue

  46. Does Speculation Help Work Stealing?
     • Baseline algorithm + speculative-algorithm guidelines = speculative
       work stealing (Algorithm A)
     • Speculative work stealing + replacing speculative messages with
       prefetching = optimized prefetch-based work stealing (Algorithm B)
     • "A" has a slight performance benefit over "B" (less than 5 percent
       overall)
       • Reason: even the baseline does not have much idle time in UTS
     • ... but speculative work stealing is helpful in problems where
       parallelism is limited by data dependences
       • Example: depth-first traversal of a graph
