Speculative Load Balancing
Hassan Eslami, William D. Gropp
Department of Computer Science, University of Illinois at Urbana-Champaign
2 Continuous Dynamic Load Balancing
• Irregular parallel applications
  • Irregular and unpredictable structure
  • Nested or recursive parallelism
  • Dynamic generation of units of computation
  • Available parallelism heavily depends on input data
• Require continuous dynamic load balancing
• Examples: optimization and search problems, N-body problems
3 Dynamic Load Balancing Model

  TaskPool.initialize(initial_tasks)
  while (t = TaskPool.get())
      t.execute()

• In execute(), a task may call TaskPool.put() to add new work
• Idle time occurs in TaskPool.get() when the pool is empty
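The model above can be sketched in a few lines of Python (a minimal illustrative sketch; the class and method names follow the slide's pseudocode, and tasks are modeled as callables that receive the pool):

```python
import queue

class TaskPool:
    """Minimal sketch of the task-pool model from the slide."""
    def __init__(self, initial_tasks):
        self._q = queue.Queue()
        for t in initial_tasks:
            self._q.put(t)

    def get(self):
        # A worker idles here when the pool is empty: this is the
        # idle time the talk aims to eliminate.
        try:
            return self._q.get_nowait()
        except queue.Empty:
            return None

    def put(self, task):
        # Called from inside a task's execute() to add new work.
        self._q.put(task)

def run(pool):
    # Driver loop from the slide: tasks may enqueue more tasks as they run.
    processed = 0
    while (t := pool.get()) is not None:
        t(pool)          # "execute()"; may call pool.put()
        processed += 1
    return processed
```

For example, a task that spawns one child task yields two processed tasks in total.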
4 How to Eliminate Idle Time? – Prefetching
(figure: animation of two threads; an idle thread prefetches work before it runs out)
• Unpredictable workload
• Data dependence and limited parallelism
13 How to Eliminate Idle Time? – Speculation
(figure: animation of two threads; an idle thread executes a task speculatively, sends an arbitration request, and rolls back if the speculation fails)
17 Work Sharing Algorithm
(figure: a manager thread receives work requests from worker threads 0–3 and redistributes surplus work)
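The work-sharing scheme can be sketched as a central manager that buffers chunks released by busy workers and hands them out on request (a minimal Python sketch; class and method names are illustrative, not taken from the paper):

```python
from collections import deque

class Manager:
    """Central manager in the work-sharing algorithm (illustrative)."""
    def __init__(self):
        self.pool = deque()

    def release(self, chunk):
        # A worker with surplus work ships a chunk to the manager.
        self.pool.append(chunk)

    def request(self):
        # An idle worker asks for work; None means no work is available
        # yet, so the worker idles (the cost speculation tries to hide).
        return self.pool.popleft() if self.pool else None
```

In the real algorithm the release and request calls are messages between MPI ranks rather than method calls.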
20 Speculative Work Sharing Algorithm
(figure: message exchange between a worker thread and the manager thread)
• An idle worker takes speculative tasks (A, B, C, D, E) and sends the manager an arbitration request for each
• Response for A: success → commit A
• Response for B: success → commit B
• Response for C: fail → roll back C, and roll back and delete D and E as well
• The worker then falls back to a regular work request
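The commit/rollback sequence above can be sketched as follows (a hypothetical Python model of the arbitration protocol, not the paper's MPI implementation; a failed arbitration invalidates every later speculative task as well):

```python
from collections import deque

class SpeculativeWorker:
    """Illustrative model of the arbitration/commit/rollback protocol."""
    def __init__(self):
        self.speculative = deque()   # tasks executed but not yet arbitrated

    def speculate(self, task):
        # Run a task before ownership is confirmed; buffer its result
        # instead of publishing it.
        result = task()
        self.speculative.append((task, result))
        return result

    def on_response(self, task, success, commit, rollback):
        # Manager's arbitration response for the oldest speculative task.
        t, result = self.speculative.popleft()
        assert t is task, "responses arrive in request order"
        if success:
            commit(result)           # speculation confirmed: keep effects
        else:
            # A failure invalidates all later speculative tasks too,
            # since they may depend on state from the failed one.
            rollback(result)
            while self.speculative:
                _, r = self.speculative.popleft()
                rollback(r)
```

Replaying the sequence from the slides (success for A and B, failure for C) commits A and B and rolls back C, D, and E.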
30 Unbalanced Tree Search (UTS)
• Counting the nodes of a randomly generated tree
• Tree generation is based on a separable cryptographic random number generator:

  childCount = f(nodeId)
  childId = SHA1(nodeId, childIndex)

• Different types of trees
  • Binomial (probability q, number of children m)
  • Geometric (depth limit d, branching factor geometrically distributed with mean b)
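The generation rule can be sketched with Python's hashlib (an illustrative sketch: the byte packing and the fixed root degree are assumptions of this sketch, and the real UTS benchmark uses its own SHA-1-based RNG implementation):

```python
import hashlib

def child_id(node_id: bytes, child_index: int) -> bytes:
    # A child's identity is a cryptographic hash of its parent's id and
    # its index, so any worker can regenerate a subtree from a single id.
    return hashlib.sha1(node_id + child_index.to_bytes(4, "big")).digest()

def child_count_binomial(node_id: bytes, q: float, m: int,
                         is_root: bool, root_degree: int = 4) -> int:
    # Binomial tree: the root has a fixed degree; every other node has
    # m children with probability q and 0 children otherwise, with the
    # node id itself serving as the deterministic randomness.
    if is_root:
        return root_degree
    h = int.from_bytes(node_id[:4], "big") / 2**32   # uniform in [0, 1)
    return m if h < q else 0
```

Because the count and the child ids are pure functions of the node id, a node is a fully self-contained unit of work that can be shipped between ranks.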
31 Work Sharing in UTS
• A tree node is a unit of work
• A chunk is a set of nodes, and the minimum transferable unit
• The release interval is the frequency with which a worker releases a chunk to the manager:

  if (HasSurplusWork() and
      NodesProcessed % release_interval == 0)
  {
      ReleaseWork()
  }
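The release rule can be sketched as a per-node check in the worker loop (a hypothetical helper; here HasSurplusWork() is modeled as holding more than one chunk's worth of local work, which is an assumption of this sketch):

```python
def worker_step(local_queue, manager_pool, chunk_size,
                release_interval, nodes_processed):
    # Every release_interval processed nodes, a worker holding more
    # than one chunk of surplus work ships one chunk to the manager.
    has_surplus = len(local_queue) > chunk_size
    if has_surplus and nodes_processed % release_interval == 0:
        chunk = [local_queue.pop() for _ in range(chunk_size)]
        manager_pool.append(chunk)   # stand-in for ReleaseWork()
    return local_queue, manager_pool
```

Both parameters matter: the chunk size sets how much work moves at once, and the release interval sets how often the check fires, which is exactly the parameter space explored in the tuning slides below.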
32 Experimental Setup and Inputs
• Illinois Campus Cluster
  • Cluster of HP ProLiant servers
  • Two Intel X5650 2.66 GHz 6-core processors per node
  • High-speed InfiniBand cluster interconnect

Input tree sizes (10^9 nodes):

            Binomial   Geometric
  Small       0.111      0.102
  Medium      2.79       1.64
  Large      10.6        4.23
33 Tuning of Original Algorithm – Small Input (on 4 nodes, 12 cores each)
(figure: impact of release interval on execution time, geometric tree; execution time (s) vs. chunk size, one curve per release interval from 4 to 65536)
38 Original vs. Speculative Algorithm – Small Input (on 4 nodes, 12 cores each)
(figure: side-by-side plots of execution time (s) vs. chunk size for the original and speculative algorithms on the geometric tree, one curve per release interval from 4 to 65536)
39 Tuning of Original Algorithm – Medium Input (on 4 nodes, 12 cores each)
(figure: impact of release interval on execution time, geometric tree; execution time (s) vs. chunk size)
• Optimal values: (128, 12)
• Some results for the large input on 8 nodes (execution time in seconds):

               (128, 8)   (128, 12)
  Original      50.385     26.681
  Speculative   18.902     18.886
40 Scalability Study – Geometric Tree
(figure: execution time (s) vs. number of MPI ranks, 10 to 1000, for the original and speculative algorithms)
41 Scalability Study – Binomial Tree
(figure: execution time (s) vs. number of MPI ranks, 10 to 1000, for the original and speculative algorithms)
42 Conclusion
• Speculation
  • Is a lightweight technique for load-balancing algorithms
  • Is a potential solution for eliminating idle time
  • Reduces a load-balancing algorithm's sensitivity to its parameters
  • Helps reduce tuning effort
  • Exhibits higher scalability
Backup Slides
45 Design Guidelines
• The time to process a speculative task is far less than the time to receive an arbitration response
• A worker may need multiple speculative tasks at a time
  • Low-overhead algorithm for obtaining speculative tasks
  • Minimal speculative task transfer (i.e., minimize speculative task destruction)
• The quality of a speculative task decreases over time
  • The more actual tasks a worker has, the fewer speculative tasks it should carry
• The quality of a speculative task increases the deeper it sits in its owner's actual queue
46 Does Speculation Help Work Stealing?
• Baseline algorithm + speculative algorithm guidelines = speculative work stealing (Algorithm A)
• Speculative work stealing + replacing speculative messages with prefetching = optimized prefetch-based work stealing (Algorithm B)
• "A" has a slight performance benefit over "B" (less than 5% overall)
  • Reason: even the baseline does not have much idle time in UTS
• … But speculative work stealing is helpful in problems where data dependences limit parallelism
  • Example: depth-first traversal of a graph