Speculative Load Balancing
Hassan Eslami, William D. Gropp
Department of Computer Science, University of Illinois at Urbana-Champaign
2 Continuous Dynamic Load Balancing
• Irregular parallel applications
  • Irregular and unpredictable structure
  • Nested or recursive parallelism
  • Dynamic generation of units of computation
  • Available parallelism heavily depends on input data
• Require continuous dynamic load balancing
• Examples: optimization and search problems, N-body problems
3 Dynamic Load Balancing Model

  TaskPool.initialize(initial_tasks)
  while (t = TaskPool.get())
      t.execute()

• In execute(), a task may call TaskPool.put() to add new work
• Idle time occurs in TaskPool.get() when the pool is empty
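The model above can be sketched in a few lines of Python (a minimal illustrative sketch; the class and method names follow the slide's pseudocode, and tasks are modeled as callables that receive the pool):

```python
import queue

class TaskPool:
    """Minimal sketch of the task-pool model from the slide."""
    def __init__(self, initial_tasks):
        self._q = queue.Queue()
        for t in initial_tasks:
            self._q.put(t)

    def get(self):
        # A worker idles here when the pool is empty: this is the
        # idle time the talk aims to eliminate.
        try:
            return self._q.get_nowait()
        except queue.Empty:
            return None

    def put(self, task):
        # Called from inside a task's execute() to add new work.
        self._q.put(task)

def run(pool):
    # Driver loop from the slide: tasks may enqueue more tasks as they run.
    processed = 0
    while (t := pool.get()) is not None:
        t(pool)          # "execute()"; may call pool.put()
        processed += 1
    return processed
```

For example, a task that spawns one child task yields two processed tasks in total.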
4 How to Eliminate Idle Time? – Prefetching
(figure: animation of two threads; an idle thread prefetches work before it runs out)
• Unpredictable workload
• Data dependence and limited parallelism
13 How to Eliminate Idle Time? – Speculation
(figure: animation of two threads; an idle thread executes a task speculatively, sends an arbitration request, and rolls back if the speculation fails)
17 Work Sharing Algorithm
(figure: a manager thread receives work requests from worker threads 0–3 and redistributes surplus work)
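The work-sharing scheme can be sketched as a central manager that buffers chunks released by busy workers and hands them out on request (a minimal Python sketch; class and method names are illustrative, not taken from the paper):

```python
from collections import deque

class Manager:
    """Central manager in the work-sharing algorithm (illustrative)."""
    def __init__(self):
        self.pool = deque()

    def release(self, chunk):
        # A worker with surplus work ships a chunk to the manager.
        self.pool.append(chunk)

    def request(self):
        # An idle worker asks for work; None means no work is available
        # yet, so the worker idles (the cost speculation tries to hide).
        return self.pool.popleft() if self.pool else None
```

In the real algorithm the release and request calls are messages between MPI ranks rather than method calls.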
20 Speculative Work Sharing Algorithm
(figure: message exchange between a worker thread and the manager thread)
• An idle worker takes speculative tasks (A, B, C, D, E) and sends the manager an arbitration request for each
• Response for A: success → commit A
• Response for B: success → commit B
• Response for C: fail → roll back C, and roll back and delete D and E as well
• The worker then falls back to a regular work request
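The commit/rollback sequence above can be sketched as follows (a hypothetical Python model of the arbitration protocol, not the paper's MPI implementation; a failed arbitration invalidates every later speculative task as well):

```python
from collections import deque

class SpeculativeWorker:
    """Illustrative model of the arbitration/commit/rollback protocol."""
    def __init__(self):
        self.speculative = deque()   # tasks executed but not yet arbitrated

    def speculate(self, task):
        # Run a task before ownership is confirmed; buffer its result
        # instead of publishing it.
        result = task()
        self.speculative.append((task, result))
        return result

    def on_response(self, task, success, commit, rollback):
        # Manager's arbitration response for the oldest speculative task.
        t, result = self.speculative.popleft()
        assert t is task, "responses arrive in request order"
        if success:
            commit(result)           # speculation confirmed: keep effects
        else:
            # A failure invalidates all later speculative tasks too,
            # since they may depend on state from the failed one.
            rollback(result)
            while self.speculative:
                _, r = self.speculative.popleft()
                rollback(r)
```

Replaying the sequence from the slides (success for A and B, failure for C) commits A and B and rolls back C, D, and E.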
30 Unbalanced Tree Search (UTS)
• Counting the nodes of a randomly generated tree
• Tree generation is based on a separable cryptographic random number generator:

  childCount = f(nodeId)
  childId = SHA1(nodeId, childIndex)

• Different types of trees
  • Binomial (probability q, number of children m)
  • Geometric (depth limit d, branching factor geometrically distributed with mean b)
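The generation rule can be sketched with Python's hashlib (an illustrative sketch: the byte packing and the fixed root degree are assumptions of this sketch, and the real UTS benchmark uses its own SHA-1-based RNG implementation):

```python
import hashlib

def child_id(node_id: bytes, child_index: int) -> bytes:
    # A child's identity is a cryptographic hash of its parent's id and
    # its index, so any worker can regenerate a subtree from a single id.
    return hashlib.sha1(node_id + child_index.to_bytes(4, "big")).digest()

def child_count_binomial(node_id: bytes, q: float, m: int,
                         is_root: bool, root_degree: int = 4) -> int:
    # Binomial tree: the root has a fixed degree; every other node has
    # m children with probability q and 0 children otherwise, with the
    # node id itself serving as the deterministic randomness.
    if is_root:
        return root_degree
    h = int.from_bytes(node_id[:4], "big") / 2**32   # uniform in [0, 1)
    return m if h < q else 0
```

Because the count and the child ids are pure functions of the node id, a node is a fully self-contained unit of work that can be shipped between ranks.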
31 Work Sharing in UTS
• A tree node is a unit of work
• A chunk is a set of nodes, and the minimum transferable unit
• The release interval is the frequency with which a worker releases a chunk to the manager:

  if (HasSurplusWork() and
      NodesProcessed % release_interval == 0)
  {
      ReleaseWork()
  }
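The release rule can be sketched as a per-node check in the worker loop (a hypothetical helper; here HasSurplusWork() is modeled as holding more than one chunk's worth of local work, which is an assumption of this sketch):

```python
def worker_step(local_queue, manager_pool, chunk_size,
                release_interval, nodes_processed):
    # Every release_interval processed nodes, a worker holding more
    # than one chunk of surplus work ships one chunk to the manager.
    has_surplus = len(local_queue) > chunk_size
    if has_surplus and nodes_processed % release_interval == 0:
        chunk = [local_queue.pop() for _ in range(chunk_size)]
        manager_pool.append(chunk)   # stand-in for ReleaseWork()
    return local_queue, manager_pool
```

Both parameters matter: the chunk size sets how much work moves at once, and the release interval sets how often the check fires, which is exactly the parameter space explored in the tuning slides below.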
32 Experimental Setup and Inputs
• Illinois Campus Cluster
  • Cluster of HP ProLiant servers
  • Two Intel X5650 2.66 GHz 6-core processors per node
  • High-speed InfiniBand cluster interconnect

Input tree sizes (10^9 nodes):

            Binomial   Geometric
  Small       0.111      0.102
  Medium      2.79       1.64
  Large      10.6        4.23
33 Tuning of Original Algorithm – Small Input (on 4 nodes, 12 cores each)
(figure: impact of release interval on execution time, geometric tree; execution time (s) vs. chunk size, one curve per release interval from 4 to 65536)
38 Original vs. Speculative Algorithm – Small Input (on 4 nodes, 12 cores each)
(figure: side-by-side plots of execution time (s) vs. chunk size for the original and speculative algorithms on the geometric tree, one curve per release interval from 4 to 65536)
39 Tuning of Original Algorithm – Medium Input (on 4 nodes, 12 cores each)
(figure: impact of release interval on execution time, geometric tree; execution time (s) vs. chunk size)
• Optimal values: (128, 12)
• Some results for the large input on 8 nodes (execution time in seconds):

               (128, 8)   (128, 12)
  Original      50.385     26.681
  Speculative   18.902     18.886
40 Scalability Study – Geometric Tree
(figure: execution time (s) vs. number of MPI ranks, 10 to 1000, for the original and speculative algorithms)
41 Scalability Study – Binomial Tree
(figure: execution time (s) vs. number of MPI ranks, 10 to 1000, for the original and speculative algorithms)
42 Conclusion
• Speculation
  • Is a lightweight technique for load-balancing algorithms
  • Is a potential solution for eliminating idle time
  • Reduces a load-balancing algorithm's sensitivity to its parameters
  • Helps reduce tuning effort
  • Exhibits higher scalability
Backup Slides
45 Design Guidelines
• The time to process a speculative task is far less than the time to receive an arbitration response
• A worker may need multiple speculative tasks at a time
  • Low-overhead algorithm for obtaining speculative tasks
  • Minimal speculative task transfer (i.e., minimize speculative task destruction)
• The quality of a speculative task decreases over time
  • The more actual tasks a worker has, the fewer speculative tasks it should carry
• The quality of a speculative task increases the deeper it sits in its owner's actual queue
46 Does Speculation Help Work Stealing?
• Baseline algorithm + speculative algorithm guidelines = speculative work stealing (Algorithm A)
• Speculative work stealing + replacing speculative messages with prefetching = optimized prefetch-based work stealing (Algorithm B)
• "A" has a slight performance benefit over "B" (less than 5% overall)
  • Reason: even the baseline does not have much idle time in UTS
• … But speculative work stealing is helpful in problems where data dependences limit parallelism
  • Example: depth-first traversal of a graph