How Much Parallelism is There in Irregular Applications?
Milind Kulkarni, Martin Burtscher, Rajasekhar Inkulu, Călin Cașcaval and Keshav Pingali
Introduction
• We understand parallelism in regular algorithms
• e.g., in N × N matrix-matrix multiply, all N³ multiplications can be done concurrently
• What about irregular algorithms?
• They operate on complex, pointer-based data structures such as graphs and trees
• Is there much parallelism?
Example Algorithms
• Data-mining: agglomerative clustering, k-means
• Bayesian inference: belief propagation, survey propagation
• Compilers: iterative dataflow, elimination-based dataflow
• Functional interpreters: graph reduction, static/dynamic dataflow
• Maxflow: preflow-push, augmenting paths
• Minimum spanning trees: Prim's, Kruskal's, Boruvka's
• N-body methods: Barnes-Hut, fast multipole
• Graphics: ray tracing
• Linear solvers: sparse MVM, sparse Cholesky factorization
• Event-driven simulation: time warp, Chandy-Misra-Bryant
• Meshing: Delaunay mesh refinement, triangulation
Example: Delaunay Mesh Refinement
• Worklist of bad triangles
• Process a bad triangle by removing its "cavity" and re-triangulating
• Re-triangulation may create new bad triangles
• Triangles can be processed in any order
• Algorithm terminates when the worklist is empty
(Figures: mesh before and after refinement)
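The worklist pattern described above can be sketched as a sequential loop. This is a minimal sketch, not the paper's implementation: `is_bad` and `fix_cavity` are hypothetical placeholders standing in for the geometric tests and the actual cavity re-triangulation.

```python
from collections import deque

def refine(mesh, is_bad, fix_cavity):
    """Sequential worklist skeleton for Delaunay mesh refinement.

    `is_bad` and `fix_cavity` are placeholders: `fix_cavity` removes
    the cavity around a bad triangle, re-triangulates it, and returns
    any newly created bad triangles.
    """
    worklist = deque(t for t in mesh if is_bad(t))
    while worklist:                      # triangles can be taken in any order
        tri = worklist.popleft()
        if tri not in mesh:              # may have been removed by an earlier fix
            continue
        new_bad = fix_cavity(mesh, tri)  # re-triangulate the cavity
        worklist.extend(new_bad)         # new bad triangles go back on the list
    return mesh
```

Because items can be taken in any order and fixes may invalidate pending items, the loop re-checks membership before processing; this is exactly the structure that makes the algorithm amorphous data-parallel.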
Example: Event-Driven Simulation
• Network of nodes
• Worklist of events, ordered by timestamp
• Nodes process events and can generate new events to send to other nodes
• Events must be processed in global time order
(Figure: timestamped events flowing between nodes A and B)
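The ordered-worklist discipline above maps naturally onto a priority queue. A minimal sequential sketch, assuming a hypothetical `handle` callback that encodes each node's logic:

```python
import heapq

def simulate(initial_events, handle):
    """Sequential event-driven simulation loop.

    Events are (time, node, payload) tuples, always dequeued in global
    timestamp order. `handle(node, payload, time)` is a placeholder for
    a node's logic; it returns a list of new events to schedule.
    """
    pq = list(initial_events)
    heapq.heapify(pq)
    processed = []
    while pq:
        time, node, payload = heapq.heappop(pq)   # earliest event first
        processed.append((time, node))
        for ev in handle(node, payload, time):
            heapq.heappush(pq, ev)                # new events must lie in the future
    return processed
```

The global time order is what makes this an ordered worklist: two events can only run in parallel if executing them out of order would be indistinguishable from executing them in order.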
Amorphous Data Parallelism
• Data structure: graph
• Operate over ordered or unordered worklists of active nodes
• Process an active node by accessing its neighborhood
• Processing may generate new active nodes
• Nodes with non-overlapping neighborhoods can be processed in parallel
• Ordered worklists: must respect ordering constraints
"Available Parallelism"
• A measure of the maximum amount of parallelism that can be extracted from a program
• Profile the algorithm, not the system
• Disregard communication/synchronization costs, run-time overheads, and locality concerns
Measuring Parallelism
• Represent the program as a DAG
• Nodes: operations; edges: dependences
• Execution strategy: assume operations take unit time, then execute "greedily" by processing all ready operations in each step
• Parallelism profile: the number of operations executed in each step
(Figure: DAG with LOAD A1, LOAD B1, LOAD A2, LOAD B2 feeding two MULs and an ADD, and the resulting available-parallelism profile)
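The greedy unit-time schedule described above is a level-by-level traversal of the dependence DAG. A sketch that computes the parallelism profile for an arbitrary DAG:

```python
from collections import defaultdict

def parallelism_profile(nodes, edges):
    """Greedy unit-time schedule of a dependence DAG.

    Returns the number of operations executed in each computation step:
    at every step, all currently ready operations run at once.
    """
    indeg = {n: 0 for n in nodes}
    succs = defaultdict(list)
    for u, v in edges:                      # edge u -> v: v depends on u
        succs[u].append(v)
        indeg[v] += 1
    ready = [n for n in nodes if indeg[n] == 0]
    profile = []
    while ready:
        profile.append(len(ready))          # operations executed this step
        nxt = []
        for u in ready:                     # retire this step's operations
            for v in succs[u]:
                indeg[v] -= 1
                if indeg[v] == 0:
                    nxt.append(v)
        ready = nxt
    return profile
```

On the slide's example (four loads feeding two multiplies feeding one add), this greedy schedule runs all four loads in step 1, both multiplies in step 2, and the add in step 3.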
Amorphous Data-Parallel Algorithms
• No notion of ordering
• Represent the program as a graph, not a DAG
• Execution: choose a set of independent elements to process
• Different scheduling choices lead to different amounts of parallelism
• Even with unlimited resources!
(Figure: conflict graph over the worklist elements)
Greedy Scheduling
• Finding a schedule that maximizes parallelism is NP-hard
➡ Solution: schedule greedily
• Attempt to maximize the work done in the current step
• Choose a maximal independent set in the conflict graph
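The distinction between *maximum* (NP-hard) and *maximal* (cheap) is the key to the greedy step. A minimal sketch of picking a maximal independent set, assuming the conflict graph is given as an adjacency map:

```python
def maximal_independent_set(conflicts, order=None):
    """Greedy maximal independent set of a conflict graph.

    `conflicts` maps each work item to the set of items it conflicts
    with. A *maximum* independent set is NP-hard to find, so we settle
    for a maximal one: scan the items and keep each one that conflicts
    with nothing already chosen.
    """
    chosen = set()
    for item in (order or conflicts):
        if conflicts[item].isdisjoint(chosen):
            chosen.add(item)
    return chosen
```

The optional `order` argument makes the scheduling choice explicit: different scan orders yield different maximal sets, which is precisely why scheduling affects the measured parallelism even with unlimited resources.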
Incremental Execution
• The conflict graph can change during execution
• New work is generated; new conflicts appear
• Cannot perform scheduling a priori
➡ Solution: execute in stages, recalculating the conflict graph after each stage
(Figure: conflict graph evolving across stages)
ParaMeter
• Tool to generate parallelism profiles for amorphous data-parallel applications
• Uses greedy scheduling and incremental execution to handle the dynamic nature of the computation
ParaMeter Execution Strategy
• While work is left:
• Generate the conflict graph for the current worklist
• Execute a maximal independent set of nodes in the graph
• Add newly generated work to the worklist
• Generate the parallelism profile by tracking the number of nodes executed in each step
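Putting the stages together, ParaMeter's strategy can be sketched as the loop below. This is a sketch, not the tool itself: `build_conflicts` and `execute` are hypothetical placeholders for the application-specific conflict detection and operator.

```python
def parameter_profile(worklist, build_conflicts, execute):
    """Sketch of ParaMeter's staged execution strategy.

    `build_conflicts(worklist)` returns a dict mapping each item to the
    set of items it conflicts with; `execute(item)` runs one item and
    returns newly generated work. Returns two per-step series:
    available parallelism and parallelism intensity (percent).
    """
    parallelism, intensity = [], []
    while worklist:
        conflicts = build_conflicts(worklist)   # stage 1: conflict graph
        step = set()                            # stage 2: greedy maximal
        for item in worklist:                   #          independent set
            if conflicts[item].isdisjoint(step):
                step.add(item)
        parallelism.append(len(step))
        intensity.append(100.0 * len(step) / len(worklist))
        # stage 3: execute the set, collect new work, rebuild the worklist
        new_work = [w for item in step for w in execute(item)]
        worklist = [w for w in worklist if w not in step] + new_work
    return parallelism, intensity
```

Recomputing the conflict graph every stage is what lets the tool cope with new work and new conflicts appearing mid-execution, at the cost of a purely sequential, instrumented run.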
Experiments
• Profiled 7 applications:
• Delaunay mesh refinement
• Delaunay triangulation
• Augmenting-paths maxflow
• Preflow-push maxflow
• Survey propagation
• Agglomerative clustering (unordered)
• Agglomerative clustering (ordered)
Delaunay Mesh Refinement
(Figure: available parallelism vs. computation step)
Input: 100,000-triangle mesh, 47,000 bad triangles
Parallelism Intensity
• Available parallelism shows the absolute amount of parallelism in the program
• Is parallelism low because there is little work, or because of many conflicts?
• Parallelism intensity: the percentage of the worklist executed in parallel in each step
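The metric itself is a simple per-step ratio; a one-liner makes the definition concrete:

```python
def parallelism_intensity(executed_per_step, worklist_size_per_step):
    """Parallelism intensity: the fraction of the current worklist that
    executed in parallel each step, as a percentage. It distinguishes
    'little work' from 'many conflicts' when available parallelism is low.
    """
    return [100.0 * e / w
            for e, w in zip(executed_per_step, worklist_size_per_step)]
```

A step that executes 2 of 4 pending items has 50% intensity even though its available parallelism is only 2, so low absolute parallelism late in a run need not indicate heavy conflicts.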
Mesh Refinement: Parallelism Intensity
(Figure: percentage of the worklist executed vs. computation step)
Effects of Scheduling on Parallelism
(Figures: available-parallelism profiles for the same input under different schedules)
Input: 100,000-triangle mesh, 47,000 bad triangles
Delaunay Triangulation
• Build a Delaunay mesh from a given set of points
• Points are kept in an unordered worklist
• Insert points by splitting triangles and flipping edges
• Input: 10,000 points
(Figures: available parallelism and parallelism intensity vs. computation step)
Survey Propagation
• Heuristic approach to solving SAT problems
• Bipartite graph of clauses and variables
• Iteratively update variables with possible truth values
• Input: formula with 1,000 variables and 4,200 clauses
(Figures: available parallelism and parallelism intensity vs. computation step)