Lonestar: A Suite of Parallel Irregular Programs
Milind Kulkarni, Martin Burtscher, Călin Caşcaval, and Keshav Pingali
Tuesday, April 21, 2009

Why Another Benchmark Suite?
• We understand parallelism in regular algorithms
  • e.g., in N × N matrix-matrix multiply, can do N³ multiplications concurrently
• What about irregular algorithms?
  • Operate on complex, pointer-based data structures such as graphs, trees, etc.
  • Very input-dependent behavior
  • Is there much parallelism? Can this parallelism be exploited?

Example Algorithms
Application domain: algorithms
• Data-mining: agglomerative clustering, k-means
• Bayesian inference: belief propagation, survey propagation
• Compilers: iterative dataflow, elimination-based dataflow
• Functional interpreters: graph reduction, static/dynamic dataflow
• Maxflow: preflow-push, augmenting paths
• Minimum spanning trees: Prim’s, Kruskal’s, Boruvka’s
• N-body methods: Barnes-Hut, fast multipole
• Graphics: ray tracing
• Linear solvers: sparse MVM, sparse Cholesky factorization
• Event-driven simulation: time warp, Chandy-Misra-Bryant
• Meshing: Delaunay mesh refinement, triangulation

Example: Delaunay Mesh Refinement
• Worklist of bad triangles
• Process bad triangles by removing “cavity” and re-triangulating
• May create new bad triangles
• Triangles can be processed in any order
• Algorithm terminates when worklist is empty
[Figure: mesh before and after refining one cavity]

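To make the refinement loop concrete, here is a minimal sequential sketch in C++. The Mesh and Triangle types and their methods (badTriangles, isBad, buildCavity, retriangulate) are hypothetical stand-ins for illustration, not the actual Lonestar/Galois API.

```cpp
#include <deque>
#include <vector>

// Hypothetical stand-ins for the mesh data structures; the real Lonestar
// code uses its own mesh/graph classes instead.
struct Triangle;  // vertices, circumcircle, quality metric, ...

struct Mesh {
    std::vector<Triangle*> badTriangles();            // triangles violating quality bounds
    bool isBad(Triangle* t);                          // still in the mesh and still bad?
    std::vector<Triangle*> buildCavity(Triangle* t);  // triangles forming t's cavity
    std::vector<Triangle*> retriangulate(const std::vector<Triangle*>& cavity);
};

// Sequential Delaunay mesh refinement: process bad triangles in any order,
// re-triangulating each cavity and pushing newly created bad triangles.
void refine(Mesh& mesh) {
    std::deque<Triangle*> worklist;
    for (Triangle* t : mesh.badTriangles())
        worklist.push_back(t);

    while (!worklist.empty()) {
        Triangle* t = worklist.front();
        worklist.pop_front();
        if (!mesh.isBad(t)) continue;                  // may have been removed by an earlier cavity

        std::vector<Triangle*> cavity = mesh.buildCavity(t);
        for (Triangle* created : mesh.retriangulate(cavity))
            if (mesh.isBad(created))
                worklist.push_back(created);           // new bad triangles go back on the worklist
    }
}
```
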
Example: Event-driven Simulation
• Network of nodes
• Worklist of events, ordered by timestamp
• Nodes process events, can generate new events to send to other nodes
• Events must be processed in global time order
[Figure: two nodes A and B exchanging events with timestamps 2, 3, 4, and 6]

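A minimal sequential sketch of the event-driven loop above, using a std::priority_queue keyed on timestamps so events come off the worklist in global time order. The Event struct and the processAtNode hook are hypothetical, application-specific stand-ins.

```cpp
#include <queue>
#include <vector>

// Hypothetical event type for illustration only.
struct Event {
    double timestamp;
    int    node;      // destination node id
    int    payload;
};

// Order the priority queue so the earliest timestamp is popped first.
struct LaterFirst {
    bool operator()(const Event& a, const Event& b) const {
        return a.timestamp > b.timestamp;
    }
};

// Assumed application hook: a node turns one incoming event into zero or
// more outgoing events (with later timestamps).
std::vector<Event> processAtNode(const Event& e);

// Sequential discrete-event simulation: the worklist is a priority queue
// ordered by timestamp, so events are consumed in global time order.
void simulate(std::vector<Event> initialEvents) {
    std::priority_queue<Event, std::vector<Event>, LaterFirst> worklist(
        LaterFirst(), std::move(initialEvents));

    while (!worklist.empty()) {
        Event e = worklist.top();
        worklist.pop();
        for (const Event& out : processAtNode(e))   // may generate new, later events
            worklist.push(out);
    }
}
```
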
A Unified Approach to Irregular Algorithms
• Want to raise the level of abstraction, find commonalities between algorithms
• Inspired by Wirth’s aphorism, “Program = Algorithm + Data Structure”
• Abstract data structure: a graph
• Abstract algorithm:
  • Operate over ordered or unordered worklists of active nodes
  • Process an active node by accessing its neighborhood
  • May generate new active nodes

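This abstraction can be written down as a generic worklist loop. The sketch below is illustrative only; Graph, Node, and the operator signature are assumed placeholders rather than the Galois interfaces.

```cpp
#include <deque>
#include <functional>
#include <vector>

// Hypothetical graph interface; a stand-in for whatever concrete graph the
// algorithm uses (mesh, binary tree, factor graph, ...).
struct Graph;
using Node = int;

// The "operator": applied to one active node, it touches only that node's
// neighborhood and may return newly activated nodes.
using Operator = std::function<std::vector<Node>(Graph&, Node)>;

// Unordered version of the abstract algorithm: any active node may be
// processed next. An ordered algorithm would use a priority queue instead.
void runWorklist(Graph& g, std::deque<Node> worklist, const Operator& op) {
    while (!worklist.empty()) {
        Node active = worklist.front();
        worklist.pop_front();
        for (Node n : op(g, active))       // the operator may create new active nodes
            worklist.push_back(n);
    }
}
```
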
Amorphous Data Parallelism
• Where’s the parallelism?
• Can process nodes with non-overlapping neighborhoods in parallel
• Ordered worklists: must respect ordering constraints
• In general, must use optimistic/speculative parallelism

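One way to see the non-overlap condition in code: an activity may run only if it can claim every node in its neighborhood. The sketch below illustrates this with per-node try-locks; it is a simplified stand-in for a speculative runtime (no rollback of partially executed work is shown), and the lock table is an assumed structure, not the Galois conflict-detection mechanism.

```cpp
#include <mutex>
#include <vector>

// One lock per graph node: a hypothetical stand-in for the per-node
// ownership metadata a speculative runtime would maintain, e.g.
//   std::vector<std::mutex> nodeLock(numNodes);

// Attempt to claim every node in an activity's neighborhood. If any node is
// already owned by another thread, release what was taken and report a
// conflict so the activity can be retried later.
bool tryAcquireNeighborhood(std::vector<std::mutex>& nodeLock,
                            const std::vector<int>& neighborhood) {
    std::vector<int> acquired;
    for (int n : neighborhood) {
        if (!nodeLock[n].try_lock()) {
            for (int a : acquired)            // roll back the partial acquisition
                nodeLock[a].unlock();
            return false;                     // conflict: neighborhoods overlap
        }
        acquired.push_back(n);
    }
    return true;                              // neighborhood owned exclusively
}

void releaseNeighborhood(std::vector<std::mutex>& nodeLock,
                         const std::vector<int>& neighborhood) {
    for (int n : neighborhood)
        nodeLock[n].unlock();
}
```
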
Lonestar Benchmark Suite
• Suite of irregular programs that exhibit amorphous data parallelism
  • Agglomerative clustering (AC)
  • Barnes-Hut (BH)
  • Delaunay mesh refinement (DMR)
  • Delaunay triangulation (DT)
  • Survey propagation (SP)

Why These Applications?
• Real-world applications, which perform a substantial amount of work
• Algorithms have significant potential for parallelism
• Parallel implementations exhibit significant speedup

Performance Characteristics
• Used performance counters on a SPARC IV platform to gather performance characteristics of sequential execution

Application | Input Size                   | Iterations | Memory Footprint | Instructions/Iteration | Memory Acc/Iteration | L1d Miss Rate
AC          | 2M points                    | 1,999,999  | 1,039 MB         | 67,920                 | 13,832               | 7.1%
BH          | 220K bodies                  | 220,000    | 41 MB            | 199,167                | 49,789               | 14.1%
DMR         | 550K triangles               | 1,297,380  | 2,545 MB         | 72,747                 | 22,684               | 31.7%
DT          | 80K points                   | 80,000     | 927 MB           | 262,952                | 91,547               | 41.1%
SP          | 500 variables, 2100 clauses  | 4,492,403  | 8 MB             | 177,885                | 42,016               | 2.1%

Summary of Characteristics
• In each benchmark, an average iteration executes tens of thousands of instructions
• Benchmarks perform many memory accesses
• Three of five benchmarks exhibit high L1 data cache miss rates

Why These Applications?
• Real-world applications, which perform a substantial amount of work
• Algorithms have significant potential for parallelism
• Parallel implementations exhibit significant speedup

Measuring Potential Parallelism
• How much parallelism actually exists in these applications?
  • Independent of implementation details, particular architectures, etc.
• Used ParaMeter to measure available parallelism in Lonestar applications
  • Kulkarni et al., “How Much Parallelism is There in Irregular Applications?”
• Key idea: determine how many active nodes have non-overlapping neighborhoods
• Generate parallelism profiles: number of parallel computations executed in each step

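The counting idea behind a parallelism profile can be sketched as follows: in each step, greedily select pending activities whose neighborhoods are pairwise disjoint, execute them together, and record how many fit. The Activity struct and the execute hook are hypothetical, and ParaMeter's actual implementation (built on the Galois runtime) differs; this only illustrates the measurement concept.

```cpp
#include <cstddef>
#include <deque>
#include <unordered_set>
#include <vector>

// Hypothetical activity: one active element plus the graph nodes it touches.
struct Activity {
    int              element;
    std::vector<int> neighborhood;
};

// Assumed application hook: execute one activity and return the activities
// it creates (e.g. new bad triangles in mesh refinement).
std::vector<Activity> execute(const Activity& a);

// One profile entry per step: how many pending activities with pairwise
// disjoint neighborhoods could run together in that step.
std::vector<std::size_t> parallelismProfile(std::deque<Activity> pending) {
    std::vector<std::size_t> profile;
    while (!pending.empty()) {
        std::unordered_set<int> touched;          // nodes claimed in this step
        std::deque<Activity> deferred, selected;

        for (const Activity& a : pending) {
            bool conflict = false;
            for (int n : a.neighborhood)
                if (touched.count(n)) { conflict = true; break; }
            if (conflict) { deferred.push_back(a); continue; }
            for (int n : a.neighborhood) touched.insert(n);
            selected.push_back(a);
        }

        profile.push_back(selected.size());       // parallelism available in this step
        for (const Activity& a : selected)
            for (const Activity& created : execute(a))
                deferred.push_back(created);      // new work runs in a later step
        pending.swap(deferred);
    }
    return profile;
}
```
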
ParaMeter Results – Delaunay Mesh Refinement
• Input: 220K bad triangles
  • ~50% of triangles are badly shaped
• Bell-shaped profile reflects increasing size of mesh
[Plot: available parallelism vs. computation step]

ParaMeter Results – Agglomerative Clustering
• Data mining algorithm that clusters points based on similarity
• Builds binary tree in bottom-up manner
• Input: 1,000,000 points
• Parallelism determined by structure of binary tree
[Plot: available parallelism vs. computation step]

ParaMeter Results – Delaunay Triangulation
• Constructs Delaunay mesh from set of input points
• Input size: 40,000 points
• Bell-shaped profile similar to DMR’s
• Less parallelism in the beginning because mesh starts very small
[Plot: available parallelism vs. computation step]

ParaMeter Results – Survey Propagation
• SAT solving heuristic
• Formula represented as a bipartite graph
• Iterate over variables, updating guess for truth value
• Input: 350 variables, 1470 clauses
• Parallelism profile more uniform, as graph doesn’t change dramatically
• Parallelism drops as variables are assigned truth values
[Plot: available parallelism vs. computation step]

Available Parallelism Summary
• All Lonestar benchmarks have significant available parallelism
• Different benchmarks display different parallelism behaviors – parallelism is clearly application dependent
• Available parallelism increases for each benchmark as input size increases

Why These Applications?
• Real-world applications, which perform a substantial amount of work
• Algorithms have significant potential for parallelism
• Parallel implementations exhibit significant speedup
