Optimistic Parallelism Requires Abstractions
Milind Kulkarni, Keshav Pingali – The University of Texas at Austin
Bruce Walter, Ganesh Ramanarayanan, Kavita Bala and L. Paul Chew – Cornell University

Motivation
✦ Parallel programming very important
  ✦ Multicore processors
✦ Parallel programming is hard!
✦ Limited success in domains which deal with structured data
  ✦ Array programs
  ✦ Database applications
✦ What about irregular applications which deal with unstructured data?
  ✦ Compile-time techniques have failed
PLDI 2007, June 11th, 2007

Galois System: Core Beliefs
✦ Irregular applications have worklist-style data parallelism
✦ Optimistic parallelization is crucial
✦ Parallelism should be hidden within natural syntactic constructs
✦ High-level application semantics are critical for parallelization

Outline
✦ Two challenge problems
✦ Galois programming model and implementation
✦ Evaluation
✦ Related Work
✦ Conclusions

Delaunay Mesh Refinement
✦ Iterative refinement procedure to produce guaranteed-quality meshes

Delaunay Pseudo-code
    Mesh mesh = /* read in mesh */
    WorkList wl;
    wl.add(mesh.badTriangles());
    while (wl.size() != 0) {            // worklist idiom
        Element e = wl.get();
        if (e no longer in mesh) continue;
        Cavity c = new Cavity(e);
        c.expand();
        c.retriangulate();
        mesh.update(c);
        wl.add(c.badTriangles());       // fixing a cavity may create new work
    }

Finding Parallelism
✦ Can expand multiple cavities in parallel
  ✦ Provided cavities do not overlap
✦ Determining this statically is impossible
✦ Solution: Optimistic parallel execution

Agglomerative Clustering
✦ Create binary tree of points in a space in bottom-up fashion
✦ Always choose two closest points to cluster
[Figure, for points a to e: (a) Data points (b) Hierarchical clusters (c) Dendrogram]

Agglomerative Clustering
✦ Two key data structures
  ✦ Priority queue – keeps pairs of points <p, n> where n is the nearest neighbor of p
    ✦ Ordered by distance
  ✦ KD-tree – spatial structure to find nearest neighbors
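
How the priority queue of <p, n> pairs can drive the clustering is easier to see in code. The following is a minimal sketch under stated assumptions, not the authors' implementation: points live on a line, a merged cluster sits at the midpoint of its two parts, and a brute-force `nearest` scan stands in for the KD-tree. The names `Cluster`, `nearest`, and `agglomerate` are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <functional>
#include <queue>
#include <tuple>
#include <utility>
#include <vector>

// One cluster (or original point) on a line; `active` means not yet merged.
struct Cluster {
    double pos;
    bool active;
};

// Brute-force stand-in for the KD-tree query (an assumption of this sketch):
// index of the nearest active cluster other than `self`, or -1 if none.
int nearest(const std::vector<Cluster>& cs, int self) {
    int best = -1;
    for (int i = 0; i < static_cast<int>(cs.size()); ++i) {
        if (i == self || !cs[i].active) continue;
        if (best < 0 || std::fabs(cs[i].pos - cs[self].pos) <
                        std::fabs(cs[best].pos - cs[self].pos)) {
            best = i;
        }
    }
    return best;
}

// Returns the merges in the order performed, as (p, n) index pairs.
// Each new cluster takes the next free index.
std::vector<std::pair<int, int>> agglomerate(const std::vector<double>& points) {
    using Entry = std::tuple<double, int, int>;  // (distance, p, n)
    std::vector<Cluster> cs;
    for (double x : points) cs.push_back({x, true});

    // Min-heap on distance: the closest <p, n> pair is served first.
    std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> pq;
    auto enqueue = [&](int p) {
        int n = nearest(cs, p);
        if (n >= 0) pq.emplace(std::fabs(cs[p].pos - cs[n].pos), p, n);
    };
    for (int p = 0; p < static_cast<int>(cs.size()); ++p) enqueue(p);

    std::vector<std::pair<int, int>> merges;
    while (!pq.empty()) {
        Entry top = pq.top();
        pq.pop();
        int p = std::get<1>(top);
        int n = std::get<2>(top);
        assert(p != n);
        if (!cs[p].active) continue;                  // p already clustered
        if (!cs[n].active) { enqueue(p); continue; }  // stale neighbor: refresh
        cs[p].active = false;
        cs[n].active = false;
        cs.push_back({(cs[p].pos + cs[n].pos) / 2.0, true});
        merges.emplace_back(p, n);
        enqueue(static_cast<int>(cs.size()) - 1);     // new cluster joins the queue
    }
    return merges;
}
```

Calling `agglomerate({0.0, 1.0, 10.0, 11.0})` merges points 0 and 1, then points 2 and 3, and finally the two resulting clusters. The queue here plays the role of a worklist whose contents change as clusters are formed.
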

Finding Parallelism
✦ Priority queue functions as a worklist
✦ Seems to be completely sequential
✦ If clusters are independent, can be done in parallel

Lessons Learned
✦ Worklist-style data parallelism
  ✦ May be dependences between iterations
✦ However, worklist abstractions are missing from the code
✦ Concurrent access to shared objects a must
  ✦ worklist, priority queue, kd-tree
Galois Programming Model and Implementation

Programming Model
✦ Object-based shared memory model
✦ Client code must invoke methods to access object state
✦ Client code has sequential semantics
  ✦ But runtime system may execute code in parallel
[Figure: Client Code and Galois Objects]

Worklist Abstractions
✦ Iterators over collections
✦ foreach e in set S do B(e)
  ✦ Iterations can execute in any order
  ✦ As in Delaunay mesh refinement
✦ foreach e in poSet S do B(e)
  ✦ Iterations must respect ordering of S
  ✦ As in agglomerative clustering
✦ May be dependences between iterations
✦ Sets can change during execution
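
The unordered form, `foreach e in set S do B(e)` with a set that can grow while it is being iterated, can be sketched sequentially as below. This is an illustration only: `for_each_in_worklist`, `Interval`, and `refine` are hypothetical names, the FIFO order is just one of the orders the unordered iterator permits, and the interval-splitting problem is a toy stand-in for mesh refinement.

```cpp
#include <cassert>
#include <deque>
#include <utility>
#include <vector>

// Sequential stand-in for `foreach e in set S do B(e)`: runs `body` on every
// element of the worklist, including elements the body adds while running.
// The name is illustrative and is not a Galois API.
template <typename T, typename Body>
void for_each_in_worklist(std::deque<T>& worklist, Body body) {
    while (!worklist.empty()) {
        T e = std::move(worklist.front());
        worklist.pop_front();
        body(e, worklist);  // the body may push new work onto the same worklist
    }
}

// Toy refinement problem: an interval is "bad" if it is longer than max_len.
// Fixing a bad interval splits it in half, which may create new bad intervals.
struct Interval {
    double lo;
    double hi;
};

std::vector<Interval> refine(const std::vector<Interval>& input, double max_len) {
    assert(max_len > 0.0);
    std::deque<Interval> worklist(input.begin(), input.end());
    std::vector<Interval> good;
    for_each_in_worklist(worklist,
                         [&](const Interval& iv, std::deque<Interval>& wl) {
        if (iv.hi - iv.lo <= max_len) {
            good.push_back(iv);  // nothing to fix
            return;
        }
        double mid = (iv.lo + iv.hi) / 2.0;
        wl.push_back({iv.lo, mid});  // newly created work
        wl.push_back({mid, iv.hi});
    });
    return good;
}
```

Because the loop is expressed through one helper rather than an explicit `while` over `get()`, a runtime would have a single place to take over scheduling of the iterations.
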

Delaunay Example
    Mesh mesh = /* read in mesh */
    WorkList wl;
    wl.add(mesh.badTriangles());
    foreach Element e in wl {           // was: while (wl.size() != 0) { Element e = wl.get();
        if (e no longer in mesh) continue;
        Cavity c = new Cavity(e);
        c.expand();
        c.retriangulate();
        mesh.update(c);
        wl.add(c.badTriangles());
    }
✦ Only the loop header changes; rest of code unchanged
✦ Iterators expose worklist abstraction to runtime system

Execution Model
✦ Master thread begins execution
✦ When it encounters an iterator, it uses helper threads to aid in execution of iterations
  ✦ Iterations assigned to threads according to scheduling policy (for now, dynamic to ensure load balance)
✦ Parallel execution of iterator must respect sequential semantics of iterator
  ✦ Concurrent access control
  ✦ Serializability of iterations

Concurrent Access
✦ Concurrent invocations to a shared object must not interfere
✦ Our current implementation uses locks
✦ Can use other techniques such as transactional memory (TM)
[Figure: concurrent invocations S.add(x) and S.add(y) on a shared object S]
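
A lock-based way to keep concurrent invocations from interfering might look like the sketch below. It assumes one coarse lock per object held for the whole method call. The slides say only that the implementation uses locks, so the granularity and the class name `LockedSet` are assumptions.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <set>

// A set whose methods each hold one lock for their whole duration, so
// concurrent invocations such as add(x) and add(y) cannot interleave
// inside the underlying std::set.
template <typename T>
class LockedSet {
public:
    // Returns true if x was newly inserted.
    bool add(const T& x) {
        std::lock_guard<std::mutex> guard(mutex_);
        return items_.insert(x).second;
    }

    // Returns true if x was present and has been removed.
    bool remove(const T& x) {
        std::lock_guard<std::mutex> guard(mutex_);
        return items_.erase(x) > 0;
    }

    bool contains(const T& x) const {
        std::lock_guard<std::mutex> guard(mutex_);
        return items_.count(x) > 0;
    }

    std::size_t size() const {
        std::lock_guard<std::mutex> guard(mutex_);
        return items_.size();
    }

private:
    mutable std::mutex mutex_;  // mutable so const readers can lock too
    std::set<T> items_;
};
```

This makes each individual call atomic. It does not by itself make whole iterations serializable, which is a separate requirement.
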

Serializability
[Figure: method invocations issued by concurrent iterations.
 (a) On a set S: S.add(x), S.contains?(x), S.remove(x). Interleaving is illegal.
 (b) On a workset: each iteration calls S.get() and S.add(). Interleaving is legal (and necessary).]

Semantic Commutativity
✦ Method calls which commute can be interleaved
  ✦ Else, commutativity violation
✦ Property of abstract data type
  ✦ Implementation independent

Galois Classes
    class SetInterface {
        void add(T x);
        [commutes] add(y)      {y != x}
                   remove(y)   {y != x}
                   contains(y) {y != x}
        [inverse]  remove(x)
        bool contains(T x);
        [commutes] add(y)      {y != x}
                   remove(y)   {y != x}
        ...
    }
✦ Inverse methods
  ✦ Allow for rollback when commutativity violated
✦ Commutativity and inverse specified through interface annotation
✦ Galois classes expose abstractions to the runtime system
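
One way inverse methods could support rollback is an undo log kept per iteration, sketched below. This is a hedged illustration rather than the Galois runtime: `UndoLog`, `logged_add`, and `logged_remove` are hypothetical names, and using `add(x)` as the inverse of `remove(x)` is an assumption, since the annotation shown gives only the inverse of `add`.

```cpp
#include <cassert>
#include <functional>
#include <set>
#include <utility>
#include <vector>

// Records, for one iteration, how to undo each method call it has made.
class UndoLog {
public:
    void record(std::function<void()> inverse) {
        inverses_.push_back(std::move(inverse));
    }

    // Abort: run the inverses in reverse order of the original calls.
    void rollback() {
        for (auto it = inverses_.rbegin(); it != inverses_.rend(); ++it) (*it)();
        inverses_.clear();
    }

    // Commit: the effects stay, so the recorded inverses are dropped.
    void commit() { inverses_.clear(); }

private:
    std::vector<std::function<void()>> inverses_;
};

// add(x) is undone by remove(x), as in the [inverse] annotation. Logging
// nothing when x was already present is an assumption of this sketch.
void logged_add(std::set<int>& s, int x, UndoLog& log) {
    if (s.insert(x).second) log.record([&s, x] { s.erase(x); });
}

// Assumed counterpart: remove(x) is undone by add(x).
void logged_remove(std::set<int>& s, int x, UndoLog& log) {
    if (s.erase(x) > 0) log.record([&s, x] { s.insert(x); });
}
```

Undoing through the abstract operations, rather than by restoring memory, keeps the rollback independent of how the set is implemented.
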

Runtime System
✦ Two main components:
  ✦ Global commit pool
    ✦ Manages iterations
    ✦ Similar to reorder buffer in out-of-order execution (OOE) processors
  ✦ Per-object conflict logs
    ✦ Detect commutativity violations
    ✦ Trigger aborts if commutativity violated
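
A per-object conflict log can be pictured as a record of the calls made by iterations that have not yet committed, with each new call checked against the calls of other iterations. The sketch below is an assumption-laden illustration, not the actual runtime: the types `Op`, `Invocation`, and `ConflictLog` are hypothetical, the commutativity rule is a conservative one for a set (calls on different elements commute), and treating two `contains` calls on the same element as commuting is an added assumption.

```cpp
#include <cassert>
#include <vector>

enum class Op { Add, Remove, Contains };

struct Invocation {
    int iteration;  // id of the iteration that made the call
    Op op;
    int arg;        // the element the call refers to
};

// Conservative commutativity test for set methods: calls on different
// elements commute. Letting two contains() calls commute is an assumption
// of this sketch, made because both only read the set.
bool commutes(const Invocation& a, const Invocation& b) {
    if (a.arg != b.arg) return true;
    return a.op == Op::Contains && b.op == Op::Contains;
}

// Conflict log for one shared set.
class ConflictLog {
public:
    // Logs the call and returns true if it commutes with every logged call
    // made by other iterations; returns false otherwise, in which case the
    // caller is expected to abort an iteration.
    bool try_invoke(const Invocation& inv) {
        for (const Invocation& other : log_) {
            if (other.iteration != inv.iteration && !commutes(inv, other)) {
                return false;
            }
        }
        log_.push_back(inv);
        return true;
    }

    // When an iteration commits or aborts, its calls leave the log.
    void release(int iteration) {
        std::vector<Invocation> kept;
        for (const Invocation& i : log_) {
            if (i.iteration != iteration) kept.push_back(i);
        }
        log_.swap(kept);
    }

private:
    std::vector<Invocation> log_;
};
```

A `false` result from `try_invoke` corresponds to a commutativity violation. Which iteration is then aborted is a policy decision this sketch leaves to the caller.
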

Evaluation
✦ Evaluation platform:
  ✦ Implementation in C++
  ✦ gcc compiler on Red Hat Linux
  ✦ 4-processor, shared-memory system
    ✦ Itanium 2 @ 1.5 GHz

Evaluation – Delaunay
✦ Three different versions of benchmark
  ✦ reference – purely sequential code
  ✦ FGL – hand-written, optimistic parallel code using fine-grained locking
  ✦ meshgen – Galois version of code
✦ Input mesh generated using Triangle
  ✦ ~10K triangles
  ✦ ~4K bad triangles

Abort Ratios
✦ Optimism must be warranted
  ✦ Conflicts lead to rollbacks, which waste work
✦ FGL and meshgen have abort ratios <1% on 4 processors
✦ Closely tied to scheduling policy
  ✦ Choice of proper scheduling policy is crucial for good performance

Evaluation – Delaunay
[Figure: two plots for reference, FGL, and meshgen against # of processors (1 to 4): execution time (s) and speedup. Annotation on the speedup plot: ~3x speedup]

Performance Breakdown
[Figure: bar charts of instructions (billions) and cycles (billions) on 1 and 4 processors, each broken down into Client, Object, and Runtime. Data labels: 18.8501, 17.4675, 16.8889, 13.8951]

Related Work
✦ Weihl, 1988 – Concurrency control using commutativity properties of ADTs
✦ Rinard & Diniz, 1996 – Static commutativity analysis for parallelization
✦ Wu & Padua, 1998 – Exploiting semantic properties of containers in compilation
✦ Ni et al., 2007 – Open nesting using abstract locks

Conclusions
✦ Optimistic parallelism necessary to parallelize irregular, worklist-based applications
✦ Need to exploit high-level semantics
  ✦ Iterators to expose parallelism
  ✦ Galois classes to expose semantics of objects
Thank You! Email: milind@cs.utexas.edu