Optimistic Parallelism Requires Abstractions
Milind Kulkarni, Keshav Pingali – The University of Texas at Austin
Bruce Walter, Ganesh Ramanarayanan, Kavita Bala and L. Paul Chew – Cornell University
Motivation
✦ Parallel programming very important
  ✦ Multicore processors
✦ Parallel programming is hard!
✦ Limited success in domains which deal with structured data
  ✦ Array programs
  ✦ Database applications
✦ What about irregular applications which deal with unstructured data?
  ✦ Compile-time techniques have failed
PLDI 2007, June 11th, 2007
Galois System: Core Beliefs
✦ Irregular applications have worklist-style data parallelism
✦ Optimistic parallelization is crucial
✦ Parallelism should be hidden within natural syntactic constructs
✦ High-level application semantics are critical for parallelization
Outline
✦ Two challenge problems
✦ Galois programming model and implementation
✦ Evaluation
✦ Related Work
✦ Conclusions
Delaunay Mesh Refinement
✦ Iterative refinement procedure to produce guaranteed-quality meshes
Delaunay Pseudo-code
    Mesh mesh = /* read in mesh */
    WorkList wl;
    wl.add(mesh.badTriangles());
    while (wl.size() != 0) {
        Element e = wl.get();
        if (e no longer in mesh) continue;   // removed by an earlier cavity
        Cavity c = new Cavity(e);
        c.expand();
        c.retriangulate();
        mesh.update(c);
        wl.add(c.badTriangles());            // new bad triangles become new work
    }
Delaunay Pseudo-code
✦ The while loop that drains wl is the worklist idiom
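The worklist idiom can be tried out on a much simpler problem than mesh refinement. The sketch below is only an analogy, not the Delaunay algorithm: the "mesh" is a set of 1D intervals, a "bad" element is an interval longer than `max_len`, and fixing one splits it in half. The names `refine` and `max_len` are assumptions made for this illustration.

```python
from collections import deque


def refine(intervals, max_len):
    """Split every interval longer than max_len until none remain."""
    mesh = set(intervals)  # stand-in for the mesh
    worklist = deque(iv for iv in mesh if iv[1] - iv[0] > max_len)
    while worklist:  # wl.size() != 0
        lo, hi = worklist.popleft()  # wl.get()
        if (lo, hi) not in mesh:  # element no longer in mesh
            continue
        mid = (lo + hi) / 2
        pieces = [(lo, mid), (mid, hi)]  # plays the role of retriangulation
        mesh.remove((lo, hi))  # mesh.update(c)
        mesh.update(pieces)
        # Newly created bad elements go back on the worklist.
        worklist.extend(p for p in pieces if p[1] - p[0] > max_len)
    return sorted(mesh)


result = refine([(0.0, 4.0), (4.0, 5.0)], max_len=1.0)
```

As in the pseudo-code, processing one element can create new work, so the loop runs until the worklist is empty rather than over a fixed range.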
Finding Parallelism
✦ Can expand multiple cavities in parallel
  ✦ Provided cavities do not overlap
✦ Determining this statically is impossible
✦ Solution: Optimistic parallel execution
Agglomerative Clustering
✦ Create binary tree of points in a space in bottom-up fashion
✦ Always choose two closest points to cluster
[Figure: points a to e shown as (a) Data points, (b) Hierarchical clusters, (c) Dendrogram]
Agglomerative Clustering
✦ Two key data structures
  ✦ Priority Queue – Keeps pairs of points <p, n> where n is the nearest neighbor of p
    ✦ Ordered by distance
  ✦ KD-tree – Spatial structure to find nearest neighbors
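A minimal sequential sketch of how the priority queue of <p, n> pairs can drive the clustering. Several things here are assumptions made for illustration and are not taken from the talk: a brute-force search replaces the KD-tree, a merged cluster is represented by the midpoint of its two children, and stale queue entries are repaired lazily when they are popped.

```python
import heapq
import math


def agglomerate(points):
    """Return the merges as (id_a, id_b, new_id) in the order performed."""
    active = dict(enumerate(points))  # cluster id -> representative point
    next_id = len(points)

    def nearest(p):
        # Brute-force nearest neighbour; a KD-tree would answer this faster.
        best = None
        for q, pt in active.items():
            if q != p:
                d = math.dist(active[p], pt)
                if best is None or d < best[0]:
                    best = (d, q)
        return best

    heap = []  # entries are (distance, p, n), ordered by distance
    for p in active:
        nn = nearest(p)
        if nn is not None:
            heapq.heappush(heap, (nn[0], p, nn[1]))

    merges = []
    while heap:
        _, p, n = heapq.heappop(heap)
        if p not in active:  # p was already merged: drop the entry
            continue
        if n not in active:  # neighbour is gone: recompute and retry later
            nn = nearest(p)
            if nn is not None:
                heapq.heappush(heap, (nn[0], p, nn[1]))
            continue
        a, b = active.pop(p), active.pop(n)
        active[next_id] = tuple((x + y) / 2 for x, y in zip(a, b))
        merges.append((p, n, next_id))
        nn = nearest(next_id)
        if nn is not None:
            heapq.heappush(heap, (nn[0], next_id, nn[1]))
        next_id += 1
    return merges


merges = agglomerate([(0.0, 0.0), (1.0, 0.0), (10.0, 0.0)])
```

The loop has the same shape as the Delaunay worklist, except that elements must be taken in distance order.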
Finding Parallelism
✦ Priority queue functions as a worklist
  ✦ Seems to be completely sequential
✦ If clusters are independent, can be done in parallel
Lessons Learned
✦ Worklist-style data parallelism
  ✦ May be dependences between iterations
  ✦ However, worklist abstractions are missing from the code
✦ Concurrent access to shared objects a must
  ✦ worklist, priority queue, kd-tree
Galois Programming Model and Implementation
Programming Model
✦ Object-based shared memory model
✦ Client code must invoke methods to access object state
✦ Client code has sequential semantics
✦ But runtime system may execute code in parallel
[Figure: Client Code invoking methods on Galois Objects]
Worklist Abstractions
✦ Iterators over collections
✦ foreach e in set S do B(e)
  ✦ Iterations can execute in any order
  ✦ As in Delaunay mesh refinement
✦ foreach e in poSet S do B(e)
  ✦ Iterations must respect ordering of S
  ✦ As in agglomerative clustering
✦ May be dependences between iterations
✦ Sets can change during execution
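The sequential meaning of the two iterators can be sketched in a few lines. This is only an illustration of the semantics, not the Galois implementation: the helper names `foreach_unordered` and `foreach_ordered` are hypothetical, and the convention that the body returns the elements it wants to add is an assumption made to keep the example short.

```python
import heapq


def foreach_unordered(workset, body):
    """foreach e in set S: any order is allowed, and S may grow while iterating."""
    pending = set(workset)
    while pending:
        e = pending.pop()  # an arbitrary element
        pending.update(body(e))  # body returns the elements to add


def foreach_ordered(items, body):
    """foreach e in poSet S: always take the smallest remaining element."""
    heap = list(items)
    heapq.heapify(heap)
    while heap:
        e = heapq.heappop(heap)
        for new in body(e):
            heapq.heappush(heap, new)


visited = set()


def halve(n):
    visited.add(n)
    return [n // 2] if n > 1 else []


foreach_unordered({8}, halve)

order = []


def record(n):
    order.append(n)
    return [2] if n == 1 else []  # work added mid-iteration still runs in order


foreach_ordered([5, 1, 3], record)
```

A parallel runtime is free to run several bodies at once, as long as the result matches some execution of these sequential loops.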
Delaunay Example
    Mesh mesh = /* read in mesh */
    WorkList wl;
    wl.add(mesh.badTriangles());
    foreach Element e in wl {                // replaces the while loop
        if (e no longer in mesh) continue;   // rest of code unchanged
        Cavity c = new Cavity(e);
        c.expand();
        c.retriangulate();
        mesh.update(c);
        wl.add(c.badTriangles());
    }
Delaunay Example
✦ Iterators expose worklist abstraction to runtime system
Execution Model
✦ Master thread begins execution
✦ When it encounters an iterator, it uses helper threads to aid in execution of iterations
  ✦ Iterations assigned to threads according to scheduling policy (for now, dynamic to ensure load balance)
✦ Parallel execution of iterator must respect sequential semantics of iterator
  ✦ Concurrent access control
  ✦ Serializability of iterations
Concurrent Access
✦ Concurrent invocations to a shared object must not interfere
✦ Our current implementation uses locks
✦ Can use other techniques such as transactional memory (TM)
[Figure: two concurrent invocations, S.add(x) and S.add(y), on a shared object S]
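A minimal sketch of the lock-based option, assuming one lock per shared object. The class `LockedSet` and the worker function are hypothetical names for this illustration; the talk does not describe the locking scheme in more detail.

```python
import threading


class LockedSet:
    """A set whose method invocations are made atomic by a per-object lock."""

    def __init__(self):
        self._items = set()
        self._lock = threading.Lock()

    def add(self, x):
        with self._lock:  # concurrent add() calls cannot interfere
            self._items.add(x)

    def __len__(self):
        with self._lock:
            return len(self._items)


shared = LockedSet()


def worker(start):
    for i in range(start, start + 1000):
        shared.add(i)


threads = [threading.Thread(target=worker, args=(k * 1000,)) for k in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

The lock only makes each invocation atomic. Whether invocations from different iterations may be interleaved at all is a separate question, which serializability addresses.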
Serializability
[Figure: (a) On a set S, the calls S.add(x), S.contains?(x) and S.remove(x) come from different iterations: interleaving is illegal. (b) On a workset, two iterations each perform ... = S.get() and S.add(): interleaving is legal (and necessary).]
Semantic Commutativity
✦ Method calls which commute can be interleaved
  ✦ Else, commutativity violation
✦ Property of abstract data type
  ✦ Implementation independent
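One way to build intuition for semantic commutativity is to run two calls in both orders and compare what an observer of the abstract data type can see: the return values and the final abstract state. The sketch below does this for a Python set. The helper names are hypothetical, and the check covers only one initial state, whereas a commutativity specification has to hold for every state.

```python
def apply_call(s, method, arg):
    """Apply one abstract set operation and return its result."""
    if method == "add":
        s.add(arg)
        return None
    if method == "remove":
        s.discard(arg)
        return None
    if method == "contains":
        return arg in s
    raise ValueError(method)


def commutes(initial, call_a, call_b):
    """True if both orders give the same return values and final state."""

    def run(first, second):
        s = set(initial)
        results = {}
        for name, (method, arg) in (first, second):
            results[name] = apply_call(s, method, arg)
        return s, results

    a, b = ("a", call_a), ("b", call_b)
    return run(a, b) == run(b, a)


different_elements = commutes(set(), ("add", 1), ("contains", 2))
same_element = commutes(set(), ("add", 1), ("contains", 1))
add_then_remove = commutes(set(), ("add", 1), ("remove", 1))
```

Nothing in the check looks at how the set is stored, which is the sense in which the property is implementation independent.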
Galois Classes
✦ Inverse methods
  ✦ Allow for rollback when commutativity violated
✦ Commutativity and inverse specified through interface annotation

    class SetInterface {
        void add(T x);
            [commutes]
                add(y)      {y != x}
                remove(y)   {y != x}
                contains(y) {y != x}
            [inverse]
                remove(x)
        bool contains(T x);
            [commutes]
                add(y)      {y != x}
                remove(y)   {y != x}
        ...
    }
Galois Classes
✦ Galois Classes expose abstractions to the runtime system
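Inverse methods can be illustrated with an undo log: every call that changes the set records the call that would undo it, and a rollback replays the log backwards. This is a sketch under assumptions, not the Galois runtime. The class `UndoableSet` is hypothetical, and it logs an inverse only when the call actually changed the set, a refinement that the slide's `[inverse] remove(x)` annotation does not spell out.

```python
class UndoableSet:
    """A set that remembers how to undo the calls made since the last commit."""

    def __init__(self, items=()):
        self._items = set(items)
        self._undo = []  # (inverse method, argument), most recent last

    def add(self, x):
        if x not in self._items:
            self._items.add(x)
            self._undo.append((self._items.discard, x))  # inverse of add is remove

    def remove(self, x):
        if x in self._items:
            self._items.remove(x)
            self._undo.append((self._items.add, x))  # inverse of remove is add

    def contains(self, x):
        return x in self._items

    def rollback(self):
        """Undo every logged call, newest first."""
        while self._undo:
            inverse, x = self._undo.pop()
            inverse(x)

    def commit(self):
        """Make the changes permanent by forgetting how to undo them."""
        self._undo.clear()

    def snapshot(self):
        return frozenset(self._items)


s = UndoableSet({1, 2})
s.add(3)
s.remove(1)
s.add(2)  # already present, so nothing is logged
after_calls = s.snapshot()
s.rollback()
after_rollback = s.snapshot()
```

Because the inverses are expressed at the level of the abstract data type, rollback does not need to know how the set is implemented.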
Runtime System
✦ Two main components:
  ✦ Global commit pool
    ✦ Manages iterations
    ✦ Similar to reorder buffer in out-of-order execution (OOE) processors
  ✦ Per-object conflict logs
    ✦ Detect commutativity violations
    ✦ Trigger aborts if commutativity violated
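A per-object conflict log can be pictured as a list of the invocations made by iterations that have not yet committed. Each new invocation is checked against entries from other iterations using the commutativity conditions of the interface. The sketch below is a guess at the general shape, not the actual Galois data structure: the class `ConflictLog` is hypothetical, the `{y != x}` condition follows the SetInterface slide, and treating two `contains` calls as always commuting is an assumption, since that part of the interface is elided on the slide.

```python
class ConflictLog:
    """Outstanding set invocations, tagged with the iteration that made them."""

    def __init__(self):
        self._calls = []  # (iteration id, method, argument)

    @staticmethod
    def _commute(m1, a1, m2, a2):
        if m1 == "contains" and m2 == "contains":
            return True  # assumption: two reads never interfere
        return a1 != a2  # the {y != x} condition from the interface

    def invoke(self, iteration, method, arg):
        """Return the iterations this call conflicts with; log it if there are none."""
        conflicts = {
            it
            for it, m, a in self._calls
            if it != iteration and not self._commute(m, a, method, arg)
        }
        if not conflicts:
            self._calls.append((iteration, method, arg))
        return conflicts

    def release(self, iteration):
        """Drop an iteration's entries once it commits or aborts."""
        self._calls = [c for c in self._calls if c[0] != iteration]


log = ConflictLog()
first = log.invoke(1, "add", 5)
second = log.invoke(2, "add", 6)
third = log.invoke(2, "contains", 5)  # touches the element iteration 1 added
log.release(1)
fourth = log.invoke(2, "contains", 5)
```

A non-empty result is the point at which a runtime would abort one of the iterations and roll back its effects. Which one to abort is a policy decision that this sketch leaves to the caller.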
Evaluation
✦ Evaluation platform:
  ✦ Implementation in C++
  ✦ gcc compiler on Red Hat Linux
  ✦ 4 processor, shared memory system
  ✦ Itanium 2 @ 1.5 GHz
Evaluation – Delaunay
✦ Three different versions of benchmark
  ✦ reference – purely sequential code
  ✦ FGL – hand-written, optimistic parallel code using fine-grained locking
  ✦ meshgen – Galois version of code
✦ Input mesh generated using Triangle
  ✦ ~10K triangles
  ✦ ~4K bad triangles
Abort Ratios
✦ Optimism must be warranted
  ✦ Conflicts lead to rollbacks, which waste work
✦ FGL and meshgen have abort ratios <1% on 4 processors
✦ Closely tied to scheduling policy
  ✦ Choice of proper scheduling policy is crucial for good performance
Evaluation – Delaunay
[Figure: two plots for reference, FGL and meshgen on 1 to 4 processors: execution time (s) and speedup]
Evaluation – Delaunay
[Figure: the same execution time and speedup plots, annotated with ~3x speedup]
Performance Breakdown
[Figure: two bar charts for 1 proc and 4 proc, cycles (billions) and instructions (billions), with each bar broken down into Client, Object and Runtime. Labelled values in extraction order: 18.8501, 17.4675, 16.8889, 13.8951]
Related Work
✦ Weihl, 1988 – Concurrency control using commutativity properties of ADTs
✦ Rinard & Diniz, 1996 – Static commutativity analysis for parallelization
✦ Wu & Padua, 1998 – Exploiting semantic properties of containers in compilation
✦ Ni et al., 2007 – Open nesting using abstract locks
Conclusions
✦ Optimistic parallelism necessary to parallelize irregular, worklist-based applications
✦ Need to exploit high-level semantics
  ✦ Iterators to expose parallelism
  ✦ Galois classes to expose semantics of objects
Thank You! Email: milind@cs.utexas.edu