Optimistic Parallelism Requires Abstractions


  1. Optimistic Parallelism Requires Abstractions
     Milind Kulkarni, Keshav Pingali – The University of Texas at Austin
     Bruce Walter, Ganesh Ramanarayanan, Kavita Bala and L. Paul Chew – Cornell University

  3. Motivation
     ✦ Parallel programming very important
       ✦ Multicore processors
     ✦ Parallel programming is hard!
     ✦ Limited success in domains which deal with structured data
       ✦ Array programs
       ✦ Database applications
     ✦ What about irregular applications which deal with unstructured data?
       ✦ Compile time techniques have failed
     PLDI 2007, June 11th, 2007

  4. Galois System: Core Beliefs
     ✦ Irregular applications have worklist-style data parallelism
     ✦ Optimistic parallelization is crucial
     ✦ Parallelism should be hidden within natural syntactic constructs
     ✦ High level application semantics are critical for parallelization

  5. Outline
     ✦ Two challenge problems
     ✦ Galois programming model and implementation
     ✦ Evaluation
     ✦ Related Work
     ✦ Conclusions

  6. Delaunay Mesh Refinement
     ✦ Iterative refinement procedure to produce guaranteed quality meshes

  7. Delaunay Pseudo-code
         Mesh mesh = /* read in mesh */
         WorkList wl;
         wl.add(mesh.badTriangles());
         while (wl.size() != 0) {
           Element e = wl.get();
           if (e no longer in mesh) continue;
           Cavity c = new Cavity(e);
           c.expand();
           c.retriangulate();
           mesh.update(c);
           wl.add(c.badTriangles());
         }

  8. Delaunay Pseudo-code
         Mesh mesh = /* read in mesh */
         WorkList wl;
         wl.add(mesh.badTriangles());
         while (wl.size() != 0) {            // worklist idiom
           Element e = wl.get();
           if (e no longer in mesh) continue;
           Cavity c = new Cavity(e);
           c.expand();
           c.retriangulate();
           mesh.update(c);
           wl.add(c.badTriangles());
         }

  9. Finding Parallelism
     ✦ Can expand multiple cavities in parallel
       ✦ Provided cavities do not overlap
     ✦ Determining this statically is impossible
     ✦ Solution: Optimistic parallel execution
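The idea of expanding cavities in parallel and backing off on overlap can be illustrated with a small sequential simulation. This is only a sketch under stated assumptions: the function `optimistic_rounds`, its `width` parameter, and the representation of a cavity as a set of element ids are hypothetical, and the cavities are given up front purely for illustration (in real mesh refinement they are discovered while the iteration runs, which is why static analysis cannot decide overlap).

```python
def optimistic_rounds(cavities, width):
    """Simulate optimistic execution of work items in rounds.

    `cavities` maps a work item to the set of mesh elements it touches
    (a hypothetical stand-in for a Delaunay cavity). Up to `width` items
    are attempted per round. An item commits only if its elements are
    disjoint from those already claimed in the same round; otherwise it
    aborts and is retried in a later round.
    """
    pending = list(cavities)
    schedule = []   # one list of committed items per round
    aborts = 0
    while pending:
        attempt, pending = pending[:width], pending[width:]
        claimed, committed, retry = set(), [], []
        for item in attempt:
            if claimed.isdisjoint(cavities[item]):
                claimed |= cavities[item]
                committed.append(item)
            else:
                aborts += 1          # overlap detected: wasted work
                retry.append(item)
        schedule.append(committed)
        pending = retry + pending    # aborted items go first next round
    return schedule, aborts
```

For example, with cavities `{"a": {1, 2}, "b": {2, 3}, "c": {4}}` and `width=3`, items `a` and `c` commit in the first round, while `b` overlaps `a` on element 2, aborts once, and commits in the second round.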

  10. Agglomerative Clustering
      ✦ Create binary tree of points in a space in bottom-up fashion
      ✦ Always choose two closest points to cluster
      [Figure: points a to e shown as (a) Data points, (b) Hierarchical clusters, (c) Dendrogram]

  11. Agglomerative Clustering
      ✦ Two key data structures
        ✦ Priority Queue – Keeps pairs of points <p, n> where n is the nearest neighbor of p
          ✦ Ordered by distance
        ✦ KD-tree – Spatial structure to find nearest neighbors

  12. Finding Parallelism
      ✦ Priority queue functions as a worklist
      ✦ Seems to be completely sequential
      ✦ If clusters are independent, can be done in parallel
      [Figure: points a, b, c, d, e]
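A sequential sketch of the priority-queue-driven clustering loop may make the worklist structure easier to see. Several things here are assumptions rather than the authors' code: points are one-dimensional, a brute-force scan stands in for the KD-tree, a merged cluster is represented by the midpoint of its two children, and the function name `agglomerate` is hypothetical.

```python
import heapq
from itertools import count


def agglomerate(values):
    """Cluster 1-D points bottom-up, always merging the two closest clusters.

    Returns a nested tuple (binary tree) whose leaves are the input values.
    `values` must be non-empty.
    """
    ids = count()
    # live maps a cluster id to (position, tree). A leaf's tree is the point.
    live = {next(ids): (float(v), v) for v in values}

    def nearest(cid):
        # Brute-force scan standing in for the KD-tree lookup (assumption).
        pos = live[cid][0]
        return min((abs(live[o][0] - pos), o) for o in live if o != cid)

    # Priority queue of (distance, p, n): n is the nearest neighbor of p.
    heap = []
    if len(live) > 1:
        for p in live:
            d, n = nearest(p)
            heap.append((d, p, n))
        heapq.heapify(heap)

    while len(live) > 1:
        d, p, n = heapq.heappop(heap)
        if p not in live:
            continue                      # p was already clustered
        current = nearest(p)
        if current != (d, n):
            # Stale entry: p's nearest neighbor changed, so requeue it.
            heapq.heappush(heap, (current[0], p, current[1]))
            continue
        (pos_p, tree_p), (pos_n, tree_n) = live.pop(p), live.pop(n)
        merged = next(ids)
        # Midpoint as the cluster's representative point (assumption).
        live[merged] = ((pos_p + pos_n) / 2, (tree_p, tree_n))
        if len(live) > 1:
            d_new, n_new = nearest(merged)
            heapq.heappush(heap, (d_new, merged, n_new))
    return next(iter(live.values()))[1]
```

On the input `[0, 1, 10, 11]`, the pairs (0, 1) and (10, 11) are merged before the two resulting clusters are joined. Those first two merges touch disjoint points, which is the kind of independence the slide refers to.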

  13. Lessons Learned
      ✦ Worklist-style data parallelism
        ✦ May be dependences between iterations
      ✦ However, worklist abstractions are missing from the code
      ✦ Concurrent access to shared objects a must
        ✦ worklist, priority queue, kd-tree

  14. Galois Programming Model and Implementation

  15. Programming Model
      ✦ Object-based shared memory model
      ✦ Client code must invoke methods to access object state
      ✦ Client code has sequential semantics
      ✦ But runtime system may execute code in parallel
      [Figure: Client Code accessing Galois Objects]

  16. Worklist Abstractions
      ✦ Iterators over collections
      ✦ foreach e in set S do B(e)
        ✦ Iterations can execute in any order
        ✦ As in Delaunay mesh refinement
      ✦ foreach e in poSet S do B(e)
        ✦ Iterations must respect ordering of S
        ✦ As in agglomerative clustering
      ✦ May be dependences between iterations
      ✦ Sets can change during execution
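The sequential meaning of the two iterators can be sketched as follows. This only models the semantics a client sees (any order versus smallest first, with sets that may grow during the loop); it says nothing about how the Galois runtime schedules iterations. The names `foreach_set`, `foreach_poset`, and `refine`, and the interval-splitting example standing in for mesh refinement, are hypothetical.

```python
import heapq


def foreach_set(work, body):
    """Unordered iterator: call body(e, add) for each element, in any order.

    body may call add(x) to grow the set while the loop is running.
    """
    work = set(work)
    while work:
        e = work.pop()          # arbitrary element: no order is promised
        body(e, work.add)


def foreach_poset(work, body):
    """Ordered iterator: elements reach body smallest first,
    including elements added during the loop."""
    heap = list(work)
    heapq.heapify(heap)
    while heap:
        e = heapq.heappop(heap)
        body(e, lambda x: heapq.heappush(heap, x))


def refine(intervals, max_len):
    """Toy 1-D analogue of mesh refinement: split every interval longer
    than max_len and put the halves back on the worklist."""
    done = []

    def body(interval, add):
        lo, hi = interval
        if hi - lo <= max_len:
            done.append(interval)
        else:
            mid = (lo + hi) / 2
            add((lo, mid))
            add((mid, hi))

    foreach_set(intervals, body)
    return sorted(done)
```

Calling `refine([(0.0, 4.0)], 1.0)` yields four unit intervals regardless of the order in which the unordered iterator hands out work, which is what makes the unordered form a good fit for refinement-style loops.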

  17. Delaunay Example
          Mesh mesh = /* read in mesh */
          WorkList wl;
          wl.add(mesh.badTriangles());
          while (wl.size() != 0) {
            Element e = wl.get();
            if (e no longer in mesh) continue;
            Cavity c = new Cavity(e);
            c.expand();
            c.retriangulate();
            mesh.update(c);
            wl.add(c.badTriangles());
          }

  18. Delaunay Example
          Mesh mesh = /* read in mesh */
          WorkList wl;
          wl.add(mesh.badTriangles());
          foreach Element e in wl {
            // rest of code unchanged
            if (e no longer in mesh) continue;
            Cavity c = new Cavity(e);
            c.expand();
            c.retriangulate();
            mesh.update(c);
            wl.add(c.badTriangles());
          }

  19. Delaunay Example
          Mesh mesh = /* read in mesh */
          WorkList wl;
          wl.add(mesh.badTriangles());
          // Iterators expose worklist abstraction to runtime system
          foreach Element e in wl {
            if (e no longer in mesh) continue;
            Cavity c = new Cavity(e);
            c.expand();
            c.retriangulate();
            mesh.update(c);
            wl.add(c.badTriangles());
          }

  20. Execution Model
      ✦ Master thread begins execution
      ✦ When it encounters an iterator, it uses helper threads to aid in execution of iterations
        ✦ Iterations assigned to threads according to scheduling policy (for now, dynamic to ensure load balance)
      ✦ Parallel execution of iterator must respect sequential semantics of iterator
        ✦ Concurrent access control
        ✦ Serializability of iterations

  21. Concurrent Access
      ✦ Concurrent invocations to a shared object must not interfere
      ✦ Our current implementation uses locks
      ✦ Can use other techniques such as transactional memory (TM)
      [Figure: shared object S receiving concurrent calls S.add(x) and S.add(y)]

  22. Serializability
      [Figure, two panels of interleaved method calls from different iterations]
      (a) Set S: S.add(x), S.contains?(x), S.remove(x). Interleaving is illegal
      (b) Workset: ... = S.get(), ... = S.get(), S.add(), S.add(). Interleaving is legal (and necessary)

  23. Semantic Commutativity
      ✦ Method calls which commute can be interleaved
        ✦ Else, commutativity violation
      ✦ Property of abstract data type
        ✦ Implementation independent
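A rough sketch of a semantic commutativity check for a set, written against abstract method calls rather than memory locations. The rule that calls on different elements commute follows the `{y != x}` conditions shown in the deck's SetInterface slide. Treating two `contains` calls on the same element as commuting, and applying the same rule to `remove`, are assumptions, as are the names `commutes` and `first_violation`.

```python
def commutes(call_a, call_b):
    """Return True if two set method calls commute.

    A call is a (method, argument) pair with method in
    {"add", "remove", "contains"}.
    """
    (m1, x), (m2, y) = call_a, call_b
    if x != y:
        return True     # different elements: the {y != x} condition holds
    # Same element: only two read-only calls commute (assumption).
    return m1 == "contains" and m2 == "contains"


def first_violation(trace):
    """Scan an interleaved trace of (iteration, method, argument) triples.

    Returns the first call that does not commute with an earlier call made
    by a different iteration, or None if the interleaving is legal. All
    iterations in the trace are taken to be still uncommitted.
    """
    for i, (it, m, x) in enumerate(trace):
        for it0, m0, x0 in trace[:i]:
            if it0 != it and not commutes((m0, x0), (m, x)):
                return (it, m, x)
    return None
```

With the trace `[("A", "add", "x"), ("B", "contains", "x"), ("A", "remove", "x")]`, iteration B's `contains` is reported as the violation, whereas `[("A", "add", "x"), ("B", "add", "y")]` is legal. The check never inspects how the set is stored, which matches the point that commutativity is a property of the abstract data type.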

  24. Galois Classes
          class SetInterface {
            void add(T x);
              [commutes] add(y)      {y != x}
                         remove(y)   {y != x}
                         contains(y) {y != x}
              [inverse]  remove(x)
            bool contains(T x);
              [commutes] add(y)      {y != x}
                         remove(y)   {y != x}
            ...
          }
      ✦ Inverse methods
        ✦ Allow for rollback when commutativity violated
      ✦ Commutativity and inverse specified through interface annotation

  25. Galois Classes
          class SetInterface {
            void add(T x);
              [commutes] add(y)      {y != x}
                         remove(y)   {y != x}
                         contains(y) {y != x}
              [inverse]  remove(x)
            bool contains(T x);
              [commutes] add(y)      {y != x}
                         remove(y)   {y != x}
            ...
          }
      ✦ Inverse methods
        ✦ Allow for rollback when commutativity violated
      ✦ Commutativity and inverse specified through interface annotation
      [Callout: Galois Classes expose abstractions to the runtime system]
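One way to picture inverse methods is an undo log: each state-changing call records the call that reverses it, and a rollback replays the log backwards. The class below is a sketch under assumptions, not the Galois implementation. The name `UndoableSet` and the `commit`/`rollback` methods are hypothetical, and logging an inverse only when a call actually changed the set is a refinement added here so that undoing `add(x)` does not delete an `x` that was present beforehand.

```python
class UndoableSet:
    """Set whose mutating methods log their inverses for later rollback."""

    def __init__(self, items=()):
        self._items = set(items)
        self._undo = []     # (inverse_function, argument), most recent last

    def add(self, x):
        if x not in self._items:
            self._items.add(x)
            # Inverse of add(x) is remove(x), as in the interface annotation.
            self._undo.append((self._items.discard, x))

    def remove(self, x):
        if x in self._items:
            self._items.remove(x)
            # Inverse of remove(x) is add(x) (assumed by symmetry).
            self._undo.append((self._items.add, x))

    def contains(self, x):
        return x in self._items     # read-only: nothing to undo

    def commit(self):
        """Make all logged changes permanent."""
        self._undo.clear()

    def rollback(self):
        """Undo all uncommitted changes, most recent first."""
        while self._undo:
            inverse, x = self._undo.pop()
            inverse(x)
```

A short usage example: starting from `UndoableSet({1})`, calling `add(2)` and `remove(1)` followed by `rollback()` restores a set that contains 1 and not 2.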

  26. Runtime System
      ✦ Two main components:
        ✦ Global commit pool
          ✦ Manages iterations
          ✦ Similar to reorder buffer in out-of-order execution (OOE) processors
        ✦ Per object conflict logs
          ✦ Detects commutativity violations
          ✦ Triggers aborts if commutativity violated
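The reorder-buffer analogy for the commit pool can be sketched as a queue in which iterations may finish in any order but leave only from the head. The slide states only that the pool manages iterations, so the in-order retirement shown here is an assumption drawn from the analogy, and the class `CommitPool` with its `begin`, `finish`, and `abort` methods is hypothetical.

```python
from collections import OrderedDict


class CommitPool:
    """Reorder-buffer-style pool: out-of-order finish, in-order commit."""

    def __init__(self):
        self._entries = OrderedDict()   # iteration id -> finished flag
        self.committed = []             # iteration ids in commit order

    def begin(self, iteration):
        """Register an iteration when it starts executing."""
        self._entries[iteration] = False

    def finish(self, iteration):
        """Mark an iteration as ready to commit."""
        self._entries[iteration] = True
        self._drain()

    def abort(self, iteration):
        """Drop an iteration, e.g. after a commutativity violation."""
        del self._entries[iteration]
        self._drain()

    def _drain(self):
        # Commit from the head for as long as the head has finished.
        while self._entries:
            head, done = next(iter(self._entries.items()))
            if not done:
                break
            self._entries.popitem(last=False)
            self.committed.append(head)
```

If iterations 1, 2, and 3 begin in that order and 2 finishes first, nothing commits until 1 finishes, at which point 1 and 2 commit together. An aborted iteration simply leaves the pool and would be re-registered when retried.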

  27. Evaluation
      ✦ Evaluation platform:
        ✦ Implementation in C++
        ✦ gcc compiler on Red Hat Linux
        ✦ 4 processor, shared memory system
        ✦ Itanium 2 @ 1.5 GHz

  28. Evaluation – Delaunay
      ✦ Three different versions of benchmark
        ✦ reference – purely sequential code
        ✦ FGL – hand-written, optimistic parallel code using fine-grained locking
        ✦ meshgen – Galois version of code
      ✦ Input mesh generated using Triangle
        ✦ ~10K triangles
        ✦ ~4K bad triangles

  29. Abort Ratios
      ✦ Optimism must be warranted
        ✦ Conflicts lead to rollbacks, which waste work
      ✦ FGL and meshgen have abort ratios <1% on 4 processors
      ✦ Closely tied to scheduling policy
        ✦ Choice of proper scheduling policy is crucial for good performance

  30. Evaluation – Delaunay
      [Figure: two plots comparing reference, FGL, and meshgen: Execution Time (s) vs. # of processors (1 to 4), and Speedup vs. # of processors (1 to 4)]

  31. Evaluation – Delaunay
      [Figure: two plots comparing reference, FGL, and meshgen: Execution Time (s) vs. # of processors (1 to 4), and Speedup vs. # of processors (1 to 4)]
      [Callout on the speedup plot: ~3x speedup]

  32. Performance Breakdown
      [Figure: two bar charts broken down into Client, Object, and Runtime: Instructions (billions) and Cycles (billions), each for 1 proc and 4 proc. Bar labels: 18.8501, 17.4675, 16.8889, 13.8951]

  33. Related Work
      ✦ Weihl, 1988 – Concurrency control using commutativity properties of ADTs
      ✦ Rinard & Diniz, 1996 – Static commutativity analysis for parallelization
      ✦ Wu & Padua, 1998 – Exploiting semantic properties of containers in compilation
      ✦ Ni et al., 2007 – Open nesting using abstract locks

  34. Conclusions
      ✦ Optimistic parallelism necessary to parallelize irregular, worklist-based applications
      ✦ Need to exploit high-level semantics
        ✦ Iterators to expose parallelism
        ✦ Galois classes to expose semantics of objects

  35. Thank You! Email: milind@cs.utexas.edu
