Scheduling Strategies for Optimistic Parallel Execution of Irregular Programs


  1. Scheduling Strategies for Optimistic Parallel Execution of Irregular Programs
     Milind Kulkarni, Patrick Carribault, Keshav Pingali, Ganesh Ramanarayanan, Bruce Walter, Kavita Bala and L. Paul Chew
     University of Texas at Austin and Cornell University

  2. Amorphous Data Parallelism
     • Many irregular programs implement iterative algorithms over worklists
       ‣ Mesh refinement, agglomerative clustering, maxflow algorithms, compiler analyses, ...
     • Complex dependences between iterations
     • But many iterations can be executed in parallel
     • New elements can be added to the worklist

  3. Delaunay Mesh Refinement (DMR)
     Worklist wl;
     wl.add(mesh.badTriangles());
     while (wl.size() != 0) {
       Triangle t = wl.get();
       if (t no longer in mesh) continue;
       Cavity c = new Cavity(t);
       c.expand();
       c.retriangulate();
       mesh.update(c);
       wl.add(c.badTriangles());
     }

  4. Delaunay Mesh Refinement (DMR)
     Same code as slide 3, with one annotation: there are no ordering constraints on the processing of worklist items.

  5. Parallelism in DMR
     • Bad triangles can be processed concurrently
       ‣ As long as their cavities do not overlap
       ‣ This cannot be determined until run time
     • An example of amorphous data parallelism
     • Our approach: the Galois system for optimistic parallelization [PLDI’07, ASPLOS’08]

  6. Galois System
     • User code
       ‣ Optimistic iterators: foreach e in Set s do B(e)
       ‣ Sequential semantics
     • Class libraries
       ‣ Data structures
       ‣ Conflict conditions
     • Runtime system
       ‣ Optimistic parallelization
       ‣ Conflict detection & handling
     [Figure: layered system diagram: User Code on top of Class Libraries on top of Runtime]

  7. DMR User Code
     Worklist wl;
     wl.add(mesh.badTriangles());
     while (wl.size() != 0) {
       Triangle t = wl.get();
       if (t no longer in mesh) continue;
       Cavity c = new Cavity(t);
       c.expand();
       c.retriangulate();
       mesh.update(c);
       wl.add(c.badTriangles());
     }

  8. DMR User Code
     Worklist wl;
     wl.add(mesh.badTriangles());
     foreach Triangle t in wl {
       if (t no longer in mesh) continue;
       Cavity c = new Cavity(t);
       c.expand();
       c.retriangulate();
       mesh.update(c);
       wl.add(c.badTriangles());
     }
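To make the execution model concrete, here is a minimal, self-contained Java sketch of how an optimistic foreach over a worklist might be driven: worker threads claim items, run the loop body speculatively, and retry items whose iterations conflict. The ConflictException, IterationBody interface, and retry policy are illustrative assumptions, not the actual Galois API; in the real system, conflict detection lives in the library data structures and aborted iterations are rolled back.

import java.util.concurrent.ConcurrentLinkedQueue;

// Hypothetical sketch, NOT the Galois API. Conflict detection is assumed
// to be done by library data structures, which signal a conflict by
// throwing ConflictException; rollback of iteration side effects is elided.
class ConflictException extends RuntimeException {}

interface IterationBody<T> {
    void run(T item, ConcurrentLinkedQueue<T> worklist);
}

class OptimisticForeach {
    static <T> void run(ConcurrentLinkedQueue<T> worklist,
                        IterationBody<T> body, int numThreads)
            throws InterruptedException {
        Thread[] workers = new Thread[numThreads];
        for (int i = 0; i < numThreads; i++) {
            workers[i] = new Thread(() -> {
                T item;
                // Simplified termination: a thread exits on an empty worklist,
                // which can miss work still being generated; a real runtime
                // uses a termination-detection protocol.
                while ((item = worklist.poll()) != null) {
                    try {
                        body.run(item, worklist);   // may add new work
                    } catch (ConflictException e) {
                        worklist.add(item);         // abort: retry later
                    }
                }
            });
            workers[i].start();
        }
        for (Thread w : workers) w.join();
    }
}

In this sketch, the DMR loop body of slide 8 would become the IterationBody: take a bad triangle, build and expand its cavity, retriangulate, update the mesh, and push any new bad triangles back onto the worklist.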

  9. Scheduling Impact: DMR
     [Figure: speedup (0.8–2.2) vs. # of cores (1–4) for the stack and random schedules]
     Evaluation platform: 4-core Xeon system, running the Java 1.6 HotSpot JVM
     Input mesh: 100K triangles, ~40K bad triangles

  10. Scheduling in OpenMP
     • OpenMP provides parallel DO-ALL loops for regular programs
     • The major scheduling concerns are load-balancing and overhead
     • OpenMP’s scheduling policies address these issues
       ‣ static, dynamic, guided

  11. Amorphous Data Parallelism Issues
     • Algorithmic: the efficiency of the algorithm or data structures
     • Conflicts: the likelihood that two iterations executed in parallel will conflict
     • Locality: the temporal or spatial locality exhibited in the data structures
     • Dynamically created work
     • Load-balancing and contention are still issues

  12. Scheduling Basics
     • Each iteration is executed by a single core
     • Each core executes a set of iterations in a linear order
     • Scheduling maps work from an “iteration space” to positions in an “execution schedule”
       ‣ Each iteration is mapped to a core, and to a position in that core’s execution schedule

  13.–20. Scheduling Functions
     A schedule is decomposed into three functions (slides 13–20 build these up on an example iteration space, with clusters mapped to cores P0 and P1 and each core’s work ordered in time); a sketch of this decomposition as code follows below:
     • Clustering: groups iterations into clusters; each cluster is executed on a single core
     • Labeling: maps clusters to cores; each core can have multiple clusters
     • Ordering: specifies a serial execution order for each core
     All three functions can be defined “online”
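The three scheduling functions can be read as a small API. Below is a hedged Java sketch of what such a decomposition might look like; the interface names and signatures are illustrative assumptions, not the framework’s actual types.

import java.util.List;

// Illustrative sketch of the clustering/labeling/ordering decomposition;
// these interfaces are assumptions, not Galois types.
interface Clustering<I> {
    // Partition the iteration space into clusters, each run on one core.
    List<List<I>> cluster(List<I> iterations);
}

interface Labeling<I> {
    // Assign a cluster to a core; a core may receive several clusters.
    int coreFor(List<I> cluster, int numCores);
}

interface Ordering<I> {
    // Fix a serial execution order over the iterations assigned to a core.
    List<I> order(List<I> iterationsOnCore);
}

An “online” policy would make the same three decisions incrementally, e.g. choosing the cluster and core of each newly created iteration as it is generated.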

  21. Example Instantiations
     • OpenMP’s chunked self-scheduling (a code sketch follows this slide)
       ‣ Clustering: chunked
       ‣ Labeling: dynamic
       ‣ Ordering: cluster-major
     • DMR’s “generator-computes”
       ‣ Clustering: chunked + generator-computes
       ‣ Labeling: dynamic
       ‣ Ordering: LIFO
     The Galois system provides a number of built-in scheduling policies
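As a concrete instance, here is a minimal Java sketch of chunked self-scheduling (chunked clustering, dynamic labeling, cluster-major ordering), analogous to OpenMP’s dynamic policy: threads claim fixed-size chunks of the iteration space from a shared counter and run each claimed chunk to completion before claiming the next. The class and method names are illustrative.

import java.util.concurrent.atomic.AtomicInteger;

// Sketch of chunked self-scheduling: clustering = fixed-size chunks,
// labeling = dynamic (the first free thread claims the next chunk),
// ordering = cluster-major (a claimed chunk runs to completion).
class ChunkedSelfScheduling {
    static void parallelFor(int n, int chunkSize, int numThreads,
                            java.util.function.IntConsumer body)
            throws InterruptedException {
        AtomicInteger next = new AtomicInteger(0); // next unclaimed chunk start
        Thread[] workers = new Thread[numThreads];
        for (int t = 0; t < numThreads; t++) {
            workers[t] = new Thread(() -> {
                int start;
                while ((start = next.getAndAdd(chunkSize)) < n) {
                    int end = Math.min(start + chunkSize, n);
                    for (int i = start; i < end; i++) body.accept(i);
                }
            });
            workers[t].start();
        }
        for (Thread w : workers) w.join();
    }
}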

  22. Evaluated Applications
     • Delaunay mesh refinement
     • Delaunay triangulation
     • Augmenting-paths maxflow
     • Preflow-push maxflow
     • Agglomerative clustering

  23. Sample Schedules for DMR
     • random: the default Galois schedule
     • stack: LIFO schedule
     • partitioned: data-centric schedule, based on a partitioning of the mesh
     • generator-computes: random schedule, but new work is immediately processed by the core that created it
     (A sketch of the stack and random policies follows this list.)
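To make the policy difference concrete, below is a hedged Java sketch of the stack (LIFO) and random worklist disciplines; the Worklist interface is an illustrative assumption, not a Galois class.

import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Random;

// Illustrative worklist disciplines; the interface is an assumption.
interface Worklist<T> {
    void add(T item);
    T get();            // returns null when empty
    boolean isEmpty();
}

// stack: LIFO, so newly created work is processed first.
class StackWorklist<T> implements Worklist<T> {
    private final ArrayDeque<T> deque = new ArrayDeque<>();
    public void add(T item) { deque.push(item); }
    public T get() { return deque.poll(); }
    public boolean isEmpty() { return deque.isEmpty(); }
}

// random: each get() removes a uniformly random element.
class RandomWorklist<T> implements Worklist<T> {
    private final ArrayList<T> items = new ArrayList<>();
    private final Random rng = new Random();
    public void add(T item) { items.add(item); }
    public T get() {
        if (items.isEmpty()) return null;
        int i = rng.nextInt(items.size());
        // Swap-remove: O(1) removal without shifting the tail.
        T picked = items.get(i);
        items.set(i, items.get(items.size() - 1));
        items.remove(items.size() - 1);
        return picked;
    }
    public boolean isEmpty() { return items.isEmpty(); }
}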

  24. DMR Results
     [Figure: speedup (1–3) vs. # of cores (1–4) for the generator-computes, partitioned, stack, and random schedules]

  25.–26. Summary of Results
     • Best combination of policies for each application (each cell gives the policy for initial work / dynamically created work):

     Application                  Clustering                 Labeling               Ordering
     Delaunay Mesh Refinement     random / inherited         dynamic / random       — / LIFO
     Delaunay Triangulation       data-centric / —           static / data-centric  cluster-major / random
     Augmenting Paths Maxflow     data-centric / inherited   static / data-centric  cluster-major / LIFO
     Preflow Push Maxflow         data-centric / inherited   static / data-centric  cluster-major / LIFO
     Agglomerative Clustering     unit / custom              dynamic / custom       — / —

  27. Conclusions
     • Developed a general framework for scheduling programs with amorphous data parallelism
       ‣ Subsumes the OpenMP scheduling policies
     • Implemented the framework in the Galois system
       ‣ Provides several default scheduling policies
       ‣ Allows programmers to specify their own scheduling policies when needed
