
ALTER: Exploiting Breakable Dependences for Parallelization
Kaushik Rajan, Abhishek Udupa, William Thies (Rigorous Software Engineering)


  1. ALTER: Exploiting Breakable Dependences for Parallelization
     Kaushik Rajan, Abhishek Udupa, William Thies
     Rigorous Software Engineering, Microsoft Research, India

  2. Parallelization Reconsidered
     Sequential program: are there dependences between loop iterations?
     • No: DOALL parallelism
     • Yes

  3. Parallelization Reconsidered
     Sequential program: are there dependences between loop iterations?
     • No: DOALL parallelism
     • Yes:
       - Speculation (dependences are imprecise): no speedup
       - Commutativity analysis (dependences can be reordered): no speedup
       - Our technique: break dependences! (dependences can be broken): 2.0x speedup on four cores
     Example programs: SG3D, Floyd-Warshall, Agglomerative Clustering, Gauss Seidel, K-Means

  4. Parallelization Reconsidered (final build)
     Same decision diagram as the previous slide, with the dependence-breaking technique labeled ALTER (2.0x speedup on four cores).

  5. Outline
     • Breakable Dependences: Stale Reads
     • Deterministic Runtime System
     • Assisted Parallelization
     • Results
     (other details in the paper)

  6. Breakable Dependences in an Iterative Convergence Algorithm

         while (!converged) {
             for i = 1 to n {
                 refine(soln[i])
             }
         }

     Examples:
     • Floyd Warshall algorithm
     • Monotonic data-flow analyses
     • Linear algebra solvers
     • Stencil computations
     Diagram: iterations I(1), I(2), …, I(n) of the DO WHILE loop, drawn three times (labels: sequential, ALTER: stale reads, privatized, shared memory, merge).
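
The loop on this slide can be made concrete with a small sketch. It assumes a hypothetical `refine` step (replace each interior point by the mean of its neighbours), which is not taken from the talk; the names `sweep` and `solve` are illustrative too. With `staleReads` set, every iteration reads a snapshot taken at the start of the sweep instead of the latest values, and the loop still converges to the same answer.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// One sweep of refine(soln[i]) over the interior points (hypothetical refine:
// mean of the two neighbours). With staleReads == false, iteration i sees the
// write made by iteration i-1. With staleReads == true, every iteration reads
// a snapshot taken at the start of the sweep.
// Returns the largest change made to any element.
double sweep(std::vector<double>& soln, bool staleReads) {
    const std::vector<double> snapshot = soln;
    const std::vector<double>* src = staleReads ? &snapshot : &soln;
    double maxChange = 0.0;
    for (std::size_t i = 1; i + 1 < soln.size(); ++i) {
        const double refined = 0.5 * ((*src)[i - 1] + (*src)[i + 1]);
        maxChange = std::max(maxChange, std::fabs(refined - soln[i]));
        soln[i] = refined;
    }
    return maxChange;
}

// Repeats sweeps until no element changes by more than eps.
// Returns the number of sweeps that were needed.
int solve(std::vector<double>& soln, bool staleReads, double eps) {
    int sweeps = 0;
    while (true) {
        ++sweeps;
        if (sweep(soln, staleReads) <= eps) break;
    }
    return sweeps;
}
```

Both modes reach the same fixed point; the stale variant may simply need more sweeps, which is the trade the slide describes.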

  7. Stale Reads Execution Model
     Diagram: iterations labeled 3, 1, 5, 7, 8, 2, 4, 6, with write sets $W_1$ and $W_2$ such that $W_1 \cap W_2 = \emptyset$ (stale reads).
     • An execution is valid under the staleReads model iff
       - the commit order is some serial order of iterations (it can differ from the sequential order)
       - each iteration reads a stale but consistent snapshot
       - staleness is bounded: no intersecting writes by intervening iterations
     Akin to Snapshot Isolation for databases
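
The validity conditions above can be read as a check over a commit log. The sketch below is one possible encoding, not the ALTER implementation: the `Iteration` record, its `snapshot` field (how many commits were visible when the iteration started) and the function names are all assumptions made for illustration.

```cpp
#include <cstddef>
#include <set>
#include <vector>

// Hypothetical record of one committed iteration.
struct Iteration {
    std::size_t snapshot;   // number of commits visible when the iteration started
    std::set<int> writes;   // memory locations written by the iteration
};

// True if the two write sets share no location.
bool disjoint(const std::set<int>& a, const std::set<int>& b) {
    for (int loc : a) {
        if (b.count(loc) != 0) return false;
    }
    return true;
}

// `commits` lists iterations in commit order. The execution is accepted if
// each iteration's writes are disjoint from the writes of every iteration
// that committed after its snapshot was taken and before it committed
// (the "intervening" iterations).
bool validStaleReads(const std::vector<Iteration>& commits) {
    for (std::size_t i = 0; i < commits.size(); ++i) {
        if (commits[i].snapshot > i) return false;  // snapshot cannot lie in the future
        for (std::size_t j = commits[i].snapshot; j < i; ++j) {
            if (!disjoint(commits[i].writes, commits[j].writes)) return false;
        }
    }
    return true;
}
```

Only write sets are compared; read sets are deliberately ignored, which is what allows an iteration to commit after reading stale values.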

  8. Stale Reads with Reduction
     Diagram: iterations labeled 3, 1, 5, 7, 8, 2, 4, 6, with write sets $W_1$, $W_2$ and a reduction set $R \subseteq W_1$, $R \subseteq W_2$.
     Condition: $(W_1 \setminus R) \cap (W_2 \setminus R) = \emptyset$
     reduction $R := (\mathit{var}, O)$ where
       1. Every access to var is an update using operation O
       2. Operator O is commutative and associative
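
A minimal sketch of why the two conditions on O matter: if each iteration contributes one update to the reduction variable, the contributions can be merged in any commit order and give the same result. The function name `mergeReduction` and the use of `double` are assumptions for illustration; the operators correspond to annotations such as `(error, max)` and `(delta, +)`.

```cpp
#include <algorithm>
#include <functional>
#include <vector>

// Merge per-iteration contributions to a reduction variable into the shared
// value. `op` is assumed to be commutative and associative, so the result
// does not depend on the order in which iterations commit.
double mergeReduction(double shared,
                      const std::vector<double>& contributions,
                      const std::function<double(double, double)>& op) {
    for (double c : contributions) {
        shared = op(shared, c);
    }
    return shared;
}
```

Note that floating-point addition is only approximately associative, so with `+` the merged value can differ in the low-order bits between commit orders.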

  9. Deterministic Runtime System
     Diagram: state -> FORK() -> three tasks running body(1), body(2), body(3), each on private memory with RW logging -> EXECUTE() -> commit checks ("Commit?") for tasks 1, 2, 3 -> JOIN() -> state.
     StaleReads Commit(i): $\forall j \text{ s.t. } j < i:\ \mathit{writes}_i \cap \mathit{writes}_j = \emptyset$
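
The fork, execute and join cycle can be simulated in a few lines. This is a sequential sketch under stated assumptions, not the ALTER runtime: tasks are run one after another rather than in separate processes, a task that fails its commit check is retried in the next round, and the names `State`, `WriteLog`, `Body` and `runStaleReads` are hypothetical. Commits are attempted in iteration order, so the outcome is deterministic.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <map>
#include <set>
#include <vector>

using State = std::vector<double>;
using WriteLog = std::map<std::size_t, double>;  // location -> new value
// Loop body: reads only the private snapshot and returns its logged writes.
using Body = std::function<WriteLog(const State& snapshot, std::size_t iter)>;

// Runs iterations [0, n) in rounds of at most `chunk` tasks.
// Returns the number of rounds that were needed.
std::size_t runStaleReads(State& state, std::size_t n, std::size_t chunk,
                          const Body& body) {
    assert(chunk > 0);
    std::vector<std::size_t> pending;
    for (std::size_t i = 0; i < n; ++i) pending.push_back(i);

    std::size_t rounds = 0;
    while (!pending.empty()) {
        ++rounds;
        const State snapshot = state;  // FORK: every task starts from the same state
        const std::size_t batch = std::min(chunk, pending.size());
        std::set<std::size_t> committed;  // locations written by earlier commits this round
        std::vector<std::size_t> retry;

        for (std::size_t k = 0; k < batch; ++k) {
            const std::size_t iter = pending[k];
            const WriteLog log = body(snapshot, iter);  // EXECUTE on the private snapshot
            bool conflict = false;
            for (const auto& entry : log) {
                if (committed.count(entry.first) != 0) conflict = true;
            }
            if (conflict) {            // Commit(i) fails: writes intersect an earlier task
                retry.push_back(iter);
                continue;
            }
            for (const auto& entry : log) {  // commit the logged writes
                state[entry.first] = entry.second;
                committed.insert(entry.first);
            }
        }
        // JOIN: failed tasks and tasks not yet started form the next round.
        retry.insert(retry.end(), pending.begin() + batch, pending.end());
        pending = retry;
    }
    return rounds;
}
```

The first task of every round always commits, so each round makes progress and the loop terminates.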

  10. Alter Annotations

      while (error >= EPSILON) {  // convergence loop: repeat until the error falls below EPSILON
          error = 0.0;
          for (uint32_t i = 1; i < grid->xmax - 1; ++i) {
              [StaleReads, (error, max)]
              for (uint32_t j = 1; j < grid->ymax - 1; ++j) {
                  for (uint32_t k = 1; k < grid->zmax - 1; ++k) {
                      oldValue = grid[i][j][k];
                      grid[i][j][k] = a * grid[i][j][k]
                                    + b * AddDirectNbr(grid)
                                    + c * AddSquareNbr(grid)
                                    + d * AddCubeNbr(grid);
                      // track the largest change made to any cell in this sweep
                      error = max(error, fabs(oldValue - grid[i][j][k]));
                  }
              }
          }
      }

  11. Test Driven Parallelism Inference
      Diagram: Sequential program + Test suite -> Exhaustive parallelization engine -> Candidate parallel program -> User validation.
      Exhaustive parallelization engine:
      • For each annotation, run all test cases and record the outcome
      • Outcome of a single run: success or failure, where failure ∈ {crash, timeout, high contention, output mismatch}
        - Output mismatch: assertion failures or floating point difference < 0.01%
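
The engine's search loop might look roughly like the sketch below. It assumes a hypothetical `Runner` hook that executes the program under one annotation on one test case and reports the outcome; the enum values mirror the outcomes listed on the slide, and `inferCandidates` is an illustrative name rather than part of ALTER.

```cpp
#include <functional>
#include <string>
#include <vector>

enum class Outcome { Success, Crash, Timeout, HighContention, OutputMismatch };

// Hypothetical hook: runs the program under `annotation` on test case `testId`.
using Runner = std::function<Outcome(const std::string& annotation, int testId)>;

// Tries every annotation against every test case and returns the annotations
// for which all runs succeeded. These are candidates for user validation.
std::vector<std::string> inferCandidates(const std::vector<std::string>& annotations,
                                         int numTests, const Runner& run) {
    std::vector<std::string> candidates;
    for (const std::string& annotation : annotations) {
        bool allPass = true;
        for (int t = 0; t < numTests; ++t) {
            if (run(annotation, t) != Outcome::Success) allPass = false;
        }
        if (allPass) candidates.push_back(annotation);
    }
    return candidates;
}
```

Passing the test suite only makes an annotation a candidate; the final decision is left to the user, as the diagram indicates.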

  12. Assisted Parallelism
      • ALTER (assisted parallelism): Sequential program + Test suite -> Exhaustive parallelization engine -> Candidate parallel program -> User validation. Preserves program functionality.
      • Prior art (automatic parallelism): Sequential program -> Conservative compiler analysis -> Parallel program. Preserves dependences.
      Additional label on the diagram: Auto tune for perf

  13. Benchmarks

      | Benchmark     | Algorithm type      | Parallelism               | Loop weight |
      |---------------|---------------------|---------------------------|-------------|
      | AggloClust    | Branch & bound      | STALE READS               | 89%         |
      | GSdense       | Dense algebra       | STALE READS               | 100%        |
      | GSsparse      | Sparse algebra      | STALE READS               | 100%        |
      | FloydWarshall | Dynamic programming | STALE READS               | 100%        |
      | SG3D          | Structured grids    | STALE READS, (error, max) | 96%         |
      | BarnesHut     | N-body methods      | DOALL                     | 99.6%       |
      | FFT           | Spectral methods    | DOALL                     | 100%        |
      | HMM           | Graphical models    | DOALL                     | 100%        |
      | Genome        | Bioinformatics      | STALE READS               | 89%         |
      | SSCA2         | Scientific          | STALE READS               | 76%         |
      | K-means       | Data mining         | STALE READS, (delta, +)   | 89%         |
      | Labyrinth     | Engineering         | _                         | 99%         |

  14. Experimental Setup
      • Experiments on a 2 x quad core Xeon processor
      • Alter transformations in Microsoft Phoenix compiler framework
      • Comparison with dependence speculation and manual parallelization of 2 applications

  15. Results: Baseline
      [Chart; y-axis ticks 0 to 6; legend: staleReads, OutOfOrder, speculate, DOALL; callout shown twice: "No scope for dependence speculation"]

  16. Results: Alter
      [Chart; y-axis ticks 0 to 6; legend: staleReads, OutOfOrder, speculate, DOALL]

  17. Results: Manual Parallelization
      [Chart; y-axis ticks 0 to 6; legend: staleReads, OutOfOrder, speculate, DOALL; callouts: "Good speedup with manual fine grain locking", "Comparable performance"]

  18. In the Paper…
      • ALTER multi-process memory allocator
      • ALTER collections
      • Usage scenarios for ALTER
      • Profiling and instrumentation overhead
      • DOALL parallelism and speculation within ALTER

  19. Related Work
      • Test-driven parallelization
        - QuickStep: similar testing methods for non-deterministic programs, offers accuracy bounds [Rinard 2010]
      • Assisted parallelization [Taylor 2011] [Tournavitis 2009]
        - Paralax: annotations improve precision of analysis, but dependences respected [Vandierendonck 2010]
      • Implicit parallelization [Burckhardt 2010]
        - Commutative annotation for reordering [August 2007, 11]
        - Optimistic execution of irregular programs [Pingali 2008]
        - As far as we know, stale reads execution model is new

  20. Conclusions
      • Breakable dependences must be exploited in order to parallelize certain classes of programs
      • We propose a new execution model, StaleReads, that violates dependences in a principled way
        - Adopt database notion of Snapshot Isolation for loop parallelization
      • ALTER is a compiler and deterministic runtime system that discovers new parallelism in programs
      • We believe tools for assisted parallelism can help to overcome the limits of automatic parallelization
