 
ALTER: Exploiting Breakable Dependences for Parallelization
Kaushik Rajan, Abhishek Udupa, William Thies
Rigorous Software Engineering, Microsoft Research, India

Parallelization Reconsidered
Sequential program: are there dependences between loop iterations?
• No: DOALL parallelism
• Yes, and the dependences are imprecise: speculation (no speedup)
• Yes, and the dependences can be reordered: commutativity analysis (no speedup)
• Yes, and the dependences can be broken: our technique, ALTER. Break dependences!
Benchmarks shown: SG3D, Floyd-Warshall, Agglomerative Clustering, Gauss Seidel, K-Means. ALTER: 2.0x speedup on four cores.

Outline
• Breakable Dependences: Stale Reads
• Deterministic Runtime System
• Assisted Parallelization
• Results
*other details in the paper*

Breakable Dependences in an Iterative Convergence Algorithm

    while(!converged) {
      for i = 1 to n {
        refine(soln[i])
      }
    }

Examples:
• Floyd Warshall algorithm
• Monotonic data-flow analyses
• Linear algebra solvers
• Stencil computations
[Figure: the DO/WHILE loop over iterations I(1), I(2), ..., I(n), shown for sequential execution and for ALTER with stale reads, where iterations run on privatized state and merge into shared memory.]

Stale Reads Execution Model
[Figure: iterations 1 to 8 with write sets W1 and W2, labeled "Stale reads".]
$W_1 \cap W_2 = \emptyset$
• Execution is valid under the StaleReads model iff
  - The commit order is some serial order of iterations (it can differ from the sequential order)
  - Each iteration reads a stale but consistent snapshot
  - Staleness is bounded: no intersecting writes by intervening iterations
Akin to Snapshot Isolation for databases

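The validity conditions above can be read as a write-write conflict check at commit time. The following is a minimal sketch of that check, not the ALTER implementation: `IterationLog`, `snapshotVersion`, `disjoint`, and `canCommit` are hypothetical names, and memory locations are modeled as plain integers.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <vector>

// Hypothetical record of one loop iteration (illustrative names only).
struct IterationLog {
    std::size_t snapshotVersion;  // number of commits visible when the iteration started
    std::set<int> writeSet;       // locations written by the iteration
};

// True if the two write sets share no location.
bool disjoint(const std::set<int>& a, const std::set<int>& b) {
    for (int loc : a) {
        if (b.count(loc) != 0) return false;
    }
    return true;
}

// Snapshot-isolation style check: `candidate` may commit only if no iteration
// that committed after its snapshot wrote a location `candidate` also wrote.
// `committed` holds the write sets of earlier iterations in commit order.
bool canCommit(const IterationLog& candidate,
               const std::vector<std::set<int>>& committed) {
    assert(candidate.snapshotVersion <= committed.size());
    for (std::size_t v = candidate.snapshotVersion; v < committed.size(); ++v) {
        if (!disjoint(candidate.writeSet, committed[v])) return false;
    }
    return true;
}
```

Only writes are compared: reads of locations that intervening iterations changed are tolerated, which is what makes the reads stale.
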
Stale Reads with Reduction
[Figure: iterations 1 to 8 with write sets W1 and W2, each shown with the reduction R excluded.]
$(W_1 \setminus R) \cap (W_2 \setminus R) = \emptyset$
Reduction $R \equiv (\mathit{var}, O)$, where
1. Every access to var is an update using operation O
2. Operator O is commutative and associative

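One way to see why the two conditions matter: if every access to the reduction variable is an update through O, each iteration can accumulate a private partial result, and the partials can be folded into the shared value in any commit order. The sketch below illustrates that idea under those assumptions; `mergeReduction` is a hypothetical helper, not part of ALTER.

```cpp
#include <algorithm>
#include <cassert>
#include <functional>
#include <vector>

// Folds private partial results into the shared value with operator `op`.
// `op` is assumed to be commutative and associative, so the order of
// `partials` (the commit order) does not change the final value.
double mergeReduction(double shared, const std::vector<double>& partials,
                      const std::function<double(double, double)>& op) {
    assert(static_cast<bool>(op));  // op must be callable
    for (double p : partials) {
        shared = op(shared, p);
    }
    return shared;
}
```

With `max`, as in the `(error, max)` annotation, the result is exactly order-independent. With floating-point `+`, associativity holds only approximately, so different commit orders can differ in the last bits.
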
Deterministic Runtime System
[Figure: FORK() gives body(1), body(2), and body(3) each a private copy of the state; EXECUTE() runs each body with read/write (RW) logging; each body then reaches a "Commit?" check; JOIN() produces the new state.]
StaleReads Commit(i): $\forall j \text{ s.t. } j < i:\ \mathit{WriteSet}(j) \cap \mathit{WriteSet}(i) = \emptyset$

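The fork, execute, commit, join cycle can be sketched as a single-threaded simulation. This is only an illustration of the control flow under stated assumptions: the names `State`, `WriteLog`, `Body`, and `runRound` are hypothetical, the bodies run one after another rather than in separate processes, and only writes are logged.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <map>
#include <set>
#include <vector>

using State = std::vector<double>;
using WriteLog = std::map<std::size_t, double>;  // index written -> new value

// One iteration body: reads the (stale) snapshot and records its writes in `log`.
using Body = std::function<void(std::size_t iter, const State& snapshot, WriteLog& log)>;

// Runs iterations [first, last) against one snapshot and commits them in
// ascending iteration order. Returns the iterations that failed to commit.
std::vector<std::size_t> runRound(State& state, std::size_t first, std::size_t last,
                                  const Body& body) {
    assert(first <= last);
    const State snapshot = state;                 // FORK: every iteration sees this copy
    std::vector<WriteLog> logs(last - first);
    for (std::size_t i = first; i < last; ++i) {  // EXECUTE: bodies are independent
        body(i, snapshot, logs[i - first]);
    }
    std::set<std::size_t> committedWrites;        // locations written by earlier commits
    std::vector<std::size_t> failed;
    for (std::size_t i = first; i < last; ++i) {  // commit in a fixed order: deterministic
        const WriteLog& log = logs[i - first];
        bool conflict = false;
        for (const auto& w : log) {
            if (committedWrites.count(w.first) != 0) { conflict = true; break; }
        }
        if (conflict) { failed.push_back(i); continue; }
        for (const auto& w : log) {               // JOIN: publish the private writes
            state[w.first] = w.second;
            committedWrites.insert(w.first);
        }
    }
    return failed;
}
```

Because the commit loop always visits iterations in the same order, the same input yields the same final state regardless of how the bodies were scheduled. A caller would presumably re-execute the returned iterations against a fresh snapshot.
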
Alter Annotations

    while(error > EPSILON) { // convergence loop: repeat until the error is small
      error = 0.0;
      for(uint32_t i = 1; i < grid->xmax - 1; ++i) {
        [StaleReads, (error, max)]
        for(uint32_t j = 1; j < grid->ymax - 1; ++j) {
          for(uint32_t k = 1; k < grid->zmax - 1; ++k) {
            oldValue = grid[i][j][k];
            grid[i][j][k] = a * grid[i][j][k]
                          + b * AddDirectNbr(grid)
                          + c * AddSquareNbr(grid)
                          + d * AddCubeNbr(grid);
            // Per-cell change; the slide omits the function name here,
            // so an absolute difference is assumed.
            error = max(error, fabs(oldValue - grid[i][j][k]));
          }
        }
      }
    }

Test Driven Parallelism Inference
[Figure: Sequential program + Test suite -> Exhaustive parallelization engine -> Candidate parallel program -> User validation.]
Exhaustive parallelization engine
• For each annotation, run all test cases and record the outcome
• Outcome of a single run: success, or failure ∈ (crash, timeout, high contention, output mismatch)
  - Output mismatch: assertion failures, or a floating-point difference outside the 0.01% tolerance

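The outcome classification can be sketched as follows. This is an assumption-laden illustration rather than the engine itself: `Outcome`, `outputsMatch`, and `annotationPasses` are hypothetical names, the 0.01% threshold is applied as a relative tolerance against the sequential output, and an annotation is assumed to survive only if every test case succeeds.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Possible results of running one test case under one candidate annotation.
enum class Outcome { Success, Crash, Timeout, HighContention, OutputMismatch };

// Outputs match when every value is within `relTol` (default 0.01%) of the
// value produced by the sequential program.
bool outputsMatch(const std::vector<double>& expected,
                  const std::vector<double>& actual,
                  double relTol = 1e-4) {
    assert(relTol >= 0.0);
    if (expected.size() != actual.size()) return false;
    for (std::size_t i = 0; i < expected.size(); ++i) {
        const double scale = std::fabs(expected[i]);
        if (std::fabs(expected[i] - actual[i]) > relTol * scale) return false;
    }
    return true;
}

// Assumed policy: keep an annotation as a candidate only if all tests succeed.
bool annotationPasses(const std::vector<Outcome>& outcomesPerTest) {
    for (Outcome o : outcomesPerTest) {
        if (o != Outcome::Success) return false;
    }
    return true;
}
```

Annotations that pass would still go to the user for validation, since a test suite can only show agreement on the inputs it covers.
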
Assisted Parallelism
ALTER (assisted parallelism):
  Sequential program + Test suite -> Exhaustive parallelization engine -> Candidate parallel program -> User validation
  Preserve program functionality
Prior art (automatic parallelism):
  Sequential program -> Conservative compiler analysis -> Parallel program
  Preserve program dependences
The figure also shows an "Auto tune for perf" step.

Benchmarks

BENCHMARK      ALGORITHM TYPE       PARALLELISM                LOOP WGT
AggloClust     Branch & bound       STALE READS                89%
GSdense        Dense algebra        STALE READS                100%
GSsparse       Sparse algebra       STALE READS                100%
FloydWarshall  Dynamic programming  STALE READS                100%
SG3D           Structured grids     STALE READS, (error, max)  96%
BarnesHut      N-body methods       DOALL                      99.6%
FFT            Spectral methods     DOALL                      100%
HMM            Graphical models     DOALL                      100%
Genome         Bioinformatics       STALE READS                89%
SSCA2          Scientific           STALE READS                76%
K-means        Data mining          STALE READS, (delta, +)    89%
Labyrinth      Engineering          _                          99%

Experimental Setup
• Experiments on a 2 x quad-core Xeon processor
• ALTER transformations in the Microsoft Phoenix compiler framework
• Comparison with dependence speculation and with manual parallelization of 2 applications

Results: Baseline
[Chart: vertical axis from 0 to 6; series staleReads, OutOfOrder, speculate, DOALL. Two annotations read "No scope for dependence speculation".]

Results: Alter
[Chart: vertical axis from 0 to 6; series staleReads, OutOfOrder, speculate, DOALL.]

Results: Manual Parallelization
[Chart: vertical axis from 0 to 6; series staleReads, OutOfOrder, speculate, DOALL. Annotations read "Good speedup with manual fine grain locking" and "Comparable performance".]

In the Paper…
• ALTER multi-process memory allocator
• ALTER collections
• Usage scenarios for ALTER
• Profiling and instrumentation overhead
• DOALL parallelism and speculation within ALTER

Related Work
• Test-driven parallelization
  - QuickStep: similar testing methods for non-deterministic programs, offers accuracy bounds [Rinard 2010]
• Assisted parallelization [Taylor 2011] [Tournavitis 2009]
  - Paralax: annotations improve precision of analysis, but dependences respected [Vandierendonck 2010]
• Implicit parallelization [Burckhardt 2010]
  - Commutative annotation for reordering [August 2007, 11]
  - Optimistic execution of irregular programs [Pingali 2008]
  - As far as we know, the stale reads execution model is new

Conclusions
• Breakable dependences must be exploited in order to parallelize certain classes of programs
• We propose a new execution model, StaleReads, that violates dependences in a principled way
• Adopt the database notion of Snapshot Isolation for loop parallelization
• ALTER is a compiler and deterministic runtime system that discovers new parallelism in programs
• We believe tools for assisted parallelism can help to overcome the limits of automatic parallelization