Two Roads to Parallelism: Compilers and Libraries
Lawrence Rauchwerger
Parallel Computing
• It's back (again) and ubiquitous
• We have the hardware (multicore ... petascale)
• Parallel software + productivity: not yet...
• And now ML needs it ...
Our road towards a productive parallel software development environment
For Existing Serial Programs
Previous Approaches
- Use Instruction Level Parallelism (ILP): HW + SW
  ¡ compiler (automatic), BUT not scalable
- Thread (Loop) Level (Data) Parallelism: HW + SW
  ¡ compiler (automatic), BUT insufficient coverage
  ¡ manual annotations: more scalable but labor intensive
Our Approach
- Hybrid Analysis: a seamless bridge of static and dynamic program analysis for loop-level parallelization
  ¡ USR: a powerful IR for irregular applications
  ¡ Speculation as needed for dynamic analysis
For New Programs
Previous Approaches
- Write parallel programs from scratch
- Use a parallel language, library, annotations
- Hard work!
Our Approach
- STAPL: Parallel Programming Environment
  ¡ Library of parallel algorithms, distributed containers, patterns and a run-time system
  ¡ Used in PDT, an important application for DOE & nuclear engineers; influenced Intel's TBB
  ¡ ...and perhaps similar to TensorFlow
Parallelizing Compilers
Auto-Parallelization of Sequential Programs
- Around for 30+ years: UIUC, Rice, Stanford, KAI, etc.
- Requires complex static analysis + other technology
- Not widely adopted
Our Approach
- Initially: speculative parallelization
- Better: Hybrid Analysis is the best of both: static + dynamic
- Aspects of these techniques are used in mainstream compilers and STM-based systems
- Excellent results. Major effort. Don't try at home.
Static Data Dependence Analysis: An Essential Tool for Parallelization
The Question: Are there cross-iteration dependences?
• Equivalent to determining if a system of equations has integer solutions
• In general, undecidable – until symbols become numbers (at runtime)

Linear Reference Patterns
    DO j = 1, 10
      a(j) = a(j+40)
    ENDDO
  Dependence system: 1 ≤ j_w ≤ 10, 1 ≤ j_r ≤ 10, j_w ≠ j_r, j_w = j_r + 40
- Solutions restricted to linear addressing and control (mostly small kernels)
  ¡ Geometric view: Polytope model
    • Some convex body contains no integral points
  ¡ Existential solutions: GCD Test, Banerjee Test, etc.
    • Potentially overly conservative
  ¡ General solution: Presburger formula decidability
    • Omega Test: precise, potentially slow

Nonlinear Reference Patterns
    DO j = 1, 10
      IF (x(j)>0) THEN
        A(f(j)) = ...
      ENDIF
    ENDDO
• Common cases: indirect access, recurrence without closed form
• Approaches: Linear Approximation, Symbolic Analysis, Interactive
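As a concrete illustration of the existential tests named above, here is a minimal C++ sketch of the GCD test; the function name and interface are illustrative, not taken from the slides:

    // Minimal sketch of the GCD dependence test.
    // For a write a(x*j + c1) and a read a(y*j + c2), an integer solution of
    // x*j_w - y*j_r = c2 - c1 can exist only if gcd(x, y) divides c2 - c1.
    // Assumes nonzero coefficients x and y.
    #include <cstdlib>
    #include <numeric>

    bool gcd_test_may_depend(long x, long y, long c1, long c2) {
        long g = std::gcd(std::labs(x), std::labs(y));
        return (c2 - c1) % g == 0;   // true: a dependence is possible (conservative)
    }

    // Example: a(j) = a(j+40) has x = y = 1, c1 = 0, c2 = 40, so gcd = 1 divides 40;
    // the GCD test alone cannot prove independence here, but a bounds test can.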
Run-time Dependence Analysis: Speculative Parallelization
Problem:
    FOR i = ...
      A[W[i]] = A[R[i]] + C[i]
Main Idea:
• Speculatively execute the loop in parallel and record references in private shadow data structures
• Afterwards, check the shadow data structures for data dependences
  - if no dependences, the loop was parallel
  - else re-execute safely (loop not parallel)
Cost:
• Worst case: proportional to data size
[Flowchart: Checkpoint -> Speculative parallel execution + tracing -> Analysis -> Success? Yes: End / No: Restore -> Sequential execution -> End]
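A minimal C++ sketch of the speculative run with shadow structures, assuming an OpenMP backend; for brevity it uses shared shadow flags, whereas the real scheme uses private, per-processor shadow structures merged during the analysis phase:

    #include <vector>

    // Speculatively run "A[W[i]] = A[R[i]] + C[i]" in parallel, tracing accesses.
    // Returns true if the loop turned out to be parallel; otherwise restores A
    // from the checkpoint so the caller can re-execute the loop sequentially.
    bool speculate(std::vector<double>& A, const std::vector<int>& W,
                   const std::vector<int>& R, const std::vector<double>& C) {
        std::vector<char> wr(A.size(), 0), rd(A.size(), 0);   // shadow structures
        std::vector<double> checkpoint = A;                    // checkpoint
        #pragma omp parallel for
        for (long i = 0; i < (long)W.size(); ++i) {
            rd[R[i]] = 1;                                      // trace read
            wr[W[i]] = 1;                                      // trace write
            A[W[i]] = A[R[i]] + C[i];
        }
        // Analysis: an element that was both written and read may carry a
        // cross-iteration dependence; this check is deliberately conservative.
        for (std::size_t k = 0; k < A.size(); ++k)
            if (wr[k] && rd[k]) { A = checkpoint; return false; }  // restore
        return true;                                           // success
    }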
Hybrid Analysis: Static vs. Dynamic
Compile-time Analysis (compiler)
- Symbolic analysis
- PROs: no run-time overhead
- CONs: conservative with input/computed values, indirection, control flow; weak symbolic analysis; complex recurrences
Run-time Analysis (run-time)
- Full reference-by-reference analysis
- PROs: always finds answers
- CONs: run-time overhead; ignores compile-time analysis; combinatorial explosion; impractical
Hybrid Analysis
- STATIC: symbolic analysis, extract conditions; DYNAMIC: evaluate conditions
- PROs: always finds answers; minimizes run-time overhead
- CONs: more complex static analysis
Hybrid Analysis: Compile-time Phase
Under what conditions can the loop be executed in parallel?
    DO j=1,N
      a(j)=a(j+40)
    ENDDO
1. Collect and classify memory references: READ a(j+40), WRITE a(j), for j=1,N
2. Aggregate them symbolically: READ = [41 : 40+N], WRITE = [1 : N]
3. Formulate the independence test: READ ∩ WRITE = Empty?  i.e.  [41 : 40+N] ∩ [1 : N] = ∅
4.a) If we can prove 1 ≤ N ≤ 40, declare the loop parallel.
4.b) If N is unknown, extract the run-time test: N ≤ 40.
Hybrid Analysis: Run-time Phase
Execute the loop in parallel if possible.
    DO j=1,N
      a(j)=a(j+40)
    ENDDO
4.a) If we can prove 1 ≤ N ≤ 40, declare the loop parallel.
4.b) If N is unknown, extract the run-time test: N ≤ 40.

Compile Time (test proven statically; no run-time tests performed if not necessary):
    DO PARALLEL j=1,N      ! parallel loop
      a(j)=a(j+40)
    ENDDO

Run Time (test could not be proven statically):
    IF (N ≤ 40) THEN
      DO PARALLEL j=1,N    ! parallel loop
        a(j)=a(j+40)
      ENDDO
    ELSE
      DO j=1,N             ! sequential loop
        a(j)=a(j+40)
      ENDDO
    ENDIF
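In C++ terms, the generated code amounts to loop multiversioning guarded by the extracted test; a minimal sketch assuming an OpenMP backend and an array with at least n+40 elements:

    #include <vector>

    void kernel(std::vector<double>& a, int n) {    // assumes a.size() >= n + 40
        if (n <= 40) {                              // extracted run-time independence test
            #pragma omp parallel for
            for (int j = 0; j < n; ++j)
                a[j] = a[j + 40];                   // parallel version
        } else {
            for (int j = 0; j < n; ++j)
                a[j] = a[j + 40];                   // sequential fallback
        }
    }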
Hybrid Analysis: a slightly deeper dive
    DO j = 1, n
      A(j) = A(j+40)                 ! WRITE A(j), READ A(j+40)
      IF (x>0) THEN
        A(j) = A(j) + A(j+20)        ! READ A(j+20), gated by x>0
      ENDIF
    ENDDO
Independence test: READ ∩ WRITE = Empty?
Program-level representation of references (USR):
  WRITE = [1 : n]
  READ  = [41 : 40+n] ∪ ([21 : 20+n] # x>0)      (# = gated by the predicate)
  Test:  ([41 : 40+n] ∪ ([21 : 20+n] # x>0)) ∩ [1 : n] = Empty?
Set Expression to Logic Expression
    DO j = 1, n
      A(j) = A(j+40)
      IF (x>0) THEN
        A(j) = A(j) + A(j+20)
      ENDIF
    ENDDO
1. Distribute the intersection over the union:
   ([41 : 40+n] ∪ ([21 : 20+n] # x>0)) ∩ [1 : n] = Empty?
   becomes ([41 : 40+n] ∩ [1 : n] = Empty?) ∧ (([21 : 20+n] # x>0) ∩ [1 : n] = Empty?)
2. Translate each emptiness test into a predicate:
   [41 : 40+n] ∩ [1 : n] = ∅  ⇔  n ≤ 40
   ([21 : 20+n] # x>0) ∩ [1 : n] = ∅  ⇔  (n ≤ 20 ∨ x ≤ 0)
3. The independence condition becomes:  n ≤ 40 ∧ (n ≤ 20 ∨ x ≤ 0)
Representation is Key!
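The resulting logic expression is exactly what the compiler can emit as a cheap run-time guard; an illustrative C++ form (function name assumed, not from the slides):

    // The loop is independent (parallelizable) iff  n <= 40  and  (n <= 20 or x <= 0).
    bool loop_is_independent(int n, double x) {
        return n <= 40 && (n <= 20 || x <= 0.0);
    }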
Hybrid Analysis Strategy
Independence conditions are factored into a cascade of sufficient conditions, tested at run time in order of increasing complexity:
1. O(1) scalar operations (e.g., the previous example): pass -> Execute in Parallel (independent); fail -> next test
2. O(n/k) comparisons of aggregated references: pass -> Execute in Parallel; fail -> next test
3. Reference-based LRPD test: pass -> Execute in Parallel; fail -> Execute Sequentially (dependent)
Hybrid Analysis Parallelization Coverage
[Bar chart: per-benchmark coverage for the PERFECT, SPEC2000/06 and previous SPEC suites (adm, arc2d, bdna, dyfesm, flo52, mdg, ocean, spec77, track, trfd, applu, apsi, mgrid, swim, wupwise, hydro2d, matrix300, mdljdp2, nasa7, ora, swm256, tomcatv), broken down into compile-time, RT: simple checks, RT: aggregated refs, and RT: individual refs]
• Parallelized 380 of 2,100 analyzed loops: 92% sequential coverage
Speedups: Hybrid Analysis vs. Intel ifort
• Older benchmarks with smaller datasets were run on 4 cores only
• Better performance on 14/18 benchmarks on 4 cores
• Better performance on 10/11 benchmarks on 8 cores
So... What did we accomplish?
• Full parallelization of C-tran codes (28 benchmarks at >90% coverage)
• An IR representation & a technique
We cannot declare victory because:
• It required heroic efforts
• Commercial compilers adopt slowly
• Compilers cannot create parallelism -- only programmers can!
How else?
First
• Think parallel!
Then
• Develop parallel algorithms
• Raise the level of abstraction
• Use algorithm-level abstraction (not only loops)
• Expressivity + productivity
• Optimization can be compiler generated
STAPL: Standard Template Adaptive Parallel Library
A library of parallel components that adopts the generic programming philosophy of the C++ Standard Template Library (STL).
• STL
  - Iterators provide abstract access to data stored in Containers.
  - Algorithms are sequences of instructions that transform the data.
  - Containers -> Iterators -> Algorithms
• STAPL
  - Views provide abstracted access to distributed data stored in Distributed Containers.
  - Parallel Algorithms are specified by Skeletons.
    ¡ Run-time representation is a Task Graph
  - Containers -> Views -> Algorithms -> Task Graphs
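For the STL side of the analogy, a small example of the container/iterator/algorithm triad; the STAPL counterpart would substitute a distributed container, a view over it, and a parallel algorithm (STAPL names are deliberately not shown here rather than guessed):

    #include <algorithm>
    #include <numeric>
    #include <vector>

    int sum_of_squares(const std::vector<int>& v) {
        std::vector<int> sq(v.size());
        // Algorithms operate through iterators on data stored in containers.
        std::transform(v.begin(), v.end(), sq.begin(),
                       [](int x) { return x * x; });
        return std::accumulate(sq.begin(), sq.end(), 0);
    }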
STAPL Components
User Application Code (Views, Algorithms, Containers)
• High level of abstraction, similar to C++ STL
• Task & data parallelism; asynchronous
• Parallelism (SPMD) implicit; serialization explicit
• Imperative + functional: data flow + containers
Skeleton Framework / Adaptive Framework
• SPMD programs defined by a Task Graph
• Data dependence patterns -> Skeletons
• Composition: parallel, serial, nested, ...
• Tasks: work function & data
• Fine-grain tasks (coarsened)
• Data in distributed containers
Run-time System
• ARMI communication library, scheduler, performance monitor
• Built on MPI, OpenMP, Pthreads
Execution
• Defined by data flow graphs (task graphs)
• Execution policies: scheduling, asynchrony, ...
• Distributed memory model (PGAS)
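As a rough sequential-standard analogy for a composed map-reduce skeleton with implicit parallelism, C++17 already lets an execution policy delegate scheduling to the runtime (assuming a standard library with parallel algorithm support); this is an illustration of the idea, not STAPL code:

    #include <execution>
    #include <numeric>
    #include <vector>

    double dot(const std::vector<double>& a, const std::vector<double>& b) {
        // A map (elementwise multiply) composed with a reduce (sum); the
        // execution policy leaves scheduling to the runtime, much as a
        // skeleton leaves it to the task graph's executor.
        return std::transform_reduce(std::execution::par,
                                     a.begin(), a.end(), b.begin(), 0.0);
    }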
The STAPL Graph Library (SGL)
• Many problems are modeled using graphs:
  - Web search, data mining (Google, YouTube)
  - Social networks (Facebook, Google+, Twitter)
  - Geospatial graphs (maps)
  - Scientific applications
• Many important graph algorithms:
  - Breadth-first search, single-source shortest path, strongly connected components, k-core decomposition, centralities
SGL Programming Model
[Layer diagram]
• User code: Vertex Operator, Neighbor Operator, Graph
• Library code: Runtime with KLA, Hierarchical, and Out-of-Core execution
• STAPL Runtime System, on top of OpenMP, MPI, C++11 threads
Parallel Graph Algorithms May Use
• Level-Synchronous Model
  - BSP-style iterative computation
  - Global synchronization after each level, no redundant work
  [Diagram: processors alternate local computation, communication, barrier synchronization]
• Asynchronous Model
  - Asynchronous task execution
  - Point-to-point synchronizations, possible redundant work
  [Diagram: computation tasks with interleaved computation and communication]
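A minimal sketch of a level-synchronous BFS in the vertex-operator / neighbor-operator style; this is plain sequential C++ for illustration, and the names are assumptions, not the SGL API:

    #include <limits>
    #include <vector>

    using Graph = std::vector<std::vector<int>>;   // adjacency lists

    std::vector<int> bfs_levels(const Graph& g, int source) {
        const int INF = std::numeric_limits<int>::max();
        std::vector<int> level(g.size(), INF);
        level[source] = 0;
        std::vector<int> frontier{source};
        while (!frontier.empty()) {                // one BSP superstep per level
            std::vector<int> next;
            for (int u : frontier)                 // "vertex operator" on the frontier
                for (int v : g[u])                 // "neighbor operator" on its edges
                    if (level[v] == INF) {         // each vertex visited once: no redundant work
                        level[v] = level[u] + 1;
                        next.push_back(v);
                    }
            frontier.swap(next);                   // global synchronization point
        }
        return level;
    }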