  1. Kunle Olukotun, Pervasive Parallelism Laboratory, Stanford University

  2.
     - Unleash the full power of future computing platforms
     - Make parallel application development practical for the masses ("Joe the programmer")
     - Parallel applications without parallel programming

  3.
     - Heterogeneous parallel hardware: computing is energy constrained, and specialization gives energy- and area-efficient parallel computing
     - Must hide low-level issues from most programmers: explicit parallelism won't work (10K-100K threads), and hiding it is the only way to get simple and portable programs
     - No single discipline can solve all the problems: apps, PLs, runtime, OS, and architecture need vertical integration
     - Hanrahan, Aiken, Rosenblum, Kozyrakis, Horowitz

  4.
     - Heterogeneous hardware for energy efficiency: multi-core, ILP, threads, data-parallel engines, custom engines
     - H.264 encode study: performance and energy savings (log scale) moving from 4 cores, to + ILP, + SIMD, + custom instructions, up to a full ASIC
     - Source: "Understanding Sources of Inefficiency in General-Purpose Chips" (ISCA '10)

  5. A range of parallel machines, all driven by energy efficiency: Sun T2, Nvidia Fermi, Altera FPGA, Cray Jaguar

  6. Each machine comes with its own programming models: Sun T2 (Pthreads, OpenMP), Nvidia Fermi (CUDA, OpenCL), Altera FPGA (Verilog, VHDL), Cray Jaguar (MPI)

  7. Applications (scientific engineering, virtual worlds, personal robotics, data informatics) must target all of these: Sun T2 (Pthreads, OpenMP), Nvidia Fermi (CUDA, OpenCL), Altera FPGA (Verilog, VHDL), Cray Jaguar (MPI). Too many different programming models.

  8. It is possible to write one program and run it on all these machines

  9. The goal: a single ideal parallel programming language that maps applications (scientific engineering, virtual worlds, personal robotics, data informatics) onto all of these machines (Sun T2, Nvidia Fermi, Altera FPGA, Cray Jaguar), despite the many different programming models.

  10. Performance vs. productivity vs. generality

  11. Performance vs. productivity vs. generality

  12. Domain-specific languages: aiming for performance, productivity, and generality together

  13. Domain-specific languages (DSLs): programming languages with restricted expressiveness for a particular domain; high-level and usually declarative.
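
     To make "restricted expressiveness, high-level, and usually declarative" concrete, here is a minimal, hypothetical query-style DSL embedded in Scala (the TinyQueryDSL object, its Table type, and the where/sumOf operations are invented for illustration and are not part of any PPL system): the user states what to compute over a dataset, and the library is free to decide how, including in parallel.

       object TinyQueryDSL {
         final case class Table(rows: Seq[Map[String, Double]]) {
           // Restricted operations: filter and aggregate only; no arbitrary control flow.
           def where(p: Map[String, Double] => Boolean): Table = Table(rows.filter(p))
           def sumOf(col: String): Double = rows.map(_.getOrElse(col, 0.0)).sum
         }

         def main(args: Array[String]): Unit = {
           val sales = Table(Seq(
             Map("region" -> 1.0, "amount" -> 10.0),
             Map("region" -> 2.0, "amount" -> 25.0)
           ))
           // Declarative: the implementation is free to parallelize the scan.
           val total = sales.where(_("region") == 2.0).sumOf("amount")
           println(total) // 25.0
         }
       }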

  14.
     - Productivity
       - Shield average programmers from the difficulty of parallel programming
       - Focus on developing algorithms and applications, not on low-level implementation details
     - Performance
       - Match the high-level domain abstraction to generic parallel execution patterns
       - Restrict expressiveness to more easily and fully extract the available parallelism
       - Use domain knowledge for static/dynamic optimizations
     - Portability and forward scalability
       - The DSL and runtime can evolve to take advantage of the latest hardware features
       - Applications remain unchanged
       - Allows innovative hardware without worrying about application portability

  15. The PPL research stack:
     - Applications: data informatics, scientific engineering, virtual worlds, personal robotics
     - Domain-specific languages: Physics (Liszt), Data Analysis, Probabilistic (RandomT), Machine Learning (OptiML), Rendering
     - Domain embedding language (Scala): polymorphic embedding, staging, static domain-specific optimization
     - DSL infrastructure / parallel runtime (Delite, Sequoia, GRAMPS): dynamic domain-specific optimization, task and data parallelism, locality-aware scheduling
     - Heterogeneous hardware architecture: OOO cores, SIMD cores, threaded cores, specialized cores; programmable hierarchies, scalable coherence, isolation and atomicity, on-chip networks, pervasive monitoring

  16.
     - We need to develop all of these DSLs
     - Current DSL methods are unsatisfactory

  17.
     - Stand-alone DSLs
       - Can include extensive optimizations
       - Enormous effort to develop to a sufficient degree of maturity: an actual compiler and its optimizations, plus tooling (IDE, debuggers, ...)
       - Interoperation between multiple DSLs is very difficult
     - Purely embedded DSLs ("just a library"; see the sketch below)
       - Easy to develop (can reuse the full host language) and easier to learn
       - Can combine multiple DSLs in one program and share DSL infrastructure among several DSLs
       - Hard to optimize using domain knowledge
       - Can only target the same architecture as the host language
     - We need to do better
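
     For contrast, this sketch shows the "just a library" style in plain Scala (the Vec type and its operations are hypothetical): every operation executes immediately in the host language, so there is no program representation left to optimize with domain knowledge or retarget to other hardware.

       object ShallowVectorDSL {
         final case class Vec(data: Array[Double]) {
           def +(that: Vec): Vec = Vec(data.zip(that.data).map { case (a, b) => a + b })
           def *(s: Double): Vec = Vec(data.map(_ * s))
         }

         def main(args: Array[String]): Unit = {
           val a = Vec(Array(1.0, 2.0))
           val b = Vec(Array(3.0, 4.0))
           // Allocates a temporary for (a + b); the library never sees the whole
           // expression, so it cannot fuse the loops or move them to a GPU.
           val c = (a + b) * 2.0
           println(c.data.mkString(", ")) // 8.0, 12.0
         }
       }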

  18.
     - Goal: develop embedded DSLs that perform as well as stand-alone ones
     - Intuition: general-purpose languages should be designed with DSL embedding in mind

  19. Scala
     - Mixes OO and FP paradigms
     - Targets the JVM
     - Expressive type system allows powerful abstraction
     - A scalable language
     - Stanford/EPFL collaboration on leveraging Scala for parallelism: "Language Virtualization for Heterogeneous Parallel Computing", Onward! 2010, Reno
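
     A small illustration of the language features listed above, mixing objects and traits (OO) with higher-order functions and parametric types (FP); the Shape hierarchy and the sumBy helper are made up for this example.

       trait Shape { def area: Double }
       final case class Circle(r: Double) extends Shape { def area: Double = math.Pi * r * r }
       final case class Square(s: Double) extends Shape { def area: Double = s * s }

       object ScalaFlavor {
         // A generic higher-order function; the type system infers B = Double below.
         def sumBy[A, B](xs: List[A])(f: A => B)(implicit num: Numeric[B]): B =
           xs.map(f).sum

         def main(args: Array[String]): Unit = {
           val shapes: List[Shape] = List(Circle(1.0), Square(2.0))
           println(sumBy(shapes)(_.area)) // ~7.14
         }
       }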

  20. Lightweight modular staging provides a hybrid approach
     - A typical compiler pipeline: lexer, parser, type checker, analysis, optimization, code generation
     - A purely embedded DSL gets all of this for free from the host language, but can't change any of it; a stand-alone DSL implements everything itself
     - With modular staging, DSLs adopt the front end of a highly expressive embedding language but can customize the IR and participate in back-end phases
     - GPCE '10: "Lightweight Modular Staging: A Pragmatic Approach to Runtime Code Generation and Compiled DSLs"
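
     The following is a minimal sketch of the staging idea in plain Scala, not the actual Lightweight Modular Staging API: DSL operations build an intermediate representation instead of computing values, so a later pass can apply domain-specific rewrites before any code runs or is generated.

       object StagingSketch {
         sealed trait Exp
         final case class Const(v: Double)      extends Exp
         final case class Sym(name: String)     extends Exp
         final case class Plus(a: Exp, b: Exp)  extends Exp
         final case class Times(a: Exp, b: Exp) extends Exp

         // The "front end" reuses Scala syntax via operator methods.
         implicit class Ops(a: Exp) {
           def +(b: Exp): Exp = Plus(a, b)
           def *(b: Exp): Exp = Times(a, b)
         }

         // A domain-specific optimization applied to the IR (algebraic simplification).
         def simplify(e: Exp): Exp = e match {
           case Times(x, Const(1.0)) => simplify(x)
           case Plus(x, Const(0.0))  => simplify(x)
           case Plus(a, b)           => Plus(simplify(a), simplify(b))
           case Times(a, b)          => Times(simplify(a), simplify(b))
           case other                => other
         }

         def main(args: Array[String]): Unit = {
           val staged = (Sym("x") + Const(0.0)) * Const(1.0) // builds an IR, computes nothing
           println(simplify(staged)) // Sym(x)
         }
       }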

  21. Building DSL compilers on Delite
     - Liszt and OptiML programs are written against the Scala embedding framework and the Delite parallelism framework
     - Delite provides a common intermediate representation (IR) that can be extended while still benefiting from generic analyses and optimizations: Base IR ⇒ Delite IR ⇒ domain-specific IR
     - The Delite IR supplies nodes that encode data-parallel execution patterns, enabling generic parallelism analysis, optimization, and mapping
     - Each DSL extends the most appropriate data-parallel nodes for its operations, enabling domain-specific analysis and optimization
     - Code generation produces an execution graph, kernels (Scala, C, CUDA, MPI, Verilog, ...), and data structures (arrays, trees, graphs, ...)
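
     The sketch below illustrates the layered-IR idea with hypothetical names (it is not the real Delite API): the generic framework only needs to understand the data-parallel MapNode shape, while a DSL contributes domain-specific nodes, such as VectorScale here, that extend it.

       object LayeredIRSketch {
         // Base IR
         trait IRNode
         // Delite-style parallel pattern node: anything shaped like a map can be
         // scheduled and fused by the generic framework without domain knowledge.
         trait MapNode extends IRNode {
           def input: Array[Double]
           def f: Double => Double
         }
         // Domain-specific node from a hypothetical ML DSL, expressed as a MapNode.
         final case class VectorScale(input: Array[Double], k: Double) extends MapNode {
           val f: Double => Double = _ * k
         }

         // Generic execution of any MapNode (sequential here; a real backend would parallelize).
         def execute(n: IRNode): Array[Double] = n match {
           case m: MapNode => m.input.map(m.f)
           case _          => sys.error("unknown node")
         }

         def main(args: Array[String]): Unit =
           println(execute(VectorScale(Array(1.0, 2.0, 3.0), 2.0)).mkString(", ")) // 2.0, 4.0, 6.0
       }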

  22. The Delite runtime maps the machine-agnostic DSL compiler output (execution graph, kernels in Scala, C, CUDA, Verilog, ..., and data structures such as arrays, trees, and graphs) onto the machine configuration (SMP, GPU, cluster) for execution
     - Inputs: the application inputs and a description of the machine
     - Walk time: the scheduler produces partial schedules, and the code generator produces fused, specialized kernels (fusion, specialization, synchronization) to be launched on each resource
     - Run time: the executor controls and optimizes execution: schedule dispatch, dynamic load balancing, memory management, lazy data transfers, kernel auto-tuning, fault tolerance
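
     As a toy illustration of the execution-graph flow (hypothetical structure and names, not Delite's actual format), the sketch below dispatches kernels in dependency order; the scheduling, fusion, and load-balancing steps described above would hook into this loop.

       object ExecGraphSketch {
         final case class Kernel(id: String, deps: Set[String], run: () => Unit)

         // Dispatch kernels once all of their dependencies have completed.
         def dispatch(kernels: List[Kernel]): Unit = {
           var done = Set.empty[String]
           var pending = kernels
           while (pending.nonEmpty) {
             val (ready, rest) = pending.partition(_.deps.subsetOf(done))
             require(ready.nonEmpty, "cycle in execution graph")
             ready.foreach { k => k.run(); done += k.id } // a real runtime would run these in parallel
             pending = rest
           }
         }

         def main(args: Array[String]): Unit = dispatch(List(
           Kernel("sum",  Set("load"), () => println("reduce on CPU")),
           Kernel("load", Set.empty,   () => println("load data")),
           Kernel("plot", Set("sum"),  () => println("render on GPU"))
         ))
       }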

  23. Liszt: solvers for mesh-based PDEs
     - Complex physical systems over huge domains (millions of cells)
     - Example: an unstructured Reynolds-averaged Navier-Stokes (RANS) solver (combustion, turbulence, fuel injection, transition, thermal effects)
     - Goal: simplify the code of mesh-based PDE solvers
     - Write once, run on any type of parallel machine, from multi-cores and GPUs to clusters

  24. The Liszt language
     - A minimal programming language: arithmetic, short vectors, functions, control flow
     - Built-in mesh interface for arbitrary polyhedra: Vertex, Edge, Face, Cell
     - Optimized memory representation of the mesh
     - Collections of mesh elements, e.g. element sets: faces(c: Cell), edgesCCW(f: Face)
     - Mapping of mesh elements to fields, e.g. val vert_position = position(v)
     - Parallelizable iteration with forall statements: for (f <- faces(cell)) { ... }

  25. A Liszt kernel (simple set comprehension, functions and function calls, mesh topology operators, field data storage):

       for (edge <- edges(mesh)) {
         val flux = flux_calc(edge)
         val v0 = head(edge)
         val v1 = tail(edge)
         Flux(v0) += flux
         Flux(v1) -= flux
       }

     The code contains possible write conflicts (two edges can update the same vertex). Liszt uses architecture-specific strategies guided by domain knowledge:
     - MPI: ghost-cell-based message passing
     - GPU: coloring-based use of shared memory (see the coloring sketch below)
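
     The sketch below illustrates the coloring strategy in plain Scala (illustrative only, not Liszt's GPU implementation): edges are greedily assigned colors so that no two edges in the same color class share a vertex, which means each color class can update per-vertex Flux values in parallel without write conflicts.

       object EdgeColoringSketch {
         type Edge = (Int, Int) // (head vertex, tail vertex)

         def colorEdges(edges: Seq[Edge]): Map[Int, Seq[Edge]] = {
           // Colors already used at each vertex.
           val usedByVertex = scala.collection.mutable.Map.empty[Int, Set[Int]].withDefaultValue(Set.empty)
           val assignment = edges.map { case e @ (v0, v1) =>
             // Smallest color not already used by either endpoint.
             val c = Iterator.from(0).find(i => !usedByVertex(v0)(i) && !usedByVertex(v1)(i)).get
             usedByVertex(v0) += c
             usedByVertex(v1) += c
             c -> e
           }
           assignment.groupBy(_._1).map { case (c, es) => c -> es.map(_._2) }
         }

         def main(args: Array[String]): Unit = {
           val mesh = Seq((0, 1), (1, 2), (2, 0), (2, 3))
           colorEdges(mesh).foreach { case (c, es) => println(s"color $c: ${es.mkString(" ")}") }
         }
       }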

  26. MPI scaling on a 750k-cell mesh, using 8 cores per node and scaling up to 96 cores (12 nodes, 8 cores per node, all communication using MPI). Charts: speedup over scalar vs. number of MPI nodes (linear scaling, Liszt, Joe) and wall-clock runtime in seconds on a log scale (Liszt vs. Joe).

  27. GPU performance: scaling the mesh size from 50K cells (unit-sized) to 750K (16x) on a Tesla C2050, compared against single-threaded runtime on the host CPU (Core 2 Quad, 2.66 GHz). Chart: speedup over scalar vs. problem size, in single and double precision. Single precision: 31.5x; double precision: 28x.
