 
Kunle Olukotun
Pervasive Parallelism Laboratory
Stanford University
 Unleash full power of future computing platforms
 Make parallel application development practical for the masses (Joe the programmer)
 Parallel applications without parallel programming
 Heterogeneous parallel hardware
   Computing is energy constrained
   Specialization: energy- and area-efficient parallel computing
 Must hide low-level issues from most programmers
   Explicit parallelism won’t work (10K–100K threads)
   Only way to get simple and portable programs
 No single discipline can solve all problems
   Apps, PLs, runtime, OS, architecture
   Need vertical integration
   Hanrahan, Aiken, Rosenblum, Kozyrakis, Horowitz
Heterogeneous HW for energy efficiency
 Multi-core, ILP, threads, data-parallel engines, custom engines
 H.264 encode study
[Chart: performance and energy savings (log scale, 1–1000) for 4 cores + ILP + SIMD + custom inst → ASIC]
Source: Understanding Sources of Inefficiency in General-Purpose Chips (ISCA’10)
Driven by energy efficiency: Sun T2, Nvidia Fermi, Altera FPGA, Cray Jaguar
 Sun T2: Pthreads, OpenMP
 Nvidia Fermi: CUDA, OpenCL
 Altera FPGA: Verilog, VHDL
 Cray Jaguar: MPI
Applications: Scientific Engineering, Virtual Worlds, Personal Robotics, Data Informatics
 Sun T2: Pthreads, OpenMP
 Nvidia Fermi: CUDA, OpenCL
 Altera FPGA: Verilog, VHDL
 Cray Jaguar: MPI
Too many different programming models
It is possible to write one program and run it on all these machines
Applications: Scientific Engineering, Virtual Worlds, Personal Robotics, Data Informatics
⇓
Ideal Parallel Programming Language
⇓
Pthreads, OpenMP (Sun T2); CUDA, OpenCL (Nvidia Fermi); Verilog, VHDL (Altera FPGA); MPI (Cray Jaguar)
One language in place of too many different programming models
[Triangle: Performance – Productivity – Generality, the design space for parallel programming languages]
Domain Specific Languages sit between Performance and Productivity in the triangle, trading away Generality
 Domain Specific Languages (DSLs)
   Programming language with restricted expressiveness for a particular domain
   High-level and usually declarative
Productivity
• Shield average programmers from the difficulty of parallel programming
• Focus on developing algorithms and applications, not on low-level implementation details
Performance
• Match high-level domain abstraction to generic parallel execution patterns
• Restrict expressiveness to more easily and fully extract available parallelism
• Use domain knowledge for static/dynamic optimizations
Portability and forward scalability
• DSL & runtime can be evolved to take advantage of the latest hardware features
• Applications remain unchanged
• Allows innovative HW without worrying about application portability
The research stack:
 Applications: Data Informatics, Scientific Engineering, Virtual Worlds, Personal Robotics
 Domain Specific Languages: Physics (Liszt), Machine Learning (OptiML), Rendering, Data Analysis, Probabilistic (RandomT)
 Domain Embedding Language (Scala): polymorphic embedding, staging, static domain-specific optimization
 DSL Infrastructure: parallel runtime (Delite, Sequoia, GRAMPS), dynamic domain-specific optimization, task & data parallelism, locality-aware scheduling
 Heterogeneous Hardware Architecture: OOO cores, SIMD cores, threaded cores, specialized cores; programmable hierarchies, scalable coherence, isolation & atomicity, on-chip networks, pervasive monitoring
 We need to develop all of these DSLs
 Current DSL development methods are unsatisfactory
 Stand-alone DSLs
   Can include extensive optimizations
   Enormous effort to develop to a sufficient degree of maturity
     Actual compiler/optimizations
     Tooling (IDE, debuggers, …)
   Interoperation between multiple DSLs is very difficult
 Purely embedded DSLs ⇒ “just a library”
   Easy to develop (can reuse the full host language)
   Easier to learn the DSL
   Can combine multiple DSLs in one program
   Can share DSL infrastructure among several DSLs
   Hard to optimize using domain knowledge (see the sketch below)
   Target the same architecture as the host language
Need to do better
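To see why a purely embedded DSL is “just a library”, here is a hedged sketch (a hypothetical mini vector library, not one of our DSLs). The use site is declarative, but the host language evaluates eagerly: every intermediate result is materialized, and the library never sees the whole expression, so it cannot apply domain-specific optimizations or retarget the computation:

  // Hypothetical "just a library" embedded DSL for vectors (illustration only)
  case class Vec(data: Array[Double]) {
    def +(that: Vec) = Vec(data.zip(that.data).map { case (x, y) => x + y })
    def *(s: Double) = Vec(data.map(_ * s))
  }

  // Declarative use: no loops, no threads, no indices...
  def axpy(a: Double, x: Vec, y: Vec): Vec = x * a + y
  // ...but x * a allocates a full temporary vector before + y runs; the
  // library cannot fuse the two passes or offload them to a GPU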
 Goal: develop embedded DSLs that perform as well as stand-alone ones
 Intuition: general-purpose languages should be designed with DSL embedding in mind
Scala
 Mixes OO and FP paradigms
 Targets the JVM
 Expressive type system allows powerful abstraction
 Scalable language
 Stanford/EPFL collaboration on leveraging Scala for parallelism
 “Language Virtualization for Heterogeneous Parallel Computing”, Onward! 2010, Reno
A typical compiler pipeline: Lexer → Parser → Type checker → Analysis → Optimization → Code gen
 A stand-alone DSL implements everything itself
 A purely embedded DSL gets it all for free, but can’t change any of it
 Lightweight modular staging provides a hybrid approach: DSLs adopt the front-end from a highly expressive embedding language, but can customize the IR and participate in back-end phases (sketched below)
GPCE’10: “Lightweight Modular Staging: A Pragmatic Approach to Runtime Code Generation and Compiled DSLs”
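A minimal sketch of the staging idea (hypothetical simplified types, not the actual LMS API): operations on staged values of type Rep construct IR nodes instead of computing results, so DSL code written as ordinary Scala yields a program representation that a DSL compiler can analyze, optimize, and generate code from:

  // Staged expression IR (simplified)
  sealed trait Exp
  case class Const(x: Int) extends Exp
  case class Plus(a: Exp, b: Exp) extends Exp
  case class Times(a: Exp, b: Exp) extends Exp

  // A Rep is a staged value: its operators build IR nodes
  class Rep(val e: Exp) {
    def +(that: Rep): Rep = new Rep(Plus(e, that.e))
    def *(that: Rep): Rep = new Rep(Times(e, that.e))
  }
  object Rep { implicit def lift(x: Int): Rep = new Rep(Const(x)) }

  // Looks like ordinary Scala, but calling power(x, 3) builds the IR
  // x * (x * (x * Const(1))): the recursion on n unrolls at staging time
  def power(b: Rep, n: Int): Rep = if (n == 0) 1 else b * power(b, n - 1)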
Liszt and OptiML programs enter through the Scala Embedding Framework and the Delite Parallelism Framework, which share a layered intermediate representation: Base IR ⇒ Delite IR ⇒ DS (domain-specific) IR
 The framework provides a common IR that can be extended while still benefiting from generic parallelism analysis and optimization
 Delite extends the common IR with nodes that encode data-parallel execution patterns, enabling parallel analysis, optimization, and mapping
 Each DSL extends the most appropriate data-parallel nodes for its operations, and can then add domain-specific analysis and optimization (a minimal sketch follows)
 Code generation produces an execution graph, kernels (Scala, C, CUDA, MPI, Verilog, …), and data structures (arrays, trees, graphs, …)
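A hedged sketch of the IR-extension idea (simplified names, not the actual Delite class hierarchy): the framework defines generic data-parallel node shapes, and a DSL encodes each operation by extending the closest shape, inheriting parallel mapping and generic optimizations for free:

  // Generic data-parallel IR node provided by the framework (simplified)
  abstract class DeliteOp
  abstract class DeliteOpMap extends DeliteOp {
    def size: Int
    def func(i: Int): Double
    // One shared interpretation for every op of this shape; the real
    // framework instead generates kernels (Scala, C, CUDA, ...) from it
    def execute(): Array[Double] = Array.tabulate(size)(func)
  }

  // An OptiML-style vector addition encoded as a data-parallel map: the DSL
  // supplies only the domain semantics (size and per-element function)
  case class VectorPlus(a: Array[Double], b: Array[Double]) extends DeliteOpMap {
    def size = a.length
    def func(i: Int) = a(i) + b(i)
  }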
The Delite runtime maps the machine-agnostic DSL compiler output (execution graph, kernels, data structures) onto the machine configuration (SMP, GPU, cluster) for execution, given both application inputs and machine inputs
 Walk-time scheduling produces partial schedules
 Walk-time code generation produces fused, specialized kernels, with synchronization, to be launched on each resource (see the fusion sketch below)
 The run-time executor controls and optimizes execution: schedule dispatch, dynamic load balancing, memory management, lazy data transfers, kernel auto-tuning, fault tolerance
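To make “fused, specialized kernels” concrete, a hedged sketch (illustrative only, not Delite’s generated code) of the transformation: two element-wise operations that would each traverse the data are combined into one specialized loop, eliminating a temporary array and a second memory pass:

  val a = Array.fill(1 << 20)(1.0)
  val b = Array.fill(1 << 20)(2.0)

  // Unfused schedule: t = a + b, then r = t * 2 (two passes, one temporary).
  // Fused kernel for the same schedule:
  val r = new Array[Double](a.length)
  var i = 0
  while (i < a.length) {
    r(i) = (a(i) + b(i)) * 2.0 // both ops in a single pass, no temporary
    i += 1
  }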
 Solvers for mesh-based PDEs
   Complex physical systems
   Huge domains: millions of cells
   Example: unstructured Reynolds-averaged Navier–Stokes (RANS) solver (combustion, turbulence, fuel injection, transition, thermal)
 Goal: simplify the code of mesh-based PDE solvers
 Write once, run on any type of parallel machine
   From multi-cores and GPUs to clusters
 Minimal programming language
   Arithmetic, short vectors, functions, control flow
 Built-in mesh interface for arbitrary polyhedra
   Vertex, Edge, Face, Cell
   Optimized memory representation of the mesh
 Collections of mesh elements
   Element sets: faces(c:Cell), edgesCCW(f:Face)
 Mapping mesh elements to fields
   Fields: val vert_position = position(v)
 Parallelizable iteration
   forall statements: for( f <- faces(cell) ) { … }
A Liszt kernel combines a simple set comprehension, functions and function calls, mesh topology operators, and field data storage:

  for (edge <- edges(mesh)) {
    val flux = flux_calc(edge)
    val v0 = head(edge)
    val v1 = tail(edge)
    Flux(v0) += flux
    Flux(v1) -= flux
  }

The code contains possible write conflicts: two edges sharing a vertex may update the same Flux entry. We use architecture-specific strategies guided by domain knowledge:
 MPI: ghost-cell-based message passing
 GPU: coloring-based use of shared memory (sketched below)
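A hedged sketch of the coloring strategy (a generic greedy edge coloring, not Liszt’s actual implementation): edges sharing a vertex receive different colors, so all edges of one color can update their vertex fields in parallel without write conflicts, e.g. one GPU kernel launch per color class:

  // Greedy edge coloring: no two edges sharing a vertex get the same color
  def colorEdges(edges: Seq[(Int, Int)]): Map[Int, Seq[(Int, Int)]] = {
    val colorOf = scala.collection.mutable.Map[(Int, Int), Int]()
    val usedAt = scala.collection.mutable.Map[Int, Set[Int]]().withDefaultValue(Set.empty)
    for (e @ (v0, v1) <- edges) {
      val used = usedAt(v0) ++ usedAt(v1)
      val c = Iterator.from(0).find(k => !used(k)).get // smallest free color
      colorOf(e) = c
      usedAt(v0) += c
      usedAt(v1) += c
    }
    colorOf.toSeq.groupBy(_._2).map { case (c, es) => c -> es.map(_._1) }
  }
  // Edges within one color class touch disjoint vertices, so each class runs
  // in parallel; the classes themselves are processed one after another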
 Using 8 cores per node, scaling up to 96 cores (12 nodes, 8 cores per node, all communication using MPI)
[Charts: MPI speedup over scalar on a 750k-cell mesh, and MPI wall-clock runtime (seconds, log scale), both vs. number of MPI nodes; series: Linear Scaling, Liszt Scaling, Joe Scaling / Liszt Runtime, Joe Runtime]
 Scaling mesh size from 50K (unit-sized) cells to 750K (16x) on a Tesla C2050; comparison is against single-threaded runtime on the host CPU (Core 2 Quad, 2.66 GHz)
[Chart: GPU speedup over scalar vs. problem size, single- and double-precision]
Single-precision: 31.5x; double-precision: 28x