
PVTOL: Designing Portability, Productivity and Performance for Multicore Architectures
Hahn Kim, Nadya Bliss, Jim Daly, Karen Eng, Jeremiah Gale, James Geraci, Ryan Haney, Jeremy Kepner, Sanjeev Mohindra, Sharon Sacco, Eddie Rutledge
HPEC 2008


  1. PVTOL: Designing Portability, Productivity and Performance for Multicore Architectures
  Hahn Kim, Nadya Bliss, Jim Daly, Karen Eng, Jeremiah Gale, James Geraci, Ryan Haney, Jeremy Kepner, Sanjeev Mohindra, Sharon Sacco, Eddie Rutledge
  HPEC 2008, 25 September 2008, MIT Lincoln Laboratory
  This work is sponsored by the Department of the Air Force under Air Force contract FA8721-05-C-0002. Opinions, interpretations, conclusions and recommendations are those of the author and are not necessarily endorsed by the United States Government.

  2. Outline
  • Background
    – Motivation
    – Multicore Processors
    – Programming Challenges
  • Tasks & Conduits
  • Maps & Arrays
  • Results
  • Summary

  3. SWaP* for Real-Time Embedded Systems
  • Modern DoD sensors continue to increase in fidelity and sampling rates
  • Real-time processing will always be a requirement
  [Figure: sensor platforms with decreasing SWaP, from the U-2 to the Global Hawk]
  Modern sensor platforms impose tight SWaP requirements on real-time embedded systems.
  * SWaP = Size, Weight and Power

  4. Embedded Processor Evolution
  [Figure: MFLOPS / Watt vs. year (1990-2010) for high performance embedded processors, from the i860 and SHARC through the PowerPC family (603e, 750, MPC7400, MPC7410, MPC7447A, PowerPC with AltiVec) to the Cell, PowerXCell 8i and GPUs; a callout notes that the PowerXCell 8i is a multicore processor with 1 PowerPC core and 8 SIMD cores]
  • 20 years of exponential growth in FLOPS / W
  • Must switch architectures every ~5 years
  • Current high performance architectures are multicore
  Multicore processors help achieve performance requirements within tight SWaP constraints.

  5. Parallel Vector Tile Optimizing Library
  • PVTOL is a portable and scalable middleware library for multicore processors
  • Enables a unique software development process for real-time signal processing applications, spanning desktop, cluster and embedded computers:
    1. Develop serial code
    2. Parallelize code
    3. Deploy code
    4. Automatically parallelize code
  Make parallel programming as easy as serial programming.

  6. PVTOL Architecture
  • Tasks & Conduits: concurrency and data movement
  • Maps & Arrays: distribute data across processor and memory hierarchies
  • Functors: abstract computational kernels into objects
  Portability: runs on a range of architectures. Performance: achieves high performance. Productivity: minimizes effort at the user level.

  7. Outline
  • Background
  • Tasks & Conduits
  • Maps & Arrays
  • Results
  • Summary

  8. Multicore Programming Challenges
  Inside the box (desktop, embedded board):
  • Threads
    – Pthreads
    – OpenMP
  • Shared memory
    – Pointer passing
    – Mutexes, condition variables
  Outside the box (cluster, embedded multicomputer):
  • Processes
    – MPI (MPICH, Open MPI, etc.)
    – Mercury PAS
  • Distributed memory
    – Message passing
  PVTOL provides consistent semantics for both multicore and cluster computing.

  9. Tasks & Conduits
  • Tasks provide concurrency
    – Collection of 1+ threads in 1+ processes
    – Tasks are SPMD, i.e. each thread runs the task code
  • Task Maps specify locations of Tasks
  • Conduits move data
    – Safely move data
    – Multibuffering
    – Synchronization
  [Diagram: the DIT reads data from disk via load(B) and calls cdt1.write(B); the DAT calls cdt1.read(B), computes A = B, and calls cdt2.write(A); the DOT calls cdt2.read(A) and saves the result via save(A)]
  DIT: read data from source (1 thread). DAT: process data (4 threads). DOT: output results (1 thread). Conduits connect the DIT to the DAT and the DAT to the DOT.
  * DIT – Data Input Task, DAT – Data Analysis Task, DOT – Data Output Task

  10. Pipeline Example: DIT-DAT-DOT
  The main function creates the tasks, connects them with conduits and launches the task computation (the dit task feeds conduit ab, dat reads ab and feeds bc, and dot reads bc):

  int main(int argc, char** argv)
  {
    // Create maps (omitted for brevity)
    ...

    // Create the tasks
    Task<Dit> dit("Data Input Task", ditMap);
    Task<Dat> dat("Data Analysis Task", datMap);
    Task<Dot> dot("Data Output Task", dotMap);

    // Create the conduits
    Conduit<Matrix <double> > ab("A to B Conduit");
    Conduit<Matrix <double> > bc("B to C Conduit");

    // Make the connections
    dit.init(ab.getWriter());
    dat.init(ab.getReader(), bc.getWriter());
    dot.init(bc.getReader());

    // Complete the connections
    ab.setupComplete();
    bc.setupComplete();

    // Launch the tasks
    dit.run();
    dat.run();
    dot.run();

    // Wait for tasks to complete
    dit.waitTillDone();
    dat.waitTillDone();
    dot.waitTillDone();
  }

  11. Pipeline Example: Data Analysis Task (DAT)
  Tasks read and write data using Reader and Writer interfaces to Conduits; Readers and Writers provide handles to data buffers:

  class Dat
  {
  private:
    Conduit<Matrix <double> >::Reader m_Reader;
    Conduit<Matrix <double> >::Writer m_Writer;

  public:
    void init(Conduit<Matrix <double> >::Reader& reader,
              Conduit<Matrix <double> >::Writer& writer)
    {
      // Get data reader for the conduit
      reader.setup(tr1::Array<int, 2>(ROWS, COLS));
      m_Reader = reader;

      // Get data writer for the conduit
      writer.setup(tr1::Array<int, 2>(ROWS, COLS));
      m_Writer = writer;
    }

    void run()
    {
      Matrix <double>& B = m_Reader.getData();
      Matrix <double>& A = m_Writer.getData();
      A = B;
      m_Reader.releaseData();
      m_Writer.releaseData();
    }
  };
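  The slides never show the Dit and Dot classes that main() instantiates. A minimal sketch, assuming only the Reader/Writer interface demonstrated by Dat, could look like the following; loadFromSource and saveResults are hypothetical helpers standing in for the load(B) and save(A) steps shown on the Tasks & Conduits slide:

  // Hypothetical sketch only: Dit and Dot are not shown in the slides.
  // It reuses exactly the Reader/Writer calls demonstrated by Dat.
  void loadFromSource(Matrix <double>& B);   // hypothetical: fill B from disk/sensor
  void saveResults(const Matrix <double>& A); // hypothetical: write A to disk

  class Dit
  {
  private:
    Conduit<Matrix <double> >::Writer m_Writer;

  public:
    void init(Conduit<Matrix <double> >::Writer& writer)
    {
      writer.setup(tr1::Array<int, 2>(ROWS, COLS));
      m_Writer = writer;
    }

    void run()
    {
      Matrix <double>& B = m_Writer.getData();
      loadFromSource(B);       // fill the conduit's buffer
      m_Writer.releaseData();  // hand the buffer to the DAT via the conduit
    }
  };

  class Dot
  {
  private:
    Conduit<Matrix <double> >::Reader m_Reader;

  public:
    void init(Conduit<Matrix <double> >::Reader& reader)
    {
      reader.setup(tr1::Array<int, 2>(ROWS, COLS));
      m_Reader = reader;
    }

    void run()
    {
      Matrix <double>& A = m_Reader.getData();
      saveResults(A);          // consume the result
      m_Reader.releaseData();  // return the buffer to the conduit
    }
  };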

  12. Outline
  • Background
  • Tasks & Conduits
  • Maps & Arrays
    – Hierarchy
    – Functors
  • Results
  • Summary

  13. Map-Based Programming
  • A map is an assignment of blocks of data to processing elements
  • Maps have been demonstrated in several technologies:

  Technology               Organization  Language  Year
  Parallel Vector Library  MIT-LL*       C++       2000
  pMatlab                  MIT-LL        MATLAB    2003
  VSIPL++                  HPEC-SI†      C++       2006

  [Diagram: three example maps, each with grid: 1x2 and procs: 0:1 but with dist: block, block-cyclic and cyclic, mapped onto a two-processor cluster (Proc 0, Proc 1). The grid specification together with the processor list describes where data are distributed; the distribution specification describes how data are distributed.]
  * MIT Lincoln Laboratory
  † High Performance Embedded Computing Software Initiative
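  The main() on the Pipeline Example slide omits its map definitions, and the slides do not show PVTOL's actual map API. As a concept-level sketch only, with hypothetical stand-in names (DataMapSpec, makeBlockMap), the grid / distribution / processor-list triple described above could be captured like this:

  // Illustrative sketch: DataMapSpec is a hypothetical stand-in for a
  // PVTOL map, not the library's documented API.
  #include <vector>

  struct DataMapSpec
  {
    int gridRows, gridCols;                          // grid:  e.g. 1x2
    enum Dist { BLOCK, CYCLIC, BLOCK_CYCLIC } dist;  // dist:  how data are distributed
    std::vector<int> procs;                          // procs: where data are distributed
  };

  // Equivalent of the left-hand map on the slide:
  // grid: 1x2, dist: block, procs: 0:1
  DataMapSpec makeBlockMap()
  {
    DataMapSpec map;
    map.gridRows = 1;
    map.gridCols = 2;
    map.dist = DataMapSpec::BLOCK;
    map.procs.push_back(0);   // processor list: ranks 0 and 1
    map.procs.push_back(1);
    return map;
  }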

  14. PVTOL Machine Model
  • Processor Hierarchy
    – Processor: scheduled by OS
    – Co-processor: dependent on processor for program control
  • Memory Hierarchy
    – Each level in the processor hierarchy can have its own memory
  [Diagram: a CELL cluster (CELL 0, CELL 1, each with SPE 0, SPE 1, ...) alongside the memory hierarchy: disk, remote processor memory, local processor memory, cache / local co-processor memory, registers; data moves between levels via read/write data and read/write register operations]
  PVTOL extends maps to support hierarchy.

  15. PVTOL Machine Model
  • Processor Hierarchy
    – Processor: scheduled by OS
    – Co-processor: dependent on processor for program control
  • Memory Hierarchy
    – Each level in the processor hierarchy can have its own memory
  [Diagram: the same model on an x86 cluster (x86/PPC 0, x86/PPC 1, each with GPU / FPGA 0, GPU / FPGA 1, ...) and the same memory hierarchy: disk, remote processor memory, local processor memory, cache / local co-processor memory, registers]
  Semantics are the same across different architectures.

  16. Hierarchical Maps and Arrays
  • PVTOL provides hierarchical maps and arrays
  • Hierarchical maps concisely describe data distribution at each level
  • Hierarchical arrays hide details of the processor and memory hierarchy
  [Diagram: the same array laid out serially (single PPC), in parallel (PPC cluster: grid: 1x2, dist: block, procs: 0:1 across PPC 0 and PPC 1), and hierarchically (CELL cluster: grid: 1x2, dist: block, procs: 0:1 across CELL 0 and CELL 1, then grid: 1x2, dist: block across SPE 0 and SPE 1, with block: 1x2 tiles in each SPE's local store)]
  Program Flow (a concept sketch follows this list):
  1. Define a Block
     • Data type, index layout (e.g. row-major)
  2. Define a Map for each level in the hierarchy
     • Grid, data distribution, processor list
  3. Define an Array for the Block
  4. Parallelize the Array with the Hierarchical Map (optional)
  5. Process the Array
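  The slides do not show the PVTOL calls behind these five steps, so the following is only an illustrative, self-contained sketch of the idea; Block and LevelMap are hypothetical stand-ins, and the "processing" step simply reports which columns each leaf of the hierarchy would own under the grid 1x2 / dist: block layout pictured above:

  // Concept sketch of the five-step flow; not the PVTOL API.
  #include <cstdio>

  struct Block    { int rows, cols; };          // Step 1: data type / layout (row-major assumed)
  struct LevelMap { int gridRows, gridCols; };  // Step 2: block distribution at one hierarchy level

  int main()
  {
    Block    block   = { 4, 8 };   // Step 3: an array built on this block
    LevelMap cellMap = { 1, 2 };   // cluster level: grid 1x2 over CELL 0:1
    LevelMap speMap  = { 1, 2 };   // SPE level:     grid 1x2 over SPE 0:1

    // Step 4: "parallelize" by computing each leaf's column range.
    // Step 5: "process" the array, here just by reporting ownership.
    int colsPerCell = block.cols / cellMap.gridCols;
    int colsPerSpe  = colsPerCell / speMap.gridCols;
    for (int cell = 0; cell < cellMap.gridCols; ++cell) {
      for (int spe = 0; spe < speMap.gridCols; ++spe) {
        int first = cell * colsPerCell + spe * colsPerSpe;
        printf("CELL %d / SPE %d owns columns %d-%d (all %d rows)\n",
               cell, spe, first, first + colsPerSpe - 1, block.rows);
      }
    }
    return 0;
  }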
