Simple, Efficient, Portable Decomposition of Large Data Sets
William Lundgren (wlundgren@gedae.com, Gedae), David Erb (IBM), Max Aguilar (IBM), Kerry Barnes (Gedae), James Steed (Gedae)
HPEC 2008
Introduction
The study of High Performance Computing is the study of
  - How to move data into fast memory
  - How to process data when it is there
Multicores like Cell/B.E. and Intel Core2 have hierarchical memories
  - Small, fast memories close to the SIMD ALUs
  - Large, slower memories offchip
Processing large data sets requires decomposition
  - Break data into pieces small enough for the local storage
  - Stream pieces through using multibuffering
Cell/B.E. Memory Hierarchy
Each SPE core has a 256 kB local storage
Each Cell/B.E. chip has a large system memory
[Figure: Cell/B.E. chip. SPE cores, each with its own local storage (LS), sit on the EIB together with the PPE, the bridge, and system memory (SYSMEM). A further label reads "Duplicate or heterogeneous subsystems".]
Intel Quad Core Memory Hierarchy
Caching on Intel and other SMP multicores also creates memory hierarchy
[Figure: four cores on a system bus, each with instruction units, schedulers, load/store units, ALUs, and an L1 cache.]
Optimization of Data Movement
Optimize data movement using software
  - Upside
      - Higher performance possibilities
  - Downside
      - Complexity beyond the reach of many programmers
In analogy, introduction of Fortran and C
  - The CPU was beyond the reach of many potential software developers
  - Fortran and C provide automatic compilation to assembly
  - Spurred the industry
Multicores require the introduction of fundamentally new automation.
Gedae Background
We can understand the problem by considering the guiding principles of automation that effectively addresses the problem.
Structure of Gedae
[Figure: block diagram with the labels Developer, Functional Model, Implementation Specification, Analysis Tools, Compiler, Hardware Model, Threaded Application, Thread Manager, and SW / HW System.]
Guiding Principle for Evolution of Multicore SW Development Tools
[Figure: diagram with the labels Functional model, Architecture-specific details, Compiler, Complexity, Libraries, and Implementation specification.]
Language - Invariant Functionality
Functionality must be free of implementation policy
  - C and Fortran freed the programmer from specifying details of moving data between memory, registers, and ALU
  - Extend this to multicore parallelism and memory structure
The invariant functionality does not include multicore concerns like
  - Data decomposition/tiling
  - Thread and task parallelism
Functionality must be easy to express
  - Scientists and engineers want a thinking tool
Functional expressiveness must be complete
  - Some algorithms are hard if the language is limited
Language Features for Expressiveness and Invariance
Stream data (time based data) *
Stream segments with software reset on segment boundaries *
Persistent data: extends from state* to databases ‡
Algebraic equations (HLL most similar to Mathcad) ‡
Conditionals †
Iteration ‡
State behavior †
Procedural *
* These are mature language features
† These are currently directly supported in the language but will continue to evolve
‡ Support for directly expressing algebraic equations and iteration, while possible to implement in the current tool, will be added to the language and compiler in the next major release. Databases will be added soon after.
Library Functions
Black box functions hide essential functionality from compiler
Library is a vocabulary with an implementation
    conv(float *in, float *out, int R, int C,
         float *kernel, int KR, int KC);
Algebraic language is a specification
    range i=0..R-1, j=0..C-1, i1=0..KR-1, j1=0..KC-1;
    out[i][j] += in[i+i1][j+j1] * kernel[i1][j1];
Other examples:
    As[i][j] += B[i+i1][j+j1];     /* kernel of ones */
    Ae[i][j] |= B[i+i1][j+j1];     /* erosion */
    Am[i][j] = As[i][j] > (Kz/2);  /* majority operation */
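To make the contrast between the two forms concrete, the algebraic specification above can be read as four nested loops. The following is a minimal sketch in C, not Gedae's library code. The name conv_ref, the void return type, the row-major layout, and the input size of (R+KR-1) x (C+KC-1) are all assumptions made so that in[i+i1][j+j1] stays in bounds.

```c
#include <stddef.h>

/* Hypothetical reference reading of the slide's range expression.
 * Assumptions (not from the slides): row-major storage, out is R x C,
 * in is (R+KR-1) x (C+KC-1), and out is accumulated into, matching "+=". */
void conv_ref(const float *in, float *out, int R, int C,
              const float *kernel, int KR, int KC)
{
    const int in_cols = C + KC - 1;  /* assumed row length of the input */

    for (int i = 0; i < R; i++)
        for (int j = 0; j < C; j++)
            for (int i1 = 0; i1 < KR; i1++)
                for (int j1 = 0; j1 < KC; j1++)
                    out[(size_t)i * C + j] +=
                        in[(size_t)(i + i1) * in_cols + (j + j1)] *
                        kernel[(size_t)i1 * KC + j1];
}
```

The loop order and memory layout chosen here are exactly the kind of implementation policy that the algebraic form leaves open. A compiler that sees the range expression is free to reorder or tile these loops, whereas a call to a black box conv() fixes them.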
Library Functions
A simple example of hiding essential functionality is tile extraction from a matrix
  - Software structure changes based on data size and target architecture
  - Library hides implementation from developer and compiler
[Figure: two options for processing a tile of an image held in system memory.
 Option A: Image in System Memory -> CPU Data Reorg -> Contiguous Tile in PPE cache -> Process Tile -> ...Back to System Memory.
 Option B: Image in System Memory -> Transfer Data Reorg -> Contiguous Tile in SPE LS -> Process Tile -> ...Back to System Memory.]
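Tile extraction itself can be sketched in a few lines of C. This is an illustration only: the function names, the parameters, and the row-major layout are hypothetical, and plain memcpy stands in for either the CPU reorganization of Option A or the DMA transfer of Option B.

```c
#include <stddef.h>
#include <string.h>

/* Copy a tile_rows x tile_cols tile whose top-left corner is (row0, col0)
 * out of a row-major image into a contiguous buffer.
 * All names here are hypothetical; memcpy stands in for the real transfer. */
void extract_tile(const float *image, int image_cols, int row0, int col0,
                  int tile_rows, int tile_cols, float *tile)
{
    for (int r = 0; r < tile_rows; r++)
        memcpy(tile + (size_t)r * tile_cols,
               image + (size_t)(row0 + r) * image_cols + col0,
               (size_t)tile_cols * sizeof(float));
}

/* Inverse step: write a processed tile back into the image. */
void insert_tile(float *image, int image_cols, int row0, int col0,
                 int tile_rows, int tile_cols, const float *tile)
{
    for (int r = 0; r < tile_rows; r++)
        memcpy(image + (size_t)(row0 + r) * image_cols + col0,
               tile + (size_t)r * tile_cols,
               (size_t)tile_cols * sizeof(float));
}
```

Each tile row is contiguous in the image but consecutive rows are image_cols elements apart, which is why the copy is done row by row. On a target with a DMA engine the same access pattern would typically be described as a list of per-row transfers rather than a loop of copies.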
Features Added to Increase Automation of Example Presented at HPEC 2007
New Features
New language features and compiler functionality provide increased automation of hierarchical memory management
Language features
  - Tiled dimensions
  - Iteration
  - Pointer port types
Compiler functions
  - Application of stripmining to iteration
  - Inclusion of close-to-the-hardware List DMA to get/put tiles
  - Multibuffering
  - Accommodation of memory alignment requirements of SPU and DMA
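As a rough illustration of what stripmining an iteration means, the sketch below splits one long loop into strips that fit a small local buffer, with a copy in and a copy out around each strip. It is hand-written C under stated assumptions, not output of the Gedae compiler: STRIP_LEN, ALIGN_BYTES, and all function names are hypothetical, memcpy stands in for the DMA get/put, and multibuffering is left out for brevity.

```c
#include <stddef.h>
#include <string.h>

enum { STRIP_LEN = 1024 };  /* hypothetical strip size, in floats */
enum { ALIGN_BYTES = 16 };  /* assumed alignment granularity, in bytes */

/* Round n up to the next multiple of align (align must be nonzero). */
size_t round_up(size_t n, size_t align)
{
    return (n + align - 1) / align * align;
}

/* Number of strips needed to cover n elements. */
size_t num_strips(size_t n)
{
    return (n + STRIP_LEN - 1) / STRIP_LEN;
}

/* Stripmined form of: for (k = 0; k < n; k++) out[k] = gain * in[k]; */
void scale_stripmined(const float *in, float *out, size_t n, float gain)
{
    float local[STRIP_LEN];  /* stands in for the fast local memory */

    for (size_t start = 0; start < n; start += STRIP_LEN) {
        size_t len = (n - start < STRIP_LEN) ? n - start : STRIP_LEN;

        memcpy(local, in + start, len * sizeof(float));   /* "get" the strip */
        for (size_t k = 0; k < len; k++)
            local[k] *= gain;                             /* process in place */
        memcpy(out + start, local, len * sizeof(float));  /* "put" the strip */
    }
}
```

A transfer size that must be a multiple of the alignment could be computed as round_up(len * sizeof(float), ALIGN_BYTES). A multibuffered version would keep two or more local buffers so that the get for the next strip overlaps processing of the current one.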
Matrix Multiplication Algorithm