The MareIncognito Project
Jesus Labarta
Director, Computer Sciences Research Dept., BSC
Jesus Labarta. Keynote @ Scicomp 2009

Objective
• Design a 10+ Petaflops Supercomputer for 2010-11
• Cooperation
• Spanish position with PRACE
[Figure: PRACE ecosystem diagram with tier 0, tier 1, Principal Partners, General Partners, GENCI]

Mare Incognito
• “Homogeneous” supercomputer based on the Cell processor
• We believe it is possible to build “cheap”, “efficient”, “not application specific” and “widely applicable” machines
• We know it is risky
• We have “a vision” of relevant technologies to develop
  – Many of them are not Cell specific and will be evaluated for other architectures

The Opportunity
• To integrate all research lines within BSC and to increase our cooperation with IBM
• Influence the design and use of supercomputers in the future

MareIncognito: Project structure
• Applications
  – 4 relevant apps:
    Materials: SIESTA
    Geophysics imaging: RTM
    Comp. Mechanics: ALYA
    Plasma: EUTERPE
  – General kernels
• Programming models
  – StarSs: CellSs, SMPSs
  – OpenMP@Cell
  – OpenMP++
  – MPI + OpenMP/StarSs
• Load balancing
  – Coordinated scheduling: Run time, Process, Job
  – Power efficiency
• Performance analysis tools
  – Automatic analysis
  – Coarse/fine grain prediction
  – Sampling
  – Clustering
  – Integration with Peekperf
• Models and prototype
• Processor and node
  – Contribution to new Cell design
  – Support for programming model
  – Support for load balancing
  – Support for performance tools
  – Issues for future processors
• Interconnect
  – Contention
  – Collectives
  – Overlap computation/communication

Vision, work and findings
• General Concerns:
  – Heterogeneous / hierarchical / dynamic trend
  – Memory
  – Variance
  – Are we overdimensioning our systems?
  – Globalization: Holistic design, Butterfly effect
• Workpackages
  – Programming models
  – Node design
  – Load balance
  – Interconnects

Heterogeneous, hierarchical and dynamic environment
• Foreseeable plethora of architectures
  – Thick nodes
  – Driven by what can be done
• Heterogeneous
  – Functionality
  – Performance
  – On purpose
  – Result of manufacturing process
  – Result of infrastructure construction process (~no cathedrals in pure architectonic style)
• Hierarchical
  – Can not provide flat uniform view (latency and bandwidth)
  – Hierarchical domains support different granularities
• Dynamic
  – Application characteristics
  – Workload
  – Resource allocation practices

Memory: more than a wall
• Performance:
  – Latency
  – Bandwidth
• Cost
• Power
• Capacity
  – Real usage < 40% ?
  – Accelerator model → 2x ?
• Main component/nightmare of programming model
[Figure: Processor-DRAM Gap (latency). Performance (log scale, 1 to 1000) of CPU and DRAM, 1980 to 2000. Source: D.A. Patterson, “New directions in Computer Architecture”, Berkeley, June 1998]

Fighting variance: a lost battle
• We are going to experience huge variability
  – In resources (availability, performance) and usage needs
  – In space and time
• Better learn how to tolerate it
• How to face it
  – Adaptive/Dynamic resource management
  – Load balance
  – Asynchronism
[Figure stolen from: V. Salapura, “Scaling up next Generation Supercomputers”, UPC]

Are our designs an overkill?
• Are we using resources efficiently?
• Resources:
  – Processors
  – Memory
  – Interconnect
  – Energy
“To kill flies with guns”

The butterfly effect
• Sensitivity to initial conditions
• Huge impacts of small causes
• High non linearities with accumulative effects
“Does the flap of a butterfly’s wings in Brazil set off a tornado in Texas?”

Globalization: Holistic approach
• Can we develop a unified theory/model? Nicely integrate all levels and experiences?
• How do we ensure coordination/cooperation between levels at run time?
[Figure: cloud of interacting concerns: Latency, Dependences, Malleability, Address spaces, Off chip Bandwidth, Tools, Asynchronism, Portability, Resilience, Locality, I/O, Network contention, Scalability, Programmability, Power, Memory usage, Replication, Load Balance, Algorithms. Callouts: “Can you imagine how would it be if there was no distance, if everything was here?” “Yes, he can!”]

Programming model

Back to Babel?
Book of Genesis:
“Now the whole earth had one language and the same words” …
… “Come, let us make bricks, and burn them thoroughly.” …
… “Come, let us build ourselves a city, and a tower with its top in the heavens, and let us make a name for ourselves” …
And the LORD said, “Look, they are one people, and they have all one language; and this is only the beginning of what they will do; nothing that they propose to do will now be impossible for them. Come, let us go down, and confuse their language there, so that they will not understand one another's speech.”
The computer age:
• Fortran & MPI
• ++
• Cilk++, Fortress, X10, CUDA, Sisal, HPF, RapidMind, StarSs, Sequoia, OpenMP, CAF, ALF, UPC, SDK, Chapel, MPI

Programming model for MareIncognito?
• How much pressure can we put on our (BSC) program developers?
  – Is effort worthwhile? “long term”?
  – Only once porting effort
    Can not keep spending 6 months and increasing by 15% the number of lines for each new target machine
  – Need smooth transition path
    Can not fire our programmers and they are often not very flexible
  – Can afford work on “beneficial” transformations
    Blocking
    Better understanding of inputs and outputs
    Understanding potential asynchrony
• Evolution towards mixed mode
  – MPI+OpenMP, MPI+StarSs, MPI+StarSs+OpenCL, …
  – Matches hierarchy in architectures, algorithmic, …
  – Other potential benefits: load balance.
[Figure: project structure diagram: Applications, Programming models, Load balancing, Performance tools, Models and prototype, Processor and node, Interconnect]

A perspective on architectures and programming models
Mapping of concepts, by granularity:
  ns                      | 100 useconds     | minutes/hours (Grid)
  Instructions            | Block operations | Full binary
  Functional units        | SPUs             | machines
  Fetch & decode unit     | PPE              | home machine
  Registers (name space)  | Main memory      | Files
  Registers (storage)     | SPU memory       | Files
• Stay sequential
• Just look at things from a bit further away
• Architects do know how to run parallel

StarSs
∗Ss: GridSs, CellSs, GPUSs, SMPSs
• Programmability
  – Standard sequential look and feel (C, Fortran)
  – Incremental parallelization/restructure
  – Abstract/separate algorithmic issues from resources
  – Methodology/practices
    Block algorithms: modularity
    “No” side effects: local addressing
    Promote visibility of “Main” data
    Explicit synchronization variables
• Portability
  – Runtime for each type of target platform.
    Matches computations to resources
    Achieves “decent” performance
  – Even to sequential platform
  – Single source for maintained version of an application
• Performance
  – Runtime intelligence

Cell superscalar (CellSs)
• Directives to define tasks in sequential block algorithm
• Automatic parallelism exploitation at run time

    int main(){
      …
      for (i=0; i < N; i++)
        for (j=0; j < N; j++)
          for (k=0; k < N; k++)
            block_addmultiply( C[i][j], A[i][k], B[k][j]);
      ...
    }

    #pragma css task input(A, B) inout(C)
    static void block_addmultiply( float C[BS][BS], float A[BS][BS], float B[BS][BS]) {
      …
      for (i=0; i < BS; i++)
        for (j=0; j < BS; j++)
          for (k=0; k < BS; k++)
            C[i][j] += A[i][k] * B[k][j];
    }

CellSs execution model
[Figure: CellSs execution model]
• PPU (main thread and helper thread): user main program; CellSs PPU lib: data dependence, data renaming, scheduling, work assignment, synchronization, finalization signal
• SPU 0, SPU 1, SPU 2 (SPE threads): CellSs SPU lib: DMA in, task execution, DMA out, synchronization; original task code
• Memory: user data, task graph, tasks, renaming, stage in/out data
• Analogy with a superscalar pipeline (IFU, DEC, REN, IQ, ISS, REG, FU, RET), with its stages attributed to the main thread and the helper thread
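
The data dependence step of the execution model can be illustrated with a minimal sketch. It is not the CellSs runtime: every name below (`graph_t`, `submit_task`, `build_matmul_graph`, `count_edges`, the capacity constants) is hypothetical, and one byte stands in for each BS x BS block. The sketch assumes that only true (read-after-write) dependences need to be tracked, on the assumption that data renaming removes the others. It replays the i, j, k loop nest from the slide and records, for each task, which earlier task last wrote one of its arguments.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_TASKS  64          /* hypothetical capacity, enough for n <= 4 */
#define MAX_BLOCKS MAX_TASKS   /* each task writes at most one block */

/* One task instance, mirroring input(A, B) inout(C) from the slide. */
typedef struct {
    const void *in[2];   /* blocks that are only read */
    void *inout;         /* block that is read and then written */
    int deps[3];         /* earlier tasks this one must wait for */
    int ndeps;
} task_t;

typedef struct {
    task_t tasks[MAX_TASKS];
    int ntasks;
    const void *block[MAX_BLOCKS];  /* blocks written so far */
    int last_writer[MAX_BLOCKS];    /* task that last wrote each block */
    int nblocks;
} graph_t;

void graph_init(graph_t *g) { g->ntasks = 0; g->nblocks = 0; }

/* Index of the task that last wrote block p, or -1 if no task did. */
static int last_writer_of(const graph_t *g, const void *p) {
    for (int i = 0; i < g->nblocks; i++)
        if (g->block[i] == p) return g->last_writer[i];
    return -1;
}

static void record_writer(graph_t *g, const void *p, int t) {
    for (int i = 0; i < g->nblocks; i++)
        if (g->block[i] == p) { g->last_writer[i] = t; return; }
    g->block[g->nblocks] = p;
    g->last_writer[g->nblocks] = t;
    g->nblocks++;
}

static void add_dep(task_t *t, int producer) {
    if (producer < 0) return;                 /* block never written: no edge */
    for (int i = 0; i < t->ndeps; i++)
        if (t->deps[i] == producer) return;   /* edge already recorded */
    assert(t->ndeps < 3);                     /* three arguments, three edges at most */
    t->deps[t->ndeps++] = producer;
}

/* Submit one task; returns its index, or -1 when the table is full. */
int submit_task(graph_t *g, const void *a, const void *b, void *c) {
    if (g->ntasks == MAX_TASKS) return -1;
    int id = g->ntasks++;
    task_t *t = &g->tasks[id];
    t->in[0] = a; t->in[1] = b; t->inout = c; t->ndeps = 0;
    add_dep(t, last_writer_of(g, a));
    add_dep(t, last_writer_of(g, b));
    add_dep(t, last_writer_of(g, c));
    record_writer(g, c, id);   /* this task becomes the last writer of c */
    return id;
}

/* Replay the slide's loop nest for n x n blocks (1 <= n <= 4).
   Returns the number of tasks submitted, or -1 for an unsupported n. */
int build_matmul_graph(graph_t *g, int n) {
    static char A[4][4], B[4][4], C[4][4];   /* one byte per block */
    if (n < 1 || n > 4) return -1;
    graph_init(g);
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            for (int k = 0; k < n; k++)
                submit_task(g, &A[i][k], &B[k][j], &C[i][j]);
    return g->ntasks;
}

/* Total number of dependence edges in the graph. */
int count_edges(const graph_t *g) {
    int edges = 0;
    for (int t = 0; t < g->ntasks; t++) edges += g->tasks[t].ndeps;
    return edges;
}
```

Calling `build_matmul_graph(&g, 2)` submits 8 tasks and records 4 edges: for each block of C, the k = 1 task waits for the k = 0 task, while tasks that update different blocks of C have no edges between them and could run on different SPUs.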