Roadrunner: What makes it tick? Los Alamos Computer Science - PowerPoint PPT Presentation

LA-UR-08-6246 Roadrunner: What makes it tick? Los Alamos Computer Science Symposium October 14, 2008 Ken Koch Roadrunner Technical Manager, Computer, Computational, and Statistical Sciences Division, Los Alamos National Laboratory Work


  1. LA-UR-08-6246
     Roadrunner: What makes it tick?
     Los Alamos Computer Science Symposium, October 14, 2008
     Ken Koch, Roadrunner Technical Manager, Computer, Computational, and Statistical Sciences Division, Los Alamos National Laboratory
     Work presented was performed by a large team of Roadrunner project staff!
     Operated by the Los Alamos National Security, LLC for the DOE/NNSA
     IBM Confidential

  2. The messages this talk will convey are:
     • Why Roadrunner? Why Cell?
       • A bold but important step toward the future
     • What does Roadrunner look like?
       • Cluster-of-clusters with node-attached Cells
     • Concepts for Programming Roadrunner
       • MPI, Opteron+Cell, “local-store” memory & DMA transfers
     • Status and plans for Roadrunner
       • Unclassified Science opportunities

  3. The Cell Processor: a harbinger of the future

  4. Microprocessor trends are changing
     • Moore’s law still holds, but is now being realized differently
       • Frequency, power, & instruction-level-parallelism (ILP) have all plateaued
       • Multi-core is here today and many-core (≥ 32) looks to be the future
       • Memory bandwidth and capacity per core are headed downward (caused by increased core counts)
     • Key findings of Jan. 2007 IDC Study: “Next Phase in HPC”
       • new ways of dealing with parallelism will be required
       • must focus more heavily on bandwidth (flow of data) and less on processor
     [Figure: trends in transistors, clock, power, and ILP, with the 386, Pentium, and Montecito marked. From Burton Smith, LASCI-06 keynote, with permission]

  5. We are programming thousands of processors with MPI
     Message Passing
     • High protocol overhead
     • Large granularity
     • Symmetric
     • Synchronous
     [Figure labels: cluster, node]

  6. Future supercomputers will require new programming models
     Message Passing
     • High protocol overhead
     • Large granularity
     • Symmetric
     • Synchronous
     Not Message Passing
     • Parallelism and heterogeneity require new approaches: Threads, OpenMP, Accelerators …
     [Figure labels: cluster, node, socket]

  7. The Cell processor is an (8+1)-way heterogeneous parallel processor
     • Cell Broadband Engine (CBE*) developed by Sony-Toshiba-IBM
       • used in Sony PlayStation 3
     • 8 Synergistic Processing Elements (SPEs)
       • 128-bit vector engines
       • 256 kB local memory (LS = Local Store)
       • Direct Memory Access (DMA) engine (25.6 GB/s each)
       • Chip interconnect (EIB)
       • Run SPE-code as POSIX threads (SPMD, MPMD, streaming)
     • PowerPC PPE runs Linux OS
     • Current Cell performance:
       • 204.8 GF/s SP & 13.65 GF/s DP
       • 512 MB @ 25.6 GB/s XDR memory
       • Insufficient for a Petaflop/s machine
     [Figure: Cell chip diagram with labels SPU, SPE, PowerPC, to memory, to PCIe]
     * trademark of Sony Computer Entertainment, Inc.
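
The single-precision peak on this slide can be reproduced with simple arithmetic. The sketch below is only a plausibility check, not anything taken from the talk: the 3.2 GHz SPE clock and the one fused multiply-add per vector lane per cycle are assumptions that do not appear on the slide.

```python
# Peak-rate arithmetic for the single-precision Cell figure on the slide.
# ASSUMPTIONS (not stated on the slide): a 3.2 GHz SPE clock and one fused
# multiply-add (counted as 2 flops) per vector lane per cycle.

SPE_COUNT = 8         # from the slide: 8 SPEs
VECTOR_BITS = 128     # from the slide: 128-bit vector engines
CLOCK_GHZ = 3.2       # assumption
FLOPS_PER_LANE = 2    # assumption: fused multiply-add


def peak_gflops(element_bits: int) -> float:
    """Peak GF/s if every SPE issues one full-width FMA per cycle."""
    lanes = VECTOR_BITS // element_bits
    return SPE_COUNT * lanes * FLOPS_PER_LANE * CLOCK_GHZ


single_precision = peak_gflops(32)  # 4 lanes of 32 bits per vector
print(f"SP peak: {single_precision:.1f} GF/s")
```

Under these assumptions the function returns the slide's 204.8 GF/s for 32-bit elements. With 64-bit elements the same model gives 102.4 GF/s, which matches the double-precision figure the talk quotes for the PowerXCell 8i; the 13.65 GF/s double-precision figure for the current Cell does not follow from this simple model.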

  8. IBM is creating new Cell processors
     • PowerXCell 8i chip: to be used in Roadrunner
       • Enhanced Cell (1+8eDP SPE), 65nm SOI
       • 102.4 GF/s double precision
       • 4 GB DDR2 @ 25.6 GB/s
     • PowerXCell is IBM’s name for this new enhanced double-precision (eDP) Cell processor variant
     [Figure: Cell roadmap, 2006 to 2010. Cost Reduction Path: Cell BE (1+8) at 90nm SOI, 65nm SOI, and 45nm SOI (continued shrinks). Performance Enhancements/Scaling Path: Enhanced Cell (1+8eDP SPE) at 65nm SOI, then Next Gen (2PPE’+32SPE’) at 45nm SOI, ~1 TF-SP (est.)]
     All future dates and specifications are estimations only; subject to change without notice. Dashed outlines indicate concept designs.

  9. Industry presentations show changing trends in processors
     • Intel’s Microprocessor Research Lab
     • AMD Fusion
     • Intel’s Visual Computing Group - Larrabee
     • nVidia G80 - 2006
     Taken from publicly available information

  10. Roadrunner is on a different path to petascale
     LANL has been looking at hybrid & petascale computing for some time
     • Cell is fast
     • Cell is energy efficient
     • Cell is commodity
     • Cell brings heterogeneity
     • Cell brings fine-scale parallelism
     [Figure: timeline, 2002 to 2007, with labels DARK HORSE; Skunkworks; Adv. Arch. Project; HPCS: PERCS PF system design; technologies shown: Cell, 3D memory, Clearspeed, GPU, FPGA; Roadrunner Contract Award 9/8/2006]

  11. A Roadrunner is born

  12. IBM built hybrid nodes in Rochester, MN and assembled the system in Poughkeepsie, NY

  13. Roadrunner broke the 1 Petaflop/s mark on May 26th, 2008
     • Calculation: ~2 hours
     • Matrix: ~5 trillion entries
     • Performance: 1.026 Petaflop/s
     Only 3 days after the full machine was finally assembled!
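
The three numbers on this slide are mutually consistent, which a short calculation can show. This is a rough sketch and not the benchmark's own accounting: it assumes a square dense matrix solved by LU factorization and uses only the leading term of the conventional operation count, about (2/3)·n³.

```python
import math

# Numbers taken from the slide.
MATRIX_ENTRIES = 5e12        # "~5 trillion entries"
SUSTAINED_FLOPS = 1.026e15   # 1.026 Petaflop/s

# ASSUMPTION: square dense n x n matrix, LU factorization,
# leading-order operation count of (2/3) * n**3.
n = math.sqrt(MATRIX_ENTRIES)
total_flops = (2.0 / 3.0) * n ** 3
hours = total_flops / SUSTAINED_FLOPS / 3600.0

print(f"matrix order n ~ {n:.3e}")
print(f"estimated run time ~ {hours:.2f} hours")
```

Under these assumptions the matrix order comes out a little above two million and the estimated run time lands close to the "~2 hours" quoted on the slide.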

  14. Roadrunner is a TOP performer!
     #1 on the TOP500 (from June 2008 Top 500 List)

     #  SITE                                        SYSTEM                              TF/sec
     1  DOE/NNSA/LANL, United States                Roadrunner, QS22/LS21 (IBM)         1026
     2  DOE/NNSA/LLNL, United States                Blue Gene/L (IBM)                   478
     3  Argonne National Laboratory, United States  Blue Gene/P (IBM)                   450
     4  Texas Adv. Comp. Center, United States      SunBlade Opteron IB Cluster (Sun)   326
     5  DOE/ORNL, United States                     Jaguar, XT4-QuadCore (Cray)         205
     6  Forschungszentrum Juelich, Germany          Blue Gene/P (IBM)                   180

     #3 on the Green500
     [Figure: Green 500, Mflops / Watt versus Position, with labels Cell QS22 clusters, Roadrunner, BG/P, Xeon Quad, BG/L]

  15. Roadrunner System Configuration

  16. Roadrunner Phase 3 is Cell-accelerated, not a cluster of Cells
     • Multi-socket multi-core Opteron cluster nodes (100’s of such cluster nodes)
     • Cell-accelerated compute node: add Cells to each individual node
     • I/O gateway nodes
     • “Scalable Unit” Cluster Interconnect Switch/Fabric
     Node-attached Cells is what makes Roadrunner different!

  17. A Roadrunner TriBlade node integrates Cell and Opteron blades
     • QS22 is an IBM Cell blade containing two new enhanced double-precision (eDP/PowerXCell™) Cell chips
     • Expansion blade connects two QS22 via four PCI-e x8 links to LS21 & provides the node’s ConnectX IB 4X DDR cluster attachment
     • LS21 is an IBM dual-socket Opteron blade
     • 4-wide IBM BladeCenter packaging
     • Roadrunner Triblades are completely diskless and run from RAM disks with NFS & Panasas only to the LS21
     • Node design points:
       • One Cell chip per Opteron core
       • ~400 GF/s double-precision & ~800 GF/s single-precision
       • 16 GB Opteron memory PLUS 16 GB Cell memory
       • 1 PCI-E x8 to each Cell
     [Figure: TriBlade block diagram. Two QS22 blades, each with two Cell eDP chips (4 GB each), two I/O hubs, 2x PCI-E x16 (unused), and 2x PCI-E x8. Dual PCI-E x8 flex-cables (2 GB/s, 2 µs per PCI-e link) run to the expansion blade, which carries two HT2100 bridges, an HSDC connector, a PCI-E x8 (unused), a standard PCI-E connector, and the IB 4x DDR link to the cluster (PCI-E x8, 2 GB/s, 2 µs). The LS21 holds two dual-core AMD Opterons (8 GB each) joined by HT x16 links, with 2 x HT x16 to the expansion connector.]
     Design point: One Cell per Opteron core
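
The node design points follow from the component counts on this slide. The sketch below tallies them; it counts the Cell contribution only, takes the 102.4 GF/s double-precision figure from the PowerXCell 8i slide, and assumes (this is not stated in the talk) that the eDP Cell keeps the 204.8 GF/s single-precision peak quoted for the current Cell.

```python
# Per-node totals for one TriBlade, using counts from the slides.
QS22_BLADES = 2              # two QS22 blades per TriBlade
CELLS_PER_QS22 = 2           # two eDP Cell chips per QS22
GB_PER_CELL = 4              # diagram: 4 GB per Cell
OPTERON_SOCKETS = 2          # LS21 is dual-socket
CORES_PER_OPTERON = 2        # diagram: dual-core Opterons
GB_PER_OPTERON_SOCKET = 8    # diagram: 8 GB per socket
CELL_DP_GFLOPS = 102.4       # PowerXCell 8i peak DP, from the earlier slide
CELL_SP_GFLOPS = 204.8       # ASSUMPTION: same SP peak as the current Cell

cells = QS22_BLADES * CELLS_PER_QS22
opteron_cores = OPTERON_SOCKETS * CORES_PER_OPTERON
cell_memory_gb = cells * GB_PER_CELL
opteron_memory_gb = OPTERON_SOCKETS * GB_PER_OPTERON_SOCKET
node_dp_gflops = cells * CELL_DP_GFLOPS   # Cell contribution only
node_sp_gflops = cells * CELL_SP_GFLOPS   # Cell contribution only

print(f"Cells per node: {cells}, Opteron cores per node: {opteron_cores}")
print(f"Memory: {opteron_memory_gb} GB Opteron + {cell_memory_gb} GB Cell")
print(f"Cell peak: {node_dp_gflops:.1f} GF/s DP, {node_sp_gflops:.1f} GF/s SP")
```

The tally gives four Cells for four Opteron cores, 16 GB on each side, and Cell peaks that round to the "~400" and "~800" GF/s figures on the slide.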

  18. A Roadrunner TriBlade node integrates Cell and Opteron blades
     [Figure labels: Two QS22’s with 2 Cells each; Expansion blade; LS21 with two dual-core Opterons]

  19. A Connected Unit (CU) forms a building block
     • 180 TriBlades (TriBlade 1 to TriBlade 180) in 60 BC-H chassis
     • ISR2012 IB 4x DDR switch
       • 180 links to the TriBlades
       • 96 links to 2nd stage switches
     • 12 2U I/O nodes (I/O Node 1 to I/O Node 12), with 10 GigE to file systems & LANs
     • 2U service node
     • Link rates: IB 4x DDR 2+2 GB/s; 10 GigE 1+1 GB/s
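
The Connected Unit figures can be combined with the per-node numbers from the TriBlade slide to get CU-level totals. This is a back-of-the-envelope sketch: it counts the Cell double-precision peak only, and it assumes one IB 4x DDR link per TriBlade and reads "2+2 GB/s" as 2 GB/s in each direction.

```python
# Connected Unit (CU) totals, using counts from the slides.
TRIBLADES_PER_CU = 180      # from the CU slide
CHASSIS_PER_CU = 60         # from the CU slide (BC-H chassis)
UPLINKS = 96                # links to 2nd stage switches
CELLS_PER_TRIBLADE = 4      # two QS22 blades with two Cells each
CELL_DP_GFLOPS = 102.4      # PowerXCell 8i peak DP, from the earlier slide
IB_GB_PER_S = 2.0           # ASSUMPTION: 2 GB/s per direction per IB link

triblades_per_chassis = TRIBLADES_PER_CU // CHASSIS_PER_CU
cells_per_cu = TRIBLADES_PER_CU * CELLS_PER_TRIBLADE
cu_dp_tflops = cells_per_cu * CELL_DP_GFLOPS / 1000.0   # Cells only

# ASSUMPTION: one IB link per TriBlade, so 180 node links share 96 uplinks.
node_bandwidth = TRIBLADES_PER_CU * IB_GB_PER_S
uplink_bandwidth = UPLINKS * IB_GB_PER_S
oversubscription = node_bandwidth / uplink_bandwidth

print(f"TriBlades per chassis: {triblades_per_chassis}")
print(f"Cells per CU: {cells_per_cu}")
print(f"Cell peak per CU: {cu_dp_tflops:.1f} TF/s DP")
print(f"Node-to-uplink bandwidth ratio: {oversubscription:.3f}")
```

Three TriBlades per chassis is consistent with the 4-wide BladeCenter packaging noted on the TriBlade slide.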
