

  1. Exa & Yotta Scale Data 
SC '08 Panel, November 21, 2008, Austin, TX. Garth Gibson, Carnegie Mellon University and Panasas Inc. SciDAC Petascale Data Storage Institute (PDSI), www.pdsi-scidac.org

  2. Charting the Path thru Exa- to Yotta-scale
• Top500.org scaling 100%/yr: Exa in 2018, Zetta in 2028, Yotta in 2038
• Hard to make engineering predictions 10 years out, let alone 30 years
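The Exa/Zetta/Yotta dates are simply the 2X-per-year Top500 trend compounded. A minimal sketch of that arithmetic, assuming a ~1 PFLOP/s baseline in 2008 (Roadrunner, as slide 5 notes):

```python
# Sketch of the 2X-per-year arithmetic behind the Exa/Zetta/Yotta dates,
# assuming a ~1 PFLOP/s starting point in 2008 (Roadrunner, per slide 5).
start_year, start_flops = 2008, 1e15

for target_flops, name in [(1e18, "Exa"), (1e21, "Zetta"), (1e24, "Yotta")]:
    flops, year = start_flops, start_year
    while flops < target_flops:
        flops *= 2          # Top500.org trend: 100%/yr
        year += 1
    print(f"{name}scale around {year}")
# -> Exascale around 2018, Zettascale around 2028, Yottascale around 2038
```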

  3. Storage Scaling
• Trends are quoted in capacity & performance
• Balance calls for linear scaling with FLOPS
• Disk capacity grows near Moore's Law
• Disk capacity tracks compute speed
• Parallelism grows no better or worse than compute
• But disk bandwidth +20%/yr < Moore's Law
• Parallelism for BW grows faster than compute!
• Revisit reason for BW balance: fault tolerance
• And random access? +7%/yr is nearly no growth
• Coupled with BW parallelism, good growth
• But new workloads, analytics, more access intensive
• Solid state storage looks all but inevitable here
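A small sketch of the divergence these bullets describe, compounding the quoted per-disk growth rates against compute for a decade. The 1.6X/yr capacity figure is an assumed round number standing in for "near Moore's Law"; the bandwidth and access rates are the slide's:

```python
# Sketch: disks needed to keep capacity, bandwidth, and random access
# balanced against compute after 10 years of the quoted growth rates.
# Rates: compute 2X/yr (Top500 trend), per-disk capacity ~1.6X/yr
# (near Moore's Law; an assumed round number), per-disk bandwidth
# 1.2X/yr and per-disk random access 1.07X/yr (rates from this slide).
years = 10
compute = capacity_per_disk = bw_per_disk = iops_per_disk = 1.0
for _ in range(years):
    compute *= 2.0
    capacity_per_disk *= 1.6
    bw_per_disk *= 1.2
    iops_per_disk *= 1.07

print(f"disk count for capacity balance:  {compute / capacity_per_disk:6.0f}x today")
print(f"disk count for bandwidth balance: {compute / bw_per_disk:6.0f}x today")
print(f"disk count for IOPS balance:      {compute / iops_per_disk:6.0f}x today")
# Capacity parallelism grows modestly, bandwidth parallelism much faster,
# and access-rate parallelism fastest of all, which is why solid state
# looks inevitable for access-intensive workloads.
```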

  4. Fault Data & Trends
• Los Alamos root cause logs
• 22 clusters & 5,000 nodes
• covers 9 years & continues
• cfdr.usenix.org publication + PNNL, NERSC, Sandia, PSC, …
[Chart: failures per year per processor (# failures normalized by # procs), roughly 0.1-0.8, across systems from 2-way to 256-way nodes, 32-1,024 nodes and 128-6,152 processors, deployed 1996-2004]

  5. Projections: More Failures
• Continue top500.org 2X annually
• 1 PF Roadrunner, May 2008
• Cycle time flat, but many more of them
• Moore's law: 2X cores/chip in 18 mos
• # sockets up, so 1/MTTI = failure rate up 25%-50% per year
• Optimistic 0.1 failures per year per socket (vs. historic 0.25)
[Charts: projected number of sockets (10,000 to 10,000,000) and mean time to interrupt (0-600 min), 2006-2018, for 18-, 24-, and 30-month core-count doubling periods]
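The socket and MTTI curves follow from a simple model: total compute doubles yearly, cores per chip double every 18/24/30 months, and socket count supplies the remaining growth, so the system failure rate is the per-socket rate times a growing socket count. A minimal sketch, where the ~10,000-socket 2008 baseline is a hypothetical round number and 0.1 failures/socket/yr is the slide's optimistic rate:

```python
# Sketch of the socket-count and MTTI projections. Assumes total compute
# doubles yearly while cores/chip double every `doubling_months` months,
# so socket count supplies the rest of the growth. The 2008 baseline of
# ~10,000 sockets is a hypothetical round number.
MINUTES_PER_YEAR = 365.25 * 24 * 60
FAILURES_PER_SOCKET_PER_YEAR = 0.1     # the slide's optimistic rate

def project_mtti(doubling_months, base_sockets=10_000, start=2008, years=10):
    socket_growth = 2.0 / 2.0 ** (12.0 / doubling_months)   # 1.26x-1.52x per year
    sockets = float(base_sockets)
    for year in range(start, start + years + 1):
        mtti_min = MINUTES_PER_YEAR / (sockets * FAILURES_PER_SOCKET_PER_YEAR)
        print(f"{year} ({doubling_months}-mo cores): "
              f"{sockets:12,.0f} sockets, MTTI ~{mtti_min:5.0f} min")
        sockets *= socket_growth

for months in (18, 24, 30):
    project_mtti(months)
```

The per-year socket growth of 1.26x to 1.52x is exactly the 25%-50%/yr failure-rate growth quoted above.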

  6. Everything Must Scale with Compute
[Diagram: balanced-system scaling for 2000, 2004, 2008, and 2012: computing speed (TFLOP/s, up to 5,000), memory (TB, up to 2,500), disk (PB), parallel I/O (GB/s), metadata inserts/sec, network speed (Gb/s), and archival storage (GB/s) all growing in step]
• Fault tolerance challenge: periodic (p) pause to checkpoint (t)
• Major need for storage bandwidth
• Balanced systems: storage speed tracks FLOPS and memory, so checkpoint capture time (t) is constant
• 1 - AppUtilization = t/p + p/(2*MTTI); optimal interval satisfies p^2 = 2*t*MTTI
[Charts: projected mean time to interrupt (0-600 min) and application utilization (0-100%), 2006-2018, for 18-, 24-, and 30-month doubling periods]
• But dropping MTTI kills app utilization!
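The formula above is the classic first-order checkpoint model: a fraction t/p of time goes to writing checkpoints, and on average half an interval p of work is lost per interrupt; minimizing the sum gives p^2 = 2*t*MTTI. A minimal sketch of why falling MTTI hurts, with a 15-minute capture time t as an illustrative assumption (the model says t stays constant in a balanced system):

```python
import math

def app_utilization(t_minutes, mtti_minutes):
    """Slide's first-order model: 1 - U = t/p + p/(2*MTTI),
    minimized at the optimal checkpoint interval p = sqrt(2*t*MTTI)."""
    p = math.sqrt(2.0 * t_minutes * mtti_minutes)
    return 1.0 - (t_minutes / p + p / (2.0 * mtti_minutes))

# In a balanced system the checkpoint capture time t stays constant;
# 15 minutes here is an illustrative assumption, not a slide figure.
t = 15.0
for mtti in (500.0, 250.0, 100.0, 50.0):      # minutes
    print(f"MTTI {mtti:5.0f} min -> best-case utilization {app_utilization(t, mtti):4.0%}")
# Even with the optimal interval, utilization collapses as MTTI falls.
```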

  7. Fault Tolerance Drives Bandwidth
• More storage bandwidth?
• disk speed 1.2X/yr
  – # disks +67%/yr just for balance!
• to also counter MTTI
  – # disks +130%/yr!
• Little appetite for the cost
• N-1 checkpoints (N processes writing one shared file) hurt BW
• Concurrent strided write
• Will fix with internal file structure: write optimized
• See Zest, ADIOS, ….
[Chart: required disk bandwidth increase (up to 1,000,000x, log scale), 2006-2018, for 18-, 24-, and 30-month doubling periods]
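The +67%/yr and +130%/yr figures come from dividing the required aggregate bandwidth growth by per-disk bandwidth growth. A sketch of that arithmetic; the 1.4x/yr failure-rate factor is my assumed midpoint of slide 5's 25%-50%/yr range, not a number on this slide:

```python
# Sketch of the disk-count arithmetic behind +67%/yr and +130%/yr.
memory_growth = 2.0        # checkpoint size tracks memory, 2X/yr
disk_bw_growth = 1.2       # per-disk bandwidth, +20%/yr
failure_rate_growth = 1.4  # MTTI shrinks ~1.4x per year (assumed midpoint)

# Balance only: aggregate checkpoint bandwidth must track memory size.
disks_for_balance = memory_growth / disk_bw_growth

# Balance plus constant utilization: checkpoint time t must also shrink
# in step with MTTI (slide 6's p^2 = 2*t*MTTI model), so bandwidth needs
# an extra factor equal to the failure-rate growth.
disks_to_counter_mtti = memory_growth * failure_rate_growth / disk_bw_growth

print(f"disk count growth for balance alone:    +{disks_for_balance - 1:.0%}/yr")
print(f"disk count growth to also counter MTTI: +{disks_to_counter_mtti - 1:.0%}/yr")
# -> roughly +67%/yr and +133%/yr, matching the slide's figures.
```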

  8. Alternative: Specialize Checkpoints
• Dedicated checkpoint device (e.g., PSC Zest)
• Stage checkpoint through fast memory: fast write from the compute cluster into checkpoint memory, slow write from there to disk storage devices
• Cost of dedicated memory is a large fraction of total
• Cheaper SSD (flash?) is now bandwidth limited
• There is hope: 1 flash chip == 1 disk BW
[Diagram: compute cluster → fast write → checkpoint memory → slow write → disk storage devices]
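A rough sizing sketch for the staged-checkpoint idea. Every concrete number here is an illustrative assumption: ~2.5 PB of memory (slide 6's 2012 balanced system), a 15-minute capture target, and ~100 MB/s of sequential write bandwidth per device, where "device" is either one disk or one flash chip per the slide's rule of thumb:

```python
# Rough sizing sketch for a staged checkpoint tier (Zest-style).
memory_bytes = 2.5e15            # checkpoint ~ memory size (slide 6, 2012)
capture_seconds = 15 * 60        # assumed capture-time target
bw_per_device = 100e6            # bytes/s per disk or per flash chip (assumed)

required_bw = memory_bytes / capture_seconds
devices = required_bw / bw_per_device
print(f"aggregate write bandwidth needed: {required_bw / 1e9:,.0f} GB/s")
print(f"devices needed (disks or flash chips): {devices:,.0f}")
# The appeal: if a flash chip matches a disk's write bandwidth, the
# checkpoint tier can reach the same aggregate bandwidth at far lower
# cost than a tier of dedicated memory.
```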

  9. Application Level Alternatives
• Compress checkpoints!
• plenty of cycles available
• smaller fraction of memory each year (application specific)
  – 25-50% smaller per year
• Classic enterprise answer: process pairs duplication
• Flat 50% efficiency cost, plus message duplication
[Chart: projected memory % in checkpoint (0-100%), 2006-2018, for 18-, 24-, and 30-month doubling periods]
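Why compression helps follows from the slide-6 model: utilization depends on the ratio of checkpoint time to MTTI, so shrinking the checkpointed state each year can offset the shrinking MTTI. A small sketch with assumed rates (checkpoints 30%/yr smaller, failure rate up 40%/yr); the trend is the point, not the particular numbers:

```python
import math

def utilization(t, mtti):
    # Slide 6's model: 1 - U = t/p + p/(2*MTTI), optimal p = sqrt(2*t*MTTI)
    p = math.sqrt(2.0 * t * mtti)
    return 1.0 - (t / p + p / (2.0 * mtti))

# Illustrative assumptions: 15-minute checkpoint and 500-minute MTTI in 2008,
# checkpoints shrinking 30%/yr with compression, failure rate up 40%/yr.
t0, mtti = 15.0, 500.0
for year in range(2008, 2015):
    plain = utilization(t0, mtti)
    compressed = utilization(t0 * 0.7 ** (year - 2008), mtti)
    print(f"{year}: uncompressed {plain:4.0%}   compressed {compressed:4.0%}")
    mtti /= 1.4
# Shrinking the checkpointed state roughly in step with MTTI holds
# utilization steady; process pairs instead pay a flat ~50% up front.
```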

  10. Storage Suffers Failures Too
Disk populations studied (type of drive, count, duration), drawn from supercomputing site X, various HPC sites, and internet services site Y:
• HPC1: 18GB & 36GB 10K RPM SCSI, 3,400 drives, 5 yrs
• HPC2: 36GB 10K RPM SCSI, 520 drives, 2.5 yrs
• HPC3: 15K RPM SCSI & 7.2K RPM SATA, 14,208 drives, 1 yr
• HPC4: 250GB, 400GB & 500GB SATA, 13,634 drives, 3 yrs
• COM1: 10K RPM SCSI, 26,734 drives, 1 month
• COM2: 15K RPM SCSI, 39,039 drives, 1.5 yrs
• COM3: 10K RPM FC-AL, 3,700 drives, 1 yr

  11. Storage Failure Recovery is On-the-fly
• Scalable performance = more disks
• But disks are getting bigger
• Recovery time per failure is increasing: hours to days on disk arrays
• Consider # of concurrent disk recoveries: e.g. 10,000 disks, a 3% per year replacement rate (SATA field data average ~3%, vs. datasheet ARR of 0.58%-0.88%), and 1+ day recovery each, a constant state of recovering?
• Maybe soon 100s of concurrent recoveries (at all times!)
• Design the normal case for many failures (huge challenge!)
[Chart: projected # of concurrent reconstructions (0.1 to 1,000, log scale), 2006-2018]
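The "constant state of recovering" point is a Little's-law estimate: expected rebuilds in flight = failure arrival rate times rebuild duration. A sketch starting from the slide's example numbers; the disk-count and rebuild-time growth rates are my assumptions for illustration:

```python
# Little's-law sketch: expected concurrent rebuilds =
# (disk failures per day) * (rebuild duration in days).
def concurrent_rebuilds(n_disks, annual_replacement_rate, rebuild_days):
    failures_per_day = n_disks * annual_replacement_rate / 365.0
    return failures_per_day * rebuild_days

# Start from the slide's example (10,000 disks, 3%/yr, 1+ day rebuild),
# then grow disk count +67%/yr (slide 7's balance figure) and rebuild
# time +20%/yr as capacity outgrows bandwidth (assumed rates).
n_disks, rebuild_days = 10_000.0, 1.0
for year in range(2008, 2019):
    expected = concurrent_rebuilds(n_disks, 0.03, rebuild_days)
    print(f"{year}: {n_disks:12,.0f} disks, {rebuild_days:4.1f}-day rebuild, "
          f"~{expected:7.1f} rebuilds in flight")
    n_disks *= 1.67
    rebuild_days *= 1.2
# Under these assumptions roughly one rebuild is always in flight in 2008,
# growing to hundreds of concurrent rebuilds by the late 2010s.
```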

  12. Parallel Scalable Repair
• Defer the problem by making failed-disk repair a parallel app
• File replication and, more recently, object RAID can scale repair
  - "decluster" redundancy groups over all disks (mirror or RAID)
  - use all disks for every repair; faster is less vulnerable
• Object (chunk of a file) storage architecture dominating at scale: PanFS, Lustre, PVFS, … GFS, HDFS, … Centera, …
[Diagram: objects declustered across disks; chart: rebuild MB/sec]
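A sketch of why declustering speeds repair: a conventional rebuild is capped by one spare disk's write bandwidth, while declustered placement lets every disk contribute a slice of the read and write work, so aggregate rebuild bandwidth scales roughly with the number of participating disks. The per-disk figures are illustrative assumptions:

```python
# Sketch: rebuild time with a dedicated spare vs. declustered placement.
# Illustrative per-disk numbers: 1 TB drives, 50 MB/s each disk can spare
# for rebuild traffic without starving foreground I/O.
DISK_CAPACITY_MB = 1_000 * 1024
REBUILD_BW_PER_DISK_MB_S = 50.0

def rebuild_minutes(participating_disks):
    # One disk's worth of data, reconstructed at the aggregate rate of
    # all disks sharing the repair work.
    aggregate_bw = participating_disks * REBUILD_BW_PER_DISK_MB_S
    return DISK_CAPACITY_MB / aggregate_bw / 60.0

print(f"rebuild onto one dedicated spare: {rebuild_minutes(1):7.1f} min")
print(f"declustered over 100 disks:       {rebuild_minutes(100):7.1f} min")
print(f"declustered over 1,000 disks:     {rebuild_minutes(1000):7.1f} min")
# Shorter rebuilds shrink the window in which a second failure can cause
# data loss: "faster is less vulnerable".
```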

  13. Scaling Exa- to Yotta-Scale
• Exascale capacity parallelism not worse than compute parallelism
  – But internal fault tolerance harder for storage than compute
• Exascale bandwidth a big problem, but dominated by checkpoint
  – Specialize checkpoint solutions to reduce stress
  – Log-structured files, dedicated devices, Flash memory …
  – Application alternatives: state compression, process pairs
• Long term: 20%/yr bandwidth growth a serious concern
  – Primary problem is economic: what is the value of data vs. compute?
• Long term: 7%/yr access rate growth threatens market size
  – Solid state will replace disk for small random access

  14. SciDAC Petascale Data Storage Institute
• High Performance Storage Expertise & Experience
• Carnegie Mellon University, Garth Gibson, lead PI
• U. of California, Santa Cruz, Darrell Long
• U. of Michigan, Ann Arbor, Peter Honeyman
• Lawrence Berkeley National Lab, William Kramer
• Oak Ridge National Lab, Phil Roth
• Pacific Northwest National Lab, Evan Felix
• Los Alamos National Lab, Gary Grider
• Sandia National Lab, Lee Ward
