

  1. Exa & Yotta Scale Data 
SC '08 Panel, November 21, 2008, Austin, TX. Garth Gibson, Carnegie Mellon University and Panasas Inc. SciDAC Petascale Data Storage Institute (PDSI), www.pdsi-scidac.org

  2. Charting the Path thru Exa- to Yotta-scale
• Top500.org scaling 100%/yr: Exa in 2018, Zetta in 2028, Yotta in 2038
• Hard to make engineering predictions 10 years out, let alone 30 years
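The Exa/Zetta/Yotta dates are simply the 2X-per-year Top500 trend compounded. A minimal sketch of that arithmetic, assuming a ~1 PFLOP/s baseline in 2008 (Roadrunner, as slide 5 notes):

```python
# Sketch of the 2X-per-year arithmetic behind the Exa/Zetta/Yotta dates,
# assuming a ~1 PFLOP/s starting point in 2008 (Roadrunner, per slide 5).
start_year, start_flops = 2008, 1e15

for target_flops, name in [(1e18, "Exa"), (1e21, "Zetta"), (1e24, "Yotta")]:
    flops, year = start_flops, start_year
    while flops < target_flops:
        flops *= 2          # Top500.org trend: 100%/yr
        year += 1
    print(f"{name}scale around {year}")
# -> Exascale around 2018, Zettascale around 2028, Yottascale around 2038
```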

  3. Storage Scaling
• Trends are quoted in capacity & performance
• Balance calls for linear scaling with FLOPS
• Disk capacity grows near Moore's Law
• Disk capacity tracks compute speed
• Parallelism grows no better or worse than compute
• But disk bandwidth +20%/yr < Moore's Law
• Parallelism for BW grows faster than compute!
• Revisit reason for BW balance: fault tolerance
• And random access? +7%/yr is nearly no growth
• Coupled with BW parallelism, good growth
• But new workloads, analytics, more access intensive
• Solid state storage looks all but inevitable here
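A small sketch of the divergence these bullets describe, compounding the quoted per-disk growth rates against compute for a decade. The 1.6X/yr capacity figure is an assumed round number standing in for "near Moore's Law"; the bandwidth and access rates are the slide's:

```python
# Sketch: disks needed to keep capacity, bandwidth, and random access
# balanced against compute after 10 years of the quoted growth rates.
# Rates: compute 2X/yr (Top500 trend), per-disk capacity ~1.6X/yr
# (near Moore's Law; an assumed round number), per-disk bandwidth
# 1.2X/yr and per-disk random access 1.07X/yr (rates from this slide).
years = 10
compute = capacity_per_disk = bw_per_disk = iops_per_disk = 1.0
for _ in range(years):
    compute *= 2.0
    capacity_per_disk *= 1.6
    bw_per_disk *= 1.2
    iops_per_disk *= 1.07

print(f"disk count for capacity balance:  {compute / capacity_per_disk:6.0f}x today")
print(f"disk count for bandwidth balance: {compute / bw_per_disk:6.0f}x today")
print(f"disk count for IOPS balance:      {compute / iops_per_disk:6.0f}x today")
# Capacity parallelism grows modestly, bandwidth parallelism much faster,
# and access-rate parallelism fastest of all, which is why solid state
# looks inevitable for access-intensive workloads.
```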

  4. Fault Data & Trends
• Los Alamos root cause logs
• 22 clusters & 5,000 nodes
• covers 9 years & continues
• cfdr.usenix.org publication + PNNL, NERSC, Sandia, PSC, …
[Chart: failures per year per processor (# failures normalized by # procs), roughly 0.1-0.8, across systems from 2-way to 256-way nodes, 32-1,024 nodes and 128-6,152 processors, deployed 1996-2004]

  5. Projections: More Failures
• Continue top500.org 2X annually
• 1 PF Roadrunner, May 2008
• Cycle time flat, but many more of them
• Moore's law: 2X cores/chip in 18 mos
• # sockets up, so 1/MTTI = failure rate up 25%-50% per year
• Optimistic 0.1 failures per year per socket (vs. historic 0.25)
[Charts: projected number of sockets (10,000 to 10,000,000) and mean time to interrupt (0-600 min), 2006-2018, for 18-, 24-, and 30-month core-count doubling periods]
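The socket and MTTI curves follow from a simple model: total compute doubles yearly, cores per chip double every 18/24/30 months, and socket count supplies the remaining growth, so the system failure rate is the per-socket rate times a growing socket count. A minimal sketch, where the ~10,000-socket 2008 baseline is a hypothetical round number and 0.1 failures/socket/yr is the slide's optimistic rate:

```python
# Sketch of the socket-count and MTTI projections. Assumes total compute
# doubles yearly while cores/chip double every `doubling_months` months,
# so socket count supplies the rest of the growth. The 2008 baseline of
# ~10,000 sockets is a hypothetical round number.
MINUTES_PER_YEAR = 365.25 * 24 * 60
FAILURES_PER_SOCKET_PER_YEAR = 0.1     # the slide's optimistic rate

def project_mtti(doubling_months, base_sockets=10_000, start=2008, years=10):
    socket_growth = 2.0 / 2.0 ** (12.0 / doubling_months)   # 1.26x-1.52x per year
    sockets = float(base_sockets)
    for year in range(start, start + years + 1):
        mtti_min = MINUTES_PER_YEAR / (sockets * FAILURES_PER_SOCKET_PER_YEAR)
        print(f"{year} ({doubling_months}-mo cores): "
              f"{sockets:12,.0f} sockets, MTTI ~{mtti_min:5.0f} min")
        sockets *= socket_growth

for months in (18, 24, 30):
    project_mtti(months)
```

The per-year socket growth of 1.26x to 1.52x is exactly the 25%-50%/yr failure-rate growth quoted above.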

  6. Everything Must Scale with Compute
[Diagram: balanced-system scaling for 2000, 2004, 2008, and 2012: computing speed (TFLOP/s, up to 5,000), memory (TB, up to 2,500), disk (PB), parallel I/O (GB/s), metadata inserts/sec, network speed (Gb/s), and archival storage (GB/s) all growing in step]
• Fault tolerance challenge: periodic (p) pause to checkpoint (t)
• Major need for storage bandwidth
• Balanced systems: storage speed tracks FLOPS and memory, so checkpoint capture time (t) is constant
• 1 - AppUtilization = t/p + p/(2*MTTI); optimal interval satisfies p^2 = 2*t*MTTI
[Charts: projected mean time to interrupt (0-600 min) and application utilization (0-100%), 2006-2018, for 18-, 24-, and 30-month doubling periods]
• But dropping MTTI kills app utilization!
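The formula above is the classic first-order checkpoint model: a fraction t/p of time goes to writing checkpoints, and on average half an interval p of work is lost per interrupt; minimizing the sum gives p^2 = 2*t*MTTI. A minimal sketch of why falling MTTI hurts, with a 15-minute capture time t as an illustrative assumption (the model says t stays constant in a balanced system):

```python
import math

def app_utilization(t_minutes, mtti_minutes):
    """Slide's first-order model: 1 - U = t/p + p/(2*MTTI),
    minimized at the optimal checkpoint interval p = sqrt(2*t*MTTI)."""
    p = math.sqrt(2.0 * t_minutes * mtti_minutes)
    return 1.0 - (t_minutes / p + p / (2.0 * mtti_minutes))

# In a balanced system the checkpoint capture time t stays constant;
# 15 minutes here is an illustrative assumption, not a slide figure.
t = 15.0
for mtti in (500.0, 250.0, 100.0, 50.0):      # minutes
    print(f"MTTI {mtti:5.0f} min -> best-case utilization {app_utilization(t, mtti):4.0%}")
# Even with the optimal interval, utilization collapses as MTTI falls.
```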

  7. Fault Tolerance Drives Bandwidth
• More storage bandwidth?
• disk speed 1.2X/yr
  – # disks +67%/yr just for balance!
• to also counter MTTI
  – # disks +130%/yr!
• Little appetite for the cost
• N-1 checkpoints (N processes writing one shared file) hurt BW
• Concurrent strided write
• Will fix with internal file structure: write optimized
• See Zest, ADIOS, ….
[Chart: required disk bandwidth increase (up to 1,000,000x, log scale), 2006-2018, for 18-, 24-, and 30-month doubling periods]
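The +67%/yr and +130%/yr figures come from dividing the required aggregate bandwidth growth by per-disk bandwidth growth. A sketch of that arithmetic; the 1.4x/yr failure-rate factor is my assumed midpoint of slide 5's 25%-50%/yr range, not a number on this slide:

```python
# Sketch of the disk-count arithmetic behind +67%/yr and +130%/yr.
memory_growth = 2.0        # checkpoint size tracks memory, 2X/yr
disk_bw_growth = 1.2       # per-disk bandwidth, +20%/yr
failure_rate_growth = 1.4  # MTTI shrinks ~1.4x per year (assumed midpoint)

# Balance only: aggregate checkpoint bandwidth must track memory size.
disks_for_balance = memory_growth / disk_bw_growth

# Balance plus constant utilization: checkpoint time t must also shrink
# in step with MTTI (slide 6's p^2 = 2*t*MTTI model), so bandwidth needs
# an extra factor equal to the failure-rate growth.
disks_to_counter_mtti = memory_growth * failure_rate_growth / disk_bw_growth

print(f"disk count growth for balance alone:    +{disks_for_balance - 1:.0%}/yr")
print(f"disk count growth to also counter MTTI: +{disks_to_counter_mtti - 1:.0%}/yr")
# -> roughly +67%/yr and +133%/yr, matching the slide's figures.
```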

  8. Alternative: Specialize Checkpoints
• Dedicated checkpoint device (e.g., PSC Zest)
• Stage checkpoint through fast memory: fast write from the compute cluster into checkpoint memory, slow write from there to disk storage devices
• Cost of dedicated memory is a large fraction of total
• Cheaper SSD (flash?) is now bandwidth limited
• There is hope: 1 flash chip == 1 disk BW
[Diagram: compute cluster → fast write → checkpoint memory → slow write → disk storage devices]
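A rough sizing sketch for the staged-checkpoint idea. Every concrete number here is an illustrative assumption: ~2.5 PB of memory (slide 6's 2012 balanced system), a 15-minute capture target, and ~100 MB/s of sequential write bandwidth per device, where "device" is either one disk or one flash chip per the slide's rule of thumb:

```python
# Rough sizing sketch for a staged checkpoint tier (Zest-style).
memory_bytes = 2.5e15            # checkpoint ~ memory size (slide 6, 2012)
capture_seconds = 15 * 60        # assumed capture-time target
bw_per_device = 100e6            # bytes/s per disk or per flash chip (assumed)

required_bw = memory_bytes / capture_seconds
devices = required_bw / bw_per_device
print(f"aggregate write bandwidth needed: {required_bw / 1e9:,.0f} GB/s")
print(f"devices needed (disks or flash chips): {devices:,.0f}")
# The appeal: if a flash chip matches a disk's write bandwidth, the
# checkpoint tier can reach the same aggregate bandwidth at far lower
# cost than a tier of dedicated memory.
```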

  9. Application Level Alternatives
• Compress checkpoints!
• plenty of cycles available
• smaller fraction of memory each year (application specific)
  – 25-50% smaller per year
• Classic enterprise answer: process pairs duplication
• Flat 50% efficiency cost, plus message duplication
[Chart: projected memory % in checkpoint (0-100%), 2006-2018, for 18-, 24-, and 30-month doubling periods]
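Why compression helps follows from the slide-6 model: utilization depends on the ratio of checkpoint time to MTTI, so shrinking the checkpointed state each year can offset the shrinking MTTI. A small sketch with assumed rates (checkpoints 30%/yr smaller, failure rate up 40%/yr); the trend is the point, not the particular numbers:

```python
import math

def utilization(t, mtti):
    # Slide 6's model: 1 - U = t/p + p/(2*MTTI), optimal p = sqrt(2*t*MTTI)
    p = math.sqrt(2.0 * t * mtti)
    return 1.0 - (t / p + p / (2.0 * mtti))

# Illustrative assumptions: 15-minute checkpoint and 500-minute MTTI in 2008,
# checkpoints shrinking 30%/yr with compression, failure rate up 40%/yr.
t0, mtti = 15.0, 500.0
for year in range(2008, 2015):
    plain = utilization(t0, mtti)
    compressed = utilization(t0 * 0.7 ** (year - 2008), mtti)
    print(f"{year}: uncompressed {plain:4.0%}   compressed {compressed:4.0%}")
    mtti /= 1.4
# Shrinking the checkpointed state roughly in step with MTTI holds
# utilization steady; process pairs instead pay a flat ~50% up front.
```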

  10. Storage Suffers Failures Too
Disk populations studied (type of drive, count, duration), drawn from supercomputing site X, various HPC sites, and internet services site Y:
• HPC1: 18GB & 36GB 10K RPM SCSI, 3,400 drives, 5 yrs
• HPC2: 36GB 10K RPM SCSI, 520 drives, 2.5 yrs
• HPC3: 15K RPM SCSI & 7.2K RPM SATA, 14,208 drives, 1 yr
• HPC4: 250GB, 400GB & 500GB SATA, 13,634 drives, 3 yrs
• COM1: 10K RPM SCSI, 26,734 drives, 1 month
• COM2: 15K RPM SCSI, 39,039 drives, 1.5 yrs
• COM3: 10K RPM FC-AL, 3,700 drives, 1 yr

  11. Storage Failure Recovery is On-the-fly
• Scalable performance = more disks
• But disks are getting bigger
• Recovery time per failure is increasing: hours to days on disk arrays
• Consider # of concurrent disk recoveries: e.g. 10,000 disks, a 3% per year replacement rate (SATA field data average ~3%, vs. datasheet ARR of 0.58%-0.88%), and 1+ day recovery each, a constant state of recovering?
• Maybe soon 100s of concurrent recoveries (at all times!)
• Design the normal case for many failures (huge challenge!)
[Chart: projected # of concurrent reconstructions (0.1 to 1,000, log scale), 2006-2018]
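The "constant state of recovering" point is a Little's-law estimate: expected rebuilds in flight = failure arrival rate times rebuild duration. A sketch starting from the slide's example numbers; the disk-count and rebuild-time growth rates are my assumptions for illustration:

```python
# Little's-law sketch: expected concurrent rebuilds =
# (disk failures per day) * (rebuild duration in days).
def concurrent_rebuilds(n_disks, annual_replacement_rate, rebuild_days):
    failures_per_day = n_disks * annual_replacement_rate / 365.0
    return failures_per_day * rebuild_days

# Start from the slide's example (10,000 disks, 3%/yr, 1+ day rebuild),
# then grow disk count +67%/yr (slide 7's balance figure) and rebuild
# time +20%/yr as capacity outgrows bandwidth (assumed rates).
n_disks, rebuild_days = 10_000.0, 1.0
for year in range(2008, 2019):
    expected = concurrent_rebuilds(n_disks, 0.03, rebuild_days)
    print(f"{year}: {n_disks:12,.0f} disks, {rebuild_days:4.1f}-day rebuild, "
          f"~{expected:7.1f} rebuilds in flight")
    n_disks *= 1.67
    rebuild_days *= 1.2
# Under these assumptions roughly one rebuild is always in flight in 2008,
# growing to hundreds of concurrent rebuilds by the late 2010s.
```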

  12. Parallel Scalable Repair
• Defer the problem by making failed-disk repair a parallel app
• File replication and, more recently, object RAID can scale repair
  - "decluster" redundancy groups over all disks (mirror or RAID)
  - use all disks for every repair; faster is less vulnerable
• Object (chunk of a file) storage architecture dominating at scale: PanFS, Lustre, PVFS, … GFS, HDFS, … Centera, …
[Diagram: objects declustered across disks; chart: rebuild MB/sec]
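A sketch of why declustering speeds repair: a conventional rebuild is capped by one spare disk's write bandwidth, while declustered placement lets every disk contribute a slice of the read and write work, so aggregate rebuild bandwidth scales roughly with the number of participating disks. The per-disk figures are illustrative assumptions:

```python
# Sketch: rebuild time with a dedicated spare vs. declustered placement.
# Illustrative per-disk numbers: 1 TB drives, 50 MB/s each disk can spare
# for rebuild traffic without starving foreground I/O.
DISK_CAPACITY_MB = 1_000 * 1024
REBUILD_BW_PER_DISK_MB_S = 50.0

def rebuild_minutes(participating_disks):
    # One disk's worth of data, reconstructed at the aggregate rate of
    # all disks sharing the repair work.
    aggregate_bw = participating_disks * REBUILD_BW_PER_DISK_MB_S
    return DISK_CAPACITY_MB / aggregate_bw / 60.0

print(f"rebuild onto one dedicated spare: {rebuild_minutes(1):7.1f} min")
print(f"declustered over 100 disks:       {rebuild_minutes(100):7.1f} min")
print(f"declustered over 1,000 disks:     {rebuild_minutes(1000):7.1f} min")
# Shorter rebuilds shrink the window in which a second failure can cause
# data loss: "faster is less vulnerable".
```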

  13. Scaling Exa- to Yotta-Scale
• Exascale capacity parallelism not worse than compute parallelism
  – But internal fault tolerance harder for storage than compute
• Exascale bandwidth a big problem, but dominated by checkpoint
  – Specialize checkpoint solutions to reduce stress
  – Log-structured files, dedicated devices, Flash memory …
  – Application alternatives: state compression, process pairs
• Long term: 20%/yr bandwidth growth a serious concern
  – Primary problem is economic: what is the value of data vs. compute?
• Long term: 7%/yr access rate growth threatens market size
  – Solid state will replace disk for small random access

  14. SciDAC Petascale Data Storage Institute
• High Performance Storage Expertise & Experience
• Carnegie Mellon University, Garth Gibson, lead PI
• U. of California, Santa Cruz, Darrell Long
• U. of Michigan, Ann Arbor, Peter Honeyman
• Lawrence Berkeley National Lab, William Kramer
• Oak Ridge National Lab, Phil Roth
• Pacific Northwest National Lab, Evan Felix
• Los Alamos National Lab, Gary Grider
• Sandia National Lab, Lee Ward
