
Storage Lessons from HPC: Extreme Scale Computing Driving Economical Storage Solutions into Your IT Environments - PowerPoint PPT Presentation



  1. Storage Lessons from HPC: Extreme Scale Computing Driving Economical Storage Solutions into Your IT Environments Gary Grider HPC Division Leader, LANL/US DOE Mar 2017 LA-UR-16-26379

  2. Los Alamos

  3. Eight Decades of Production Weapons Computing to Keep the Nation Safe: Maniac, IBM Stretch, CDC, Cray 1, Cray X/Y, CM-2, CM-5, SGI Blue Mountain, DEC/HP Q, IBM Cell Roadrunner, Cray XE Cielo, Ising DWave, Cray Intel KNL Trinity, Crossroads

  4. LANL HPC History Project (50k artifacts), joint work with the U Minn Babbage Institute

  5. Some Products You May Not Realize Were Either Funded or Heavily Influenced by DOE HPC: DataWarp

  6. Ceph begins

  7. The Promise of Parallel POSIX Metadata Service Circa 2001

  8. And more: Quantum Key Distribution products, IBM Photostore, Hydra – the first Storage Area Network

  9. Extreme HPC Background

  10. Simple View of Our Computing Environment

  11. Current Largest Machine, Trinity: • Haswell and KNL • 20,000 nodes • Few million cores • 2 PByte DRAM • 4 PByte NAND burst buffer, ~4 TByte/sec • 100 PByte scratch PMR disk file system, ~1.2 TByte/sec • 30 PByte/year sitewide SMR disk campaign store, ~1 GByte/sec/PByte (30 GByte/sec currently) • 60 PByte sitewide parallel tape archive, ~3 GByte/sec
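
     A rough way to see why each tier's bandwidth is matched to the tier above it is to work out how long a full drain would take at the quoted rates. Below is a minimal sketch using the Trinity figures from this slide; the sizes and bandwidths come from the slide, while the simple "full drain" framing and the Python code are illustrative only:

        # Rough drain-time arithmetic for the Trinity storage hierarchy.
        # Capacities and bandwidths are the figures quoted on the slide;
        # treating each step as a simple full drain is an illustrative assumption.
        tiers = [
            # (name, capacity in bytes, bandwidth in bytes/sec)
            ("DRAM -> burst buffer",    2e15,   4e12),    # 2 PB at ~4 TB/s
            ("Burst buffer -> scratch", 4e15,   1.2e12),  # 4 PB at ~1.2 TB/s
            ("Scratch -> campaign",     100e15, 30e9),    # 100 PB at ~30 GB/s
            ("Campaign -> archive",     30e15,  3e9),     # ~30 PB/yr at ~3 GB/s
        ]
        for name, capacity, bandwidth in tiers:
            hours = capacity / bandwidth / 3600
            print(f"{name:26s} full drain ~ {hours:8.1f} hours")

     The pattern is that each step down trades bandwidth for capacity and cost, which is the recurring theme of the rest of the deck.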

  12. A Not So Simple Picture of Our Environment: • 30-60 MW • Single machines in the 10k-node and >18 MW range • Single jobs that run across 1M cores for months • Soccer fields of gear in 3 buildings • 20 semis of gear this summer alone • (Photo: pipes for Trinity cooling)

  13. HPC Storage Area Network Circa 2011: today the high end is a few TByte/sec. The current Storage Area Network is a few TBytes/sec, mostly IB, some 40/100GE.

  14. HPC IO Patterns: • Million files inserted into a single directory at the same time • Millions of writers into the same file at the same time • Jobs from 1 core to N million cores • Files from 0 bytes to N PBytes • Workflows from hours to a year (yes, a year on a million cores using a PB of DRAM)
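
     To make the "millions of writers into the same file" pattern concrete, here is a minimal sketch of an N-to-1 strided checkpoint write in which each rank computes a non-overlapping offset; the function and file names are illustrative, not any particular LANL code:

        import os

        def write_rank_block(path, rank, payload):
            # Each rank writes its fixed-size block at offset rank * len(payload).
            # Fixed-size, non-overlapping offsets give the classic N-to-1 "strided"
            # pattern that stresses parallel file system locking and metadata.
            fd = os.open(path, os.O_CREAT | os.O_WRONLY)
            try:
                os.pwrite(fd, payload, rank * len(payload))
            finally:
                os.close(fd)

        # Single-node stand-in for what N processes would do concurrently:
        for rank in range(4):
            write_rank_block("checkpoint.dat", rank, bytes([rank]) * 1024)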

  15. Because Non-Compute Costs Are Rising as a Share of TCO, Workflows Are Necessary to Specify

  16. Workflow Taxonomy from the APEX Procurement: A Simulation Pipeline

  17. Enough with the HPC background. How about some modern Storage Economics?

  18. Economics Have Shaped Our World: The Beginning of Storage Layer Proliferation, 2009 • Economic modeling for a large burst of data from memory shows bandwidth/capacity better matched for solid state storage near the compute nodes • Economic modeling for archive shows bandwidth/capacity better matched for disk • (Chart: hdwr/media cost, 3 mem/mo, 10% FS; projected spend from $0 to $25,000,000 per year over 2012-2025, broken out by new servers, new disk, new cartridges, new drives, and new robots)
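
     A minimal sketch of the kind of bandwidth-versus-capacity comparison behind this modeling: how many devices, and how many dollars, does it take to absorb a memory-sized burst in a fixed time with disk versus flash? The prices, device specs, and 10-minute target below are made-up placeholders, not the actual inputs of the LANL model:

        # Hypothetical burst-tier sizing: buy enough devices to absorb a 2 PB
        # memory dump in ~10 minutes. All numbers below are illustrative.
        MEMORY_BYTES   = 2e15
        TARGET_SECONDS = 600

        media = {
            #        cost per device, capacity (GB), bandwidth (MB/s)
            "disk":  dict(cost=300.0,  cap_gb=8000, bw_mb_s=150),
            "flash": dict(cost=2000.0, cap_gb=2000, bw_mb_s=2000),
        }

        needed_bw_mb_s = MEMORY_BYTES / TARGET_SECONDS / 1e6

        for name, m in media.items():
            devs_for_bw  = needed_bw_mb_s / m["bw_mb_s"]
            devs_for_cap = (MEMORY_BYTES / 1e9) / m["cap_gb"]
            devices = max(devs_for_bw, devs_for_cap)
            driver  = "bandwidth" if devs_for_bw > devs_for_cap else "capacity"
            print(f"{name:5s}: ~{devices:6.0f} devices ({driver}-driven), "
                  f"~${devices * m['cost']:,.0f}")

     With numbers like these the burst tier is bandwidth-driven, which is why it lands on solid state, while archive sizing is capacity-driven, which is why it stays on disk and tape.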

  19. What are all these storage layers? Why do we need all these storage layers?
      • HPC before Trinity: Memory (DRAM) -> Parallel File System (Lustre) -> Archive (HPSS parallel tape)
      • HPC after Trinity:
        - Memory, 1-2 PB/sec, residence hours, overwritten continuously
        - Burst Buffer, 4-6 TB/sec, residence hours, overwritten in hours
        - Parallel File System, 1-2 TB/sec, residence days/weeks, flushed in weeks
        - Campaign Storage, 100-300 GB/sec, residence months to a year, flushed in months to a year
        - Archive (parallel tape), 10s of GB/sec, residence forever
      • Why BB: Economics (disk not scalable enough, PFS metadata bw/iops too expensive/difficult, Archive metadata too slow)
      • Why Campaign: Economics (PFS RAID too expensive, PFS solution too rich in function, PFS designed for scratch use not years of residency, Archive BW too expensive)

  20. The Hoopla Parade, circa 2014: DataWarp

  21. Isn’t that too many layers just for storage?
      • Tiers (diagram courtesy of John Bent, EMC): Memory -> Burst Buffer (IOPS/BW tier) -> Parallel File System (PFS) -> Campaign Storage (capacity tier) -> Archive
      • Factoids (times are changing!): LANL HPSS = 53 PB and 543 M files; Trinity has 2 PB memory, 4 PB flash (11% of HPSS) and 80 PB PFS (150% of HPSS); Crossroads may have 5-10 PB memory and 40 PB solid state (100% of the capacity of HPSS). We would never have contemplated more in-system storage than our archive a few years ago.
      • If the Burst Buffer does its job very well (and indications are the capacity of in-system NV will grow radically) and campaign storage works out well (leveraging cloud), do we need a parallel file system anymore, or an archive?
      • Maybe just a bw/iops tier and a capacity tier.
      • Too soon to say, but it seems feasible longer term.

  22. I doubt this movement to solid state for BW/IOPS (hot/warm) and SMR/HAMR/etc. capacity-oriented disk for capacity (cool/cold) is unique to HPC. OK – we need a capacity tier. Campaign Storage: billions of files per directory, trillions of files total, files from 1 byte to 100 PB, multiple writers into one file. What now?

  23. Won’t cloud technology provide the capacity solution? • Erasure to utilize low cost hardware • Object to enable massive scale • Simple-minded interface: get, put, delete • Problem solved -- NOT • Works great for apps that are newly written to use this interface • Doesn’t work well for people; people need folders and rename and … • Doesn’t work for the $trillions of apps out there that expect some modest name space capability (parts of POSIX)
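
     For contrast, here is a minimal sketch of what a flat get/put/delete object interface provides versus what POSIX-expecting applications assume; the class and method names are illustrative, not any particular vendor's API:

        class FlatObjectStore:
            """Illustrative flat key/value object store: no directories,
            no rename, no partial overwrite - whole-object get/put/delete only."""
            def __init__(self):
                self._objects = {}
            def put(self, key, data):
                self._objects[key] = data      # whole-object write only
            def get(self, key):
                return self._objects[key]
            def delete(self, key):
                del self._objects[key]

        store = FlatObjectStore()
        store.put("proj/run42/out.dat", b"results")
        # What POSIX applications also expect, and a flat object store lacks:
        #   rename("proj/run42", "proj/final")  - atomic directory rename
        #   open(..., "r+"); seek(); write()    - update in place at an offset
        #   listdir("proj/")                    - real directory semantics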

  24. Enter MarFS: The Sea of Data

  25. How about a Scalable Near-POSIX Name Space over Cloud-style Object Erasure? • Best of both worlds • Object systems: provide massive scaling and efficient erasure techniques; friendly to applications, not to people (people need a name space); huge economic appeal (erasure enables use of inexpensive storage) • POSIX name space is powerful but has issues scaling • The challenges: mismatch of POSIX and object metadata, security, read/write semantics, efficient object/file sizes; no update in place with objects; how do we scale a POSIX name space to trillions of files/directories?
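
     One common way to bridge this mismatch is to keep the directory tree in an ordinary POSIX metadata file system whose files are zero-length stubs, and keep the bytes in the object/erasure tier, linked through an extended attribute. The sketch below shows that general idea only; the paths, xattr name, and helper functions are hypothetical and are not MarFS's actual on-disk or xattr format:

        import os, uuid

        METADATA_ROOT = "/mnt/mdfs"            # hypothetical POSIX tree, stubs only

        def create_file(object_store, user_path, data):
            obj_id = str(uuid.uuid4())
            object_store.put(obj_id, data)      # data lands in the object tier
            stub = os.path.join(METADATA_ROOT, user_path.lstrip("/"))
            os.makedirs(os.path.dirname(stub), exist_ok=True)
            open(stub, "w").close()             # zero-length stub carries the namespace
            os.setxattr(stub, "user.objid", obj_id.encode())   # Linux xattr call

        def read_file(object_store, user_path):
            stub = os.path.join(METADATA_ROOT, user_path.lstrip("/"))
            obj_id = os.getxattr(stub, "user.objid").decode()
            return object_store.get(obj_id)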

  26. Won’t someone else do it, PLEASE? • There is evidence others see the need, but no magic bullets yet (partial list): • Cleversafe/Scality/EMC ViPR/Ceph/Swift etc. are attempting multi-personality data lakes over erasure objects; all are young and assume update in place for POSIX • GlusterFS is probably the closest thing to MarFS. Gluster is aimed more at the enterprise and midrange HPC and less at extreme HPC. GlusterFS is a way to unify file and object systems; MarFS is another, aiming at different uses • Ceph moved away from file-system-like access at the time of this analysis • General Atomics Nirvana and Storage Resource Broker/IRODS are optimized for WAN and HSM metadata rates. There are some capabilities for putting POSIX files over objects, but these methods are largely via NFS or other methods that try to mimic full file system semantics including update in place. These methods are not designed for massive parallelism in a single file, etc. • EMC Maginatics, but it is in its infancy and isn’t a full solution to our problem yet • Camlistore appears to be targeted at personal storage • Bridgestore is a POSIX name space over objects, but they put their metadata in a flat space so renaming a directory is painful • Avere over objects is focused on NFS, so N-to-1 is a non-starter • HPSS or SamQFS or a classic HSM? Metadata rate designs are way too low • HDFS metadata doesn’t scale well

  27. MarFS: What it is • 100-1000 GB/sec, exabytes, a billion files in a directory, trillions of files total • Near-POSIX global scalable name space over many POSIX and non-POSIX data repositories (scalable object systems - CDMI, S3, etc.) • (Scality, EMC ECS, all the way to simple erasure over ZFSs) • It is a small amount of code (C/C++/scripts): a small Linux FUSE component, a pretty small parallel batch copy/sync/compare utility, and a moderate-sized library that both the FUSE component and the batch utilities call • Data movement scales just like many scalable object systems • Metadata scales like NxM POSIX name spaces, both across the tree and within a single directory • It is friendly to object systems by spreading very large files across many objects and packing many small files into one large data object. What it isn’t • No update in place! It’s not a pure file system; overwrites are fine, but no seeking and writing.
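
     To illustrate the two object-friendly layouts listed above (spreading very large files across many objects, and packing many small files into one large object), here is a minimal sketch; the chunk size, key names, and record format are illustrative, not MarFS's actual layout:

        CHUNK_SIZE = 256 * 1024 * 1024   # illustrative object size target

        def chunk_large_file(file_id, size):
            """Spread one very large file across many fixed-size objects."""
            return [
                {"object_key": f"{file_id}.chunk.{i}",
                 "file_offset": off,
                 "length": min(CHUNK_SIZE, size - off)}
                for i, off in enumerate(range(0, size, CHUNK_SIZE))
            ]

        def pack_small_files(pack_id, files):
            """Pack many small files into one object; each file keeps an
            (object_key, offset, length) record in its metadata stub."""
            records, offset = [], 0
            for name, size in files:
                records.append({"name": name, "object_key": pack_id,
                                "offset": offset, "length": size})
                offset += size
            return records

        # A 1 TB file becomes a few thousand chunk objects; 10,000 4 KB files
        # become one ~40 MB packed object plus 10,000 small metadata records.
        print(len(chunk_large_file("bigfile", 10**12)))
        print(len(pack_small_files("pack0001", [(f"f{i}", 4096) for i in range(10000)])))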

  28. MarFS Scaling • Scaling test on our retired Cielo machine, striping across 1 to X object repos: 835M file inserts/sec, stat of a single file in < 1 millisecond, > 1 trillion files in the same directory
