
Storage Lessons from HPC: Extreme Scale Computing Driving Economical Storage Solutions into Your IT Environments - PowerPoint PPT Presentation



  1. Storage Lessons from HPC: Extreme Scale Computing Driving Economical Storage Solutions into Your IT Environments Gary Grider HPC Division Leader, LANL/US DOE Mar 2017 LA-UR-16-26379

  2. Los Alamos

  3. Eight Decades of Production Weapons Computing to Keep the Nation Safe: Maniac, IBM Stretch, CDC, Cray 1, Cray X/Y, CM-2, CM-5, SGI Blue Mountain, DEC/HP Q, IBM Cell Roadrunner, Cray XE Cielo, Ising DWave, Cray Intel KNL Trinity, Crossroads

  4. LANL HPC History Project (50k artifacts), joint work with the U Minn Babbage Institute

  5. Some Products You May Not Realize Were Either Funded or Heavily Influenced by DOE HPC: DataWarp

  6. Ceph begins

  7. The Promise of Parallel POSIX Metadata Service Circa 2001

  8. And more: Quantum Key Distribution products, IBM Photostore, Hydra – the first Storage Area Network

  9. Extreme HPC Background

  10. Simple View of Our Computing Environment

  11. Current Largest Machine, Trinity: • Haswell and KNL • 20,000 nodes • Few million cores • 2 PByte DRAM • 4 PByte NAND burst buffer, ~4 TByte/sec • 100 PByte scratch PMR disk file system, ~1.2 TByte/sec • 30 PByte/year sitewide SMR disk campaign store, ~1 GByte/sec/PByte (30 GByte/sec currently) • 60 PByte sitewide parallel tape archive, ~3 GByte/sec
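
     A rough way to see why each tier's bandwidth is matched to the tier above it is to work out how long a full drain would take at the quoted rates. Below is a minimal sketch using the Trinity figures from this slide; the sizes and bandwidths come from the slide, while the simple "full drain" framing and the Python code are illustrative only:

        # Rough drain-time arithmetic for the Trinity storage hierarchy.
        # Capacities and bandwidths are the figures quoted on the slide;
        # treating each step as a simple full drain is an illustrative assumption.
        tiers = [
            # (name, capacity in bytes, bandwidth in bytes/sec)
            ("DRAM -> burst buffer",    2e15,   4e12),    # 2 PB at ~4 TB/s
            ("Burst buffer -> scratch", 4e15,   1.2e12),  # 4 PB at ~1.2 TB/s
            ("Scratch -> campaign",     100e15, 30e9),    # 100 PB at ~30 GB/s
            ("Campaign -> archive",     30e15,  3e9),     # ~30 PB/yr at ~3 GB/s
        ]
        for name, capacity, bandwidth in tiers:
            hours = capacity / bandwidth / 3600
            print(f"{name:26s} full drain ~ {hours:8.1f} hours")

     The pattern is that each step down trades bandwidth for capacity and cost, which is the recurring theme of the rest of the deck.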

  12. A Not So Simple Picture of Our Environment: • 30-60 MW • Single machines in the 10k-node and >18 MW range • Single jobs that run across 1M cores for months • Soccer fields of gear in 3 buildings • 20 semis of gear this summer alone • (Photo: pipes for Trinity cooling)

  13. HPC Storage Area Network Circa 2011: today the high end is a few TByte/sec. The current Storage Area Network is a few TBytes/sec, mostly IB, some 40/100GE.

  14. HPC IO Patterns: • Million files inserted into a single directory at the same time • Millions of writers into the same file at the same time • Jobs from 1 core to N million cores • Files from 0 bytes to N PBytes • Workflows from hours to a year (yes, a year on a million cores using a PB of DRAM)
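
     To make the "millions of writers into the same file" pattern concrete, here is a minimal sketch of an N-to-1 strided checkpoint write in which each rank computes a non-overlapping offset; the function and file names are illustrative, not any particular LANL code:

        import os

        def write_rank_block(path, rank, payload):
            # Each rank writes its fixed-size block at offset rank * len(payload).
            # Fixed-size, non-overlapping offsets give the classic N-to-1 "strided"
            # pattern that stresses parallel file system locking and metadata.
            fd = os.open(path, os.O_CREAT | os.O_WRONLY)
            try:
                os.pwrite(fd, payload, rank * len(payload))
            finally:
                os.close(fd)

        # Single-node stand-in for what N processes would do concurrently:
        for rank in range(4):
            write_rank_block("checkpoint.dat", rank, bytes([rank]) * 1024)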

  15. Because Non-Compute Costs Are Rising as a Share of TCO, Workflows Are Necessary to Specify

  16. Workflow Taxonomy from the APEX Procurement: A Simulation Pipeline

  17. Enough with the HPC background. How about some modern Storage Economics?

  18. Economics Have Shaped Our World: The Beginning of Storage Layer Proliferation, 2009 • Economic modeling for a large burst of data from memory shows bandwidth/capacity better matched for solid state storage near the compute nodes • Economic modeling for archive shows bandwidth/capacity better matched for disk • (Chart: hdwr/media cost, 3 mem/mo, 10% FS; projected spend from $0 to $25,000,000 per year over 2012-2025, broken out by new servers, new disk, new cartridges, new drives, and new robots)
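
     A minimal sketch of the kind of bandwidth-versus-capacity comparison behind this modeling: how many devices, and how many dollars, does it take to absorb a memory-sized burst in a fixed time with disk versus flash? The prices, device specs, and 10-minute target below are made-up placeholders, not the actual inputs of the LANL model:

        # Hypothetical burst-tier sizing: buy enough devices to absorb a 2 PB
        # memory dump in ~10 minutes. All numbers below are illustrative.
        MEMORY_BYTES   = 2e15
        TARGET_SECONDS = 600

        media = {
            #        cost per device, capacity (GB), bandwidth (MB/s)
            "disk":  dict(cost=300.0,  cap_gb=8000, bw_mb_s=150),
            "flash": dict(cost=2000.0, cap_gb=2000, bw_mb_s=2000),
        }

        needed_bw_mb_s = MEMORY_BYTES / TARGET_SECONDS / 1e6

        for name, m in media.items():
            devs_for_bw  = needed_bw_mb_s / m["bw_mb_s"]
            devs_for_cap = (MEMORY_BYTES / 1e9) / m["cap_gb"]
            devices = max(devs_for_bw, devs_for_cap)
            driver  = "bandwidth" if devs_for_bw > devs_for_cap else "capacity"
            print(f"{name:5s}: ~{devices:6.0f} devices ({driver}-driven), "
                  f"~${devices * m['cost']:,.0f}")

     With numbers like these the burst tier is bandwidth-driven, which is why it lands on solid state, while archive sizing is capacity-driven, which is why it stays on disk and tape.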

  19. What are all these storage layers? Why do we need all these storage layers?
      • HPC before Trinity: Memory (DRAM) -> Parallel File System (Lustre) -> Archive (HPSS parallel tape)
      • HPC after Trinity:
        - Memory, 1-2 PB/sec, residence hours, overwritten continuously
        - Burst Buffer, 4-6 TB/sec, residence hours, overwritten in hours
        - Parallel File System, 1-2 TB/sec, residence days/weeks, flushed in weeks
        - Campaign Storage, 100-300 GB/sec, residence months to a year, flushed in months to a year
        - Archive (parallel tape), 10s of GB/sec, residence forever
      • Why BB: Economics (disk not scalable enough, PFS metadata bw/iops too expensive/difficult, Archive metadata too slow)
      • Why Campaign: Economics (PFS RAID too expensive, PFS solution too rich in function, PFS designed for scratch use not years of residency, Archive BW too expensive)

  20. The Hoopla Parade, circa 2014: DataWarp

  21. Isn’t that too many layers just for storage?
      • Tiers (diagram courtesy of John Bent, EMC): Memory -> Burst Buffer (IOPS/BW tier) -> Parallel File System (PFS) -> Campaign Storage (capacity tier) -> Archive
      • Factoids (times are changing!): LANL HPSS = 53 PB and 543 M files; Trinity has 2 PB memory, 4 PB flash (11% of HPSS) and 80 PB PFS (150% of HPSS); Crossroads may have 5-10 PB memory and 40 PB solid state (100% of the capacity of HPSS). We would never have contemplated more in-system storage than our archive a few years ago.
      • If the Burst Buffer does its job very well (and indications are the capacity of in-system NV will grow radically) and campaign storage works out well (leveraging cloud), do we need a parallel file system anymore, or an archive?
      • Maybe just a bw/iops tier and a capacity tier.
      • Too soon to say, but it seems feasible longer term.

  22. I doubt this movement to solid state for BW/IOPS (hot/warm) and SMR/HAMR/etc. capacity-oriented disk for capacity (cool/cold) is unique to HPC. OK – we need a capacity tier. Campaign Storage: billions of files per directory, trillions of files total, files from 1 byte to 100 PB, multiple writers into one file. What now?

  23. Won’t cloud technology provide the capacity solution? • Erasure to utilize low cost hardware • Object to enable massive scale • Simple-minded interface: get, put, delete • Problem solved -- NOT • Works great for apps that are newly written to use this interface • Doesn’t work well for people; people need folders and rename and … • Doesn’t work for the $trillions of apps out there that expect some modest name space capability (parts of POSIX)
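
     For contrast, here is a minimal sketch of what a flat get/put/delete object interface provides versus what POSIX-expecting applications assume; the class and method names are illustrative, not any particular vendor's API:

        class FlatObjectStore:
            """Illustrative flat key/value object store: no directories,
            no rename, no partial overwrite - whole-object get/put/delete only."""
            def __init__(self):
                self._objects = {}
            def put(self, key, data):
                self._objects[key] = data      # whole-object write only
            def get(self, key):
                return self._objects[key]
            def delete(self, key):
                del self._objects[key]

        store = FlatObjectStore()
        store.put("proj/run42/out.dat", b"results")
        # What POSIX applications also expect, and a flat object store lacks:
        #   rename("proj/run42", "proj/final")  - atomic directory rename
        #   open(..., "r+"); seek(); write()    - update in place at an offset
        #   listdir("proj/")                    - real directory semantics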

  24. Enter MarFS: The Sea of Data

  25. How about a Scalable Near-POSIX Name Space over Cloud-style Object Erasure? • Best of both worlds • Object systems: provide massive scaling and efficient erasure techniques; friendly to applications, not to people (people need a name space); huge economic appeal (erasure enables use of inexpensive storage) • POSIX name space is powerful but has issues scaling • The challenges: mismatch of POSIX and object metadata, security, read/write semantics, efficient object/file sizes; no update in place with objects; how do we scale a POSIX name space to trillions of files/directories?
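
     One common way to bridge this mismatch is to keep the directory tree in an ordinary POSIX metadata file system whose files are zero-length stubs, and keep the bytes in the object/erasure tier, linked through an extended attribute. The sketch below shows that general idea only; the paths, xattr name, and helper functions are hypothetical and are not MarFS's actual on-disk or xattr format:

        import os, uuid

        METADATA_ROOT = "/mnt/mdfs"            # hypothetical POSIX tree, stubs only

        def create_file(object_store, user_path, data):
            obj_id = str(uuid.uuid4())
            object_store.put(obj_id, data)      # data lands in the object tier
            stub = os.path.join(METADATA_ROOT, user_path.lstrip("/"))
            os.makedirs(os.path.dirname(stub), exist_ok=True)
            open(stub, "w").close()             # zero-length stub carries the namespace
            os.setxattr(stub, "user.objid", obj_id.encode())   # Linux xattr call

        def read_file(object_store, user_path):
            stub = os.path.join(METADATA_ROOT, user_path.lstrip("/"))
            obj_id = os.getxattr(stub, "user.objid").decode()
            return object_store.get(obj_id)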

  26. Won’t someone else do it, PLEASE? • There is evidence others see the need, but no magic bullets yet (partial list): • Cleversafe/Scality/EMC ViPR/Ceph/Swift etc. are attempting multi-personality data lakes over erasure objects; all are young and assume update in place for POSIX • GlusterFS is probably the closest thing to MarFS. Gluster is aimed more at the enterprise and midrange HPC and less at extreme HPC. GlusterFS is a way to unify file and object systems; MarFS is another, aiming at different uses • Ceph moved away from file-system-like access at the time of this analysis • General Atomics Nirvana and Storage Resource Broker/IRODS are optimized for WAN and HSM metadata rates. There are some capabilities for putting POSIX files over objects, but these methods are largely via NFS or other methods that try to mimic full file system semantics including update in place. These methods are not designed for massive parallelism in a single file, etc. • EMC Maginatics, but it is in its infancy and isn’t a full solution to our problem yet • Camlistore appears to be targeted at personal storage • Bridgestore is a POSIX name space over objects, but they put their metadata in a flat space so renaming a directory is painful • Avere over objects is focused on NFS, so N-to-1 is a non-starter • HPSS or SamQFS or a classic HSM? Metadata rate designs are way too low • HDFS metadata doesn’t scale well

  27. MarFS: What it is • 100-1000 GB/sec, exabytes, a billion files in a directory, trillions of files total • Near-POSIX global scalable name space over many POSIX and non-POSIX data repositories (scalable object systems - CDMI, S3, etc.) • (Scality, EMC ECS, all the way to simple erasure over ZFSs) • It is a small amount of code (C/C++/scripts): a small Linux FUSE component, a pretty small parallel batch copy/sync/compare utility, and a moderate-sized library that both the FUSE component and the batch utilities call • Data movement scales just like many scalable object systems • Metadata scales like NxM POSIX name spaces, both across the tree and within a single directory • It is friendly to object systems by spreading very large files across many objects and packing many small files into one large data object. What it isn’t • No update in place! It’s not a pure file system; overwrites are fine, but no seeking and writing.
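
     To illustrate the two object-friendly layouts listed above (spreading very large files across many objects, and packing many small files into one large object), here is a minimal sketch; the chunk size, key names, and record format are illustrative, not MarFS's actual layout:

        CHUNK_SIZE = 256 * 1024 * 1024   # illustrative object size target

        def chunk_large_file(file_id, size):
            """Spread one very large file across many fixed-size objects."""
            return [
                {"object_key": f"{file_id}.chunk.{i}",
                 "file_offset": off,
                 "length": min(CHUNK_SIZE, size - off)}
                for i, off in enumerate(range(0, size, CHUNK_SIZE))
            ]

        def pack_small_files(pack_id, files):
            """Pack many small files into one object; each file keeps an
            (object_key, offset, length) record in its metadata stub."""
            records, offset = [], 0
            for name, size in files:
                records.append({"name": name, "object_key": pack_id,
                                "offset": offset, "length": size})
                offset += size
            return records

        # A 1 TB file becomes a few thousand chunk objects; 10,000 4 KB files
        # become one ~40 MB packed object plus 10,000 small metadata records.
        print(len(chunk_large_file("bigfile", 10**12)))
        print(len(pack_small_files("pack0001", [(f"f{i}", 4096) for i in range(10000)])))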

  28. MarFS Scaling • Scaling test on our retired Cielo machine, striping across 1 to X object repos: 835M file inserts/sec, stat of a single file in < 1 millisecond, > 1 trillion files in the same directory
