DVS, GPFS and External Lustre at NERSC – How It's Working on Hopper
Tina Butler, Rei Chi Lee, Gregory Butler
CUG 2011, 05/25/11
NERSC is the Primary Computing Center for the DOE Office of Science
• NERSC serves a large population: approximately 3,000 users, 400 projects, 500 codes
• Focus on "unique" resources
  – Expert consulting and other services
  – High-end computing and storage systems
• NERSC is known for excellent services and a diverse workload
[Chart: 2010 allocations by science area – Physics, Math + CS, Astrophysics, Chemistry, Climate, Combustion, Fusion, Lattice Gauge, Life Sciences, Materials, Other]
NERSC Systems
Large-Scale Computing Systems
• Franklin (NERSC-5): Cray XT4
  – 9,532 compute nodes; 38,128 cores
  – ~25 Tflop/s on applications; 356 Tflop/s peak
• Hopper (NERSC-6): Cray XE6
  – Phase 1: Cray XT5, 668 nodes, 5,344 cores
  – Phase 2: Cray XE6, 6,384 nodes, 153,216 cores; 1.28 Pflop/s peak
Clusters (140 Tflop/s total)
• Carver: IBM iDataplex cluster
• PDSF (HEP/NP): ~1K-core throughput cluster
• Magellan: cloud testbed, IBM iDataplex cluster
• Dirac: GPU testbed (48 nodes)
• GenePool (JGI): ~5K-core throughput cluster
• Euclid: analytics (512 GB shared memory)
NERSC Global Filesystem (NGF)
• Uses IBM's GPFS
• 1.5 PB capacity
• 10 GB/s of bandwidth
HPSS Archival Storage
• 40 PB capacity
• 4 tape libraries
• 150 TB disk cache
Lots of users, multiple systems, lots of data
• By the end of the 1990s it was becoming increasingly clear that data management was a huge issue.
• Users were generating larger and larger data sets and copying their data to multiple systems for pre- and post-processing.
• This wasted both time and storage space.
• NERSC needed to help users be more productive.
Global Unified Parallel Filesystem
• In 2001 NERSC began the GUPFS project, with the goals of:
  – High performance
  – High reliability
  – High scalability
  – A center-wide shared namespace
• Assess emerging storage, fabric, and filesystem technology
• Deploy across all production systems
NERSC Global Filesystem (NGF)
• First production in 2005, using GPFS
  – Multi-cluster support
  – Shared namespace
  – Separate data and metadata partitions
  – Shared lock manager
  – Filesystems served over Fibre Channel and Ethernet
  – Partitioned server space through private NSDs
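The multi-cluster support behind the shared namespace relies on GPFS remote cluster mounts: a client system registers the cluster that owns the filesystem and then mounts it under a local device name. The sketch below is a minimal illustration of that flow driven from Python, not NERSC's actual procedure; the cluster name, contact nodes, key file, device and mount point are hypothetical, and the exact mm-command flags should be checked against the GPFS release in use.

```python
# Minimal sketch of registering and mounting a remote GPFS filesystem on a
# client cluster, the mechanism behind NGF's multi-cluster shared namespace.
# Cluster name, contact nodes, key file, device and mount point below are
# hypothetical; flag spellings may differ between GPFS releases.
import subprocess

def run(cmd):
    """Echo and run a GPFS administration command, stopping on failure."""
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

# 1. Register the owning (storage) cluster and its public key on this cluster.
run(["mmremotecluster", "add", "ngf.example.gov",
     "-n", "ngfserver1,ngfserver2", "-k", "/var/mmfs/ssl/ngf_public.pub"])

# 2. Define the remote filesystem under a local device name and mount point.
run(["mmremotefs", "add", "project", "-f", "project",
     "-C", "ngf.example.gov", "-T", "/project"])

# 3. Mount it into the shared namespace.
run(["mmmount", "project"])
```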
NERSC Global Filesystem (NGF)
[Diagram: NGF architecture – NGF servers and NGF disk on a SAN, with an Ethernet network for control traffic; Franklin attaches through its own SAN; private NSD (pNSD) servers connect Hopper, Carver/Dirac/Magellan, and PDSF/Planck over InfiniBand; Euclid, the Hopper external login nodes, and the Hopper external filesystem are also shown.]
NGF Configuration
• NSD servers are commodity hardware
  – 28 core NSD servers
  – 26 private NSD servers
    • 8 for Hopper; 14 for Carver; 8 for Planck (PDSF)
• Storage is heterogeneous
  – DDN 9900 for data LUNs
  – HDS 2300 for data and metadata LUNs
  – Have also used Engenio and Sun
• Fabric is heterogeneous
  – FC-8 and 10 GbE for data transport
  – Ethernet for control/metadata traffic
NGF Filesystems
• Collaborative – /project
  – 873 TB, ~12 GB/s, served over FC-8
  – 4 DDN 9900s
• Scratch – /global/scratch
  – 873 TB, ~12 GB/s, served over FC-8
  – 4 DDN 9900s
• User homes – /global/u1, /global/u2
  – 40 TB, ~3–5 GB/s, served over Ethernet
  – HDS 2300
• Common area – /global/common, syscommon
  – ~5 TB, ~3–5 GB/s, served over Ethernet
  – HDS 2300
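A quick way to cross-check the capacities listed above from any client that mounts NGF is a small statvfs report like the sketch below (a generic script, not a NERSC tool; the mount points are the ones named on this slide).

```python
# Report capacity and usage for the NGF mount points listed above.
# Generic sketch, not a NERSC utility; run on any client that mounts them.
import os

NGF_MOUNTS = ["/project", "/global/scratch",
              "/global/u1", "/global/u2", "/global/common"]

def report(path):
    st = os.statvfs(path)                    # POSIX filesystem statistics
    frag = st.f_frsize
    total = st.f_blocks * frag
    used = (st.f_blocks - st.f_bfree) * frag
    tib = 1024 ** 4
    print(f"{path:16s} {total / tib:8.1f} TiB total  {used / tib:8.1f} TiB used")

for mount in NGF_MOUNTS:
    if os.path.ismount(mount):
        report(mount)
    else:
        print(f"{mount:16s} not mounted on this host")
```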
NGF /project
[Diagram: /project – 870 TB at ~12 GB/s, with a 730 TB increase planned for July 2011. Mounted by Franklin (via DVS over SeaStar), Hopper (via DVS, pNSDs and IB over Gemini), the DTNs, Euclid, Dirac and the SGNs (via GPFS servers), and PDSF/Planck, Carver and Magellan (via pNSDs over IB). Fibre Channel links to the storage: 8x4xFC8, 20x2xFC4, 12x4xFC8, 4x2xFC4 and 2x2xFC4.]
NGF global scratch
[Diagram: /global/scratch – 870 TB at ~12 GB/s, no capacity increase planned. Mounted by Franklin, Hopper (via DVS, pNSDs and IB over Gemini), the DTNs, Euclid, Dirac and the SGNs, and PDSF/Planck, Carver and Magellan (via pNSDs over IB). Fibre Channel links to the storage: 8x4xFC8, 12x4xFC8, 4x2xFC4 and 2x2xFC4.]
NGF global homes
[Diagram: /global/homes – 40 TB, with a 40 TB increase planned for July 2011. Served by GPFS servers over Ethernet (4x1-10Gb) to Franklin, Hopper, the DTNs, Euclid, Dirac, the SGNs, Carver, Magellan, PDSF and Planck.]
Hopper Configuration
[Diagram: Hopper I/O configuration – the main system hosts DVS, DVS/DSL, LNET router and MOM nodes plus 8 pNSD servers connecting to NGF GPFS storage and metadata; a NERSC 10 GbE LAN links to HPSS; 12 external login servers, 2 MDS, 4 esDM servers and 52 OSSes sit on a QDR IB switch fabric; an LSI 3992 (RAID 1+0 with spares) and, through an FC switch fabric, 26 storage arrays of 12 LUNs each provide the external disk.]
DVS on Hopper
• 16 DVS servers for the NGF filesystems
  – IB-connected to the private NSD servers
  – GPFS remote cluster serving compute and MOM nodes
  – 2 DVS nodes dedicated to the MOMs
  – Cluster-parallel mode
• 32 DVS DSL servers on repurposed compute nodes
  – Loadbalanced for the shared root
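Because DVS projections appear as ordinary mounts of filesystem type "dvs" on the Cray nodes, the server list and mode (cluster parallel versus loadbalanced) can be inspected straight from the mount table. The sketch below is a generic illustration and assumes that DVS mounts and their options are visible in /proc/mounts, as on CLE systems of this era.

```python
# List DVS-projected filesystems on a Cray node by scanning /proc/mounts.
# Generic sketch; assumes DVS mounts show up with filesystem type "dvs"
# and carry their server list / mode in the mount options.
def dvs_mounts(mounts_file="/proc/mounts"):
    entries = []
    with open(mounts_file) as f:
        for line in f:
            fields = line.split()
            if len(fields) >= 4 and fields[2] == "dvs":
                mountpoint, options = fields[1], fields[3].split(",")
                entries.append((mountpoint, options))
    return entries

if __name__ == "__main__":
    for mountpoint, options in dvs_mounts():
        print(mountpoint)
        for opt in options:
            # Options of interest include the DVS server nodes and whether
            # the projection is cluster-parallel or loadbalanced.
            print("   ", opt)
```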
pNSD servers to /global/scratch (idle)
[Chart: aggregate write and read bandwidth (MB/s, scale roughly 9,000–11,000) versus number of I/O processes (1, 2, 4, 8, 16) at block sizes of 1 MB, 4 MB, 8 MB and 16 MB.]
pNSD servers to /global/scratch (busy)
[Chart: aggregate write and read bandwidth (MB/s, scale 0–9,000) versus number of I/O processes (1, 2, 4, 8, 16) at block sizes of 1 MB, 4 MB, 8 MB and 16 MB.]
DVS servers to /global/scratch (idle)
[Chart: aggregate write and read bandwidth (MB/s, scale roughly 9,200–11,400) versus number of I/O processes (1, 2, 4, 8) at a 4 MB block size.]
Hopper compute nodes to /global/scratch (idle)
[Chart: aggregate write and read bandwidth (MB/s, scale 0–12,000) versus number of I/O processes on packed nodes (24, 48, 96, 192, 384, 768, 3072).]
Hopper compute nodes to /global/scratch (busy)
[Chart: aggregate write and read bandwidth (MB/s, scale 0–8,000) versus number of I/O processes on packed nodes (24, 48, 96, 192, 384, 768, 1536, 3072).]
Hopper Filesystems
• External Lustre
  – 2 local scratch filesystems
  – 2+ PB of user storage
  – 70 GB/s aggregate bandwidth
• External nodes
  – 26 LSI 7900s
  – 52 OSSes with 6 OSTs per OSS
  – 4 MDS with failover
• 56 LNET routers
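How much of that 70 GB/s a single job can reach on the external Lustre scratch filesystems depends largely on how its files are striped across the OSTs. The sketch below shows one way to stripe an output file widely before a big parallel write; the path and stripe parameters are illustrative, and the lfs flag names (-c for stripe count, -S for stripe size, -s on older clients) should be checked against the installed Lustre version.

```python
# Stripe a Lustre output file across many OSTs before a large parallel write.
# Path and parameters are illustrative; lfs flag names vary across Lustre
# versions (-S vs. -s for stripe size), so check "lfs help setstripe".
import subprocess

def stripe_file(path, count=-1, size="4m"):
    """Create 'path' striped over 'count' OSTs (-1 = all) with 'size' stripes."""
    subprocess.run(["lfs", "setstripe", "-c", str(count), "-S", size, path],
                   check=True)
    # Print the resulting layout so the striping can be confirmed.
    subprocess.run(["lfs", "getstripe", path], check=True)

if __name__ == "__main__":
    stripe_file("/scratch/someuser/shared_output.dat")
```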
IOR, 2880 MPI Tasks, MPI-IO – Aggregate
[Chart: aggregate write and read bandwidth (MB/s, scale 0–60,000) for block sizes of 10000, 1000000 and 1048576.]
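The MPI-IO results above come from IOR writing a single shared file collectively from all tasks. A stripped-down version of that access pattern, written with mpi4py's MPI-IO bindings, is sketched below; it assumes mpi4py and NumPy are available on the system, and the file path and transfer sizes are illustrative rather than the actual IOR parameters used.

```python
# Minimal IOR-like shared-file test: every MPI rank writes its own blocks of
# a single shared file with collective MPI-IO, then rank 0 reports aggregate
# bandwidth.  Path and transfer sizes are illustrative.
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank, nprocs = comm.Get_rank(), comm.Get_size()

block = 4 * 1024 * 1024                  # 4 MiB per rank per transfer
nblocks = 64                             # 256 MiB written per rank in total
buf = np.full(block, rank % 256, dtype=np.uint8)

fh = MPI.File.Open(comm, "/global/scratch/someuser/ior_shared.dat",
                   MPI.MODE_CREATE | MPI.MODE_WRONLY)
comm.Barrier()
t0 = MPI.Wtime()
for i in range(nblocks):
    # Interleaved layout: iteration i writes one block per rank into the file.
    offset = (i * nprocs + rank) * block
    fh.Write_at_all(offset, buf)         # collective write, as in IOR's MPI-IO mode
fh.Close()
comm.Barrier()
elapsed = MPI.Wtime() - t0

if rank == 0:
    total_mb = nprocs * nblocks * block / 1e6
    print(f"{nprocs} ranks: {total_mb:.0f} MB in {elapsed:.2f} s "
          f"= {total_mb / elapsed:.0f} MB/s aggregate write")
```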
IOR, 2880 MPI Tasks, File Per Processor – Aggregate
[Chart: aggregate write and read bandwidth (MB/s, scale roughly 64,000–73,000) for block sizes of 10000, 1000000 and 1048576.]
Hopper compute nodes to /scratch (Lustre)
[Chart: aggregate write and read bandwidth (MB/s, scale 0–40,000) versus number of I/O processes on packed nodes (24, 48, 96, 192, 384, 768, 1536, 3072).]
Conclusions
• The mix of dedicated external Lustre and shared NGF filesystems works well for user workflows, with mostly good performance.
• Shared-file I/O is an issue for both Lustre and the DVS-served filesystems.
• Cray and NERSC are working together on DVS and shared-file I/O issues through the Center of Excellence.
Acknowledgments
This work was supported by the Director, Office of Science, Division of Mathematical, Information, and Computational Sciences of the U.S. Department of Energy under contract number DE-AC02-05CH11231. This research used resources of the National Energy Research Scientific Computing Center, which is supported by the Office of Science of the U.S. Department of Energy.