Sequoia and the Petascale Era

SCICOMP 15, May 20, 2009

Thomas Spelce
Development Environment Group
Lawrence Livermore National Laboratory, P.O. Box 808, Livermore, CA 94551

LLNL-PRES-411030. This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344.


The Advanced Simulation and Computing (ASC) Program delivers high-confidence prediction of weapons behavior

[Diagram: Integrated Codes (codes to predict safety and reliability), Physics and Engineering Models (models and understanding), and Verification and Validation, linked to NNSA Science Campaigns, Experiments, and legacy UGTs; experiments provide critical validation data.]

ASC integrates all of the science and engineering that makes stewardship successful


ASC pursued three classes of systems to cost-effectively meet current (and anticipate future) compute requirements

  • Capability systems ==> the most challenging integrated design calculations
    − More costly but proven
    − Production workload
  • Capacity systems ==> day-to-day work
    − Less costly, somewhat less reliable
    − Throughput for less demanding problems
  • Advanced Architectures ==> performance, power consumption, etc.
    − Targeted but demanding workload
    − Tomorrow’s mainstream solutions?

The “three curves” (Capability, Capacity and Advanced Architectures) approach has been successful in delivering good cost performance across the spectrum of need…

[Chart: performance vs. time, FY01 to FY05 and beyond, tracing the three curves across systems: Red, Blue, White, Q, Purple, MCR, Thunder, Peloton, TLCC (Juno), BlueGene/L, Roadrunner, and Sequoia. Annotations: original concept: develop capability; low-cost capacity; higher performance and lower power consumption; mainframes (RIP).]


Sequoia represents the largest increase in computational power ever delivered for NNSA Stockpile Stewardship

[Timeline, 1/06 through 12/12: Market Survey; CD0 Approved; CD1 Approved; Write RFP; Vendor Response; Selection; Contract Package; CD2/3 Approved; Sequoia contract award; Sequoia Plan Review; Sequoia Demo; Dawn Phase 1 and Phase 2 deliveries; Dawn LA; Dawn system acceptance; Dawn Early Science; Transition to Classified; Dawn GA; Sequoia Build Decision; Sequoia Parts Commit & Option; Sequoia Parts Build; Phased System Deliveries; Sequoia final system acceptance; Sequoia Early Science; Transition to Classified; Sequoia Operational Readiness; CD4 Approved. Sequoia planned lifetime: five years, through CY17.]


“Dawn speeds a man on his journey, and speeds him too in his work” ...Hesiod (~700 B.C.E.)

Dawn Specifications

  • IBM BG/P architecture
  • 36,864 compute nodes (500 TF)
  • 147,456 PPC 450 cores
  • 4 GB memory per node (147.5 TB total)
  • 128-to-1 compute-to-I/O node ratio (36,864/128 = 288 I/O nodes)
  • 288 10GbE links to the file system, matching the I/O node count

Dawn Installation

  • Feb 27th: final rack delivery
  • March 5th: 36-rack integration complete
  • March 15th-24th: Synthetic WorkLoad start
  • End of March: acceptance (planned)

ibm.com/systems/deepcomputing/bluegene/


DAWN (Sequoia Initial Delivery) packaging hierarchy:

  • Chip: 850 MHz PPC 450, 4 cores/4 threads, 13.6 GF/s peak, 8 MB EDRAM
  • Compute Card: one chip, 13.6 GF/s, 4.0 GB DDR2, 13.6 GB/s memory BW, 0.75 GB/s 3D torus BW
  • Node Card: 32 compute cards, 435 GF/s, 128 GB
  • Rack: 32 node cards, 14 TF/s, 4 TB, 36 kW
  • System: 36 racks, 0.5 PF/s, 144 TB, 1.3 MW, >8 day MTBF


DAWN Initial Delivery Infrastructure

[Diagram: the Dawn core (9 x 4 BG/P racks) connects to the LLNL Ethernet core via 288 10GbE and 144 1GbE links, with additional 1GbE and 10GbE links to SERVICE, LOGIN, HTC, and HMC nodes, FC4 links to local disk, and primary/backup management paths.]


Sequoia Target Architecture and Infrastructure

Production Operation FY12-FY17

  • 20 PF/s, 1.6 PB memory
  • 96 racks, 98,304 nodes
  • 1.6 M cores (1 GB/core); with 98,304 nodes, that works out to 16 cores per node
  • 50 PB Lustre file system
  • 6.0 MW power (160 times more efficient than Purple)
  • Will be used as a 2D ultra-res and 3D high-res Uncertainty Quantification (UQ) engine
  • Will be used for 3D science capability runs exploring key materials science problems


High performance material science simulations will contribute directly to ASC programmatic success

Six physics/materials science applications targeted for early implementation on Sequoia infrastructure

  • Qbox – quantum molecular dynamics for determination of material equation of state
  • DDCMD – molecular dynamics for material dynamics
  • Miranda – 3D continuum fluid dynamics for interfacial mixing
  • ALE3D – 3D continuum mechanics for ignition and detonation propagation of explosives
  • LAMMPS – molecular dynamics for shock initiation in high explosives
  • ParaDiS – dislocation dynamics for high-pressure strength in materials


Single Sequoia Platform Mandatory Requirement is P ≥ 20

  • P is the “peak” of the machine, measured in petaFLOP/s
  • Target requirement is P + S ≥ 40
  • S is a weighted average over five “marquee” benchmark codes (see the sketch after this list)
  • Four code-package benchmarks
    − UMT, IRS, AMG, and SPhot
    − Program goal is 24x the Purple capability throughput
  • One “science workload” benchmark from SNL
    − LAMMPS (molecular dynamics)
    − Program goal is 20x-50x BG/L for science capability

(Reference points: Purple – 100 TF/s; BlueGene/L – 367 TF/s)
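As a rough illustration of how the figure of merit combines with peak, a minimal weighted-average sketch in C; the weights and per-code benchmark figures below are hypothetical placeholders, not values from the RFP:

    /* Sketch: combine peak P with a weighted average S of the five
       marquee benchmark figures (PF/s) and test the target P + S >= 40.
       All weights and benchmark values here are made-up placeholders. */
    #include <stdio.h>

    int main(void)
    {
        const char   *code[5]   = {"UMT", "IRS", "AMG", "SPhot", "LAMMPS"};
        const double  figure[5] = {22.0, 21.0, 19.0, 20.0, 18.0}; /* PF/s */
        const double  weight[5] = {0.2, 0.2, 0.2, 0.2, 0.2};      /* sum to 1 */
        const double  P = 20.0;                                   /* peak PF/s */

        double S = 0.0;
        for (int i = 0; i < 5; i++) {
            S += weight[i] * figure[i];   /* weighted average */
            printf("%-6s %.1f PF/s (weight %.2f)\n",
                   code[i], figure[i], weight[i]);
        }

        printf("S = %.1f, P + S = %.1f -> target P + S >= 40 %s\n",
               S, P + S, (P + S >= 40.0) ? "met" : "not met");
        return 0;
    }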


Sequoia Operating System Perspective

Lightweight kernel (LWK) on compute nodes (CN 1-N)

  • Optimized for scalability and reliability
  • As simple as possible
  • Extremely low OS noise
  • Direct access to interconnect hardware
  • OS features:
    − Linux/Unix syscall compatible, including I/O syscalls (function-shipped to the I/O nodes; see the sketch after the diagram below)
    − Support for dynamic library runtime loading
    − Shared memory regions
  • Open source

Linux/Unix OS on I/O nodes

  • Leverages the large Linux/Unix base and community
  • Enhanced TCP offload, PCIe, and I/O
  • Standard file systems: Lustre, NFSv4, etc.
  • Aggregates N compute nodes for I/O and administration
  • Open source

[Diagram: compute nodes and I/O node software. Each compute node (Sequoia CN and interconnect) runs an MPI application over glibc, NPTL POSIX threads plus OpenMP and SE/TM, glibc dynamic loading, the ADI hardware transport, RAS, futex support, shared memory, and function-shipped syscalls. The I/O node (Sequoia ION and interconnect) runs Linux/Unix with the Lustre client, NFSv4, LNet, SLURMD, FSD, performance tools, and TotalView, bridging to external networks over TCP/IP and UDP.]
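Because I/O syscalls are function-shipped, ordinary POSIX I/O in application code runs unchanged on the compute nodes; a minimal sketch (the per-rank file name is illustrative):

    /* Sketch: plain POSIX I/O from a compute node. Under the LWK the
       fopen/fprintf/fclose syscalls are function-shipped to the I/O
       node, which performs them against the parallel file system;
       the application source is identical to what runs under Linux. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        char path[256];                      /* per-rank file; name illustrative */
        snprintf(path, sizeof(path), "dump.%d", rank);

        FILE *f = fopen(path, "w");          /* shipped to the ION */
        if (f) {
            fprintf(f, "rank %d state\n", rank);
            fclose(f);                       /* ION flushes to Lustre */
        }

        MPI_Finalize();
        return 0;
    }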


Sequoia Software Stack – Applications Perspective

[Diagram: the application sits on MPI2; OpenMP, threads, and SE/TM; the C library/F03 runtime; optimized and parallel math libraries; and C/C++/Fortran compilers plus Python, all over the LWK and Linux/Unix. User space provides sockets (TCP, UDP, IP), the Lustre client, and LNet above the ADI interconnect interface; kernel space handles function-shipped syscalls out to the external network. SLURM/Moab, RAS, the control system, and the code development tools infrastructure surround the stack.]


The tools that users know and love will be available on Sequoia with improvements and additions as needed

[Chart: tools arranged by focus (infrastructure, debugging, performance) and by operational scale (1 up to 10^7), marked as existing or new. Infrastructure: Dyninst, PAPI, StackWalker, OpenMP Profiling Interface, MRNet, PMPI, APAI, DPCL, Valgrind, OTF, LaunchMON, SE/TM Monitor. Debugging: STAT, TotalView (TV), memlight, memP, ThreadCheck, MemCheck, SE/TM Debugger, new lightweight focus tools. Performance: mpiP, TAU, Open|SpeedShop (O|SS), OpenMP Analyzer, gprof, SE/TM Analyzer.]
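Several of the performance tools above (mpiP, for example) are built on the MPI standard's PMPI profiling interface; a minimal sketch of the interposition pattern, with a toy counter rather than any particular tool's logic:

    /* Sketch: PMPI interposition. Every MPI routine also exists under
       a PMPI_ name, so a tool can define the public symbol and forward
       to the real implementation. This toy wrapper just counts sends. */
    #include <mpi.h>
    #include <stdio.h>

    static long send_count = 0;

    int MPI_Send(void *buf, int count, MPI_Datatype datatype,
                 int dest, int tag, MPI_Comm comm)
    {
        send_count++;                                   /* profiling hook */
        return PMPI_Send(buf, count, datatype, dest, tag, comm);
    }

    int MPI_Finalize(void)
    {
        int rank;
        PMPI_Comm_rank(MPI_COMM_WORLD, &rank);
        printf("rank %d made %ld MPI_Send calls\n", rank, send_count);
        return PMPI_Finalize();
    }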


Application programming requirements and challenges

  • Availability of 1.6M cores pushes all-MPI codes to extreme concurrency
  • Availability of many threads on many SMP cores encourages low-level parallelism for higher performance
  • The mixed MPI/SMP programming environment, and the possibility of heterogeneous compute distribution, bring load imbalance to the fore (see the measurement sketch below)
  • I/O and visualization requirements encourage innovative strategies to minimize memory and bandwidth bottlenecks

[Diagram: MPI scaling, SMP threads, hybrid models, I/O & visualization.]
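One simple way to make load imbalance visible is to compare the slowest rank's compute time to the mean; a sketch (do_work() is a hypothetical stand-in for an application's compute phase):

    /* Sketch: measure load imbalance as max/mean of per-rank compute
       time. A ratio well above 1.0 means ranks sit idle at the next
       synchronization point. */
    #include <mpi.h>
    #include <stdio.h>

    static void do_work(int rank) { (void)rank; /* placeholder phase */ }

    int main(int argc, char **argv)
    {
        int rank, nprocs;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        double t0 = MPI_Wtime();
        do_work(rank);
        double t = MPI_Wtime() - t0;

        double tmax, tsum;
        MPI_Reduce(&t, &tmax, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
        MPI_Reduce(&t, &tsum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
        if (rank == 0)
            printf("imbalance = %.2f (max/mean)\n", tmax / (tsum / nprocs));

        MPI_Finalize();
        return 0;
    }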



The RFP asked interested vendors to address a “Unified Nested Node Concurrency” model

  • MPI tasks on a node are processes (one is shown) with multiple OS threads (Thread0-3 shown)
  • Thread0 is the “main thread”; Thread1-3 are helper threads that morph from Pthreads to OpenMP workers to TM/SE compiler-generated threads via runtime support (a sketch of the pattern follows the diagram below)
  • Hardware support significantly reduces overheads for thread repurposing and for OpenMP loops and locks

[Diagram: timeline of one MPI task's four threads (Thread0-3) from MPI_INIT to MPI_FINALIZE, with OpenMP worker (W) activity on Threads 1-3: 1) Pthreads born with MAIN; 2) only Thread0 calls functions to nest parallelism and makes MPI calls; 3) Pthreads-based MAIN calls OpenMP-based Funct1; 4) OpenMP Funct1 calls TM/SE-based Funct2; 5) Funct2 returns to OpenMP-based Funct1; 6) Funct1 returns to Pthreads-based MAIN, which exits.]
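A minimal sketch of this funneled nesting pattern using MPI and OpenMP (funct1 and its loop are illustrative; the TM/SE layer and thread-morphing runtime are omitted):

    /* Sketch: one MPI process per node whose main thread (Thread0)
       funnels all MPI calls, while the helper threads are repurposed
       as OpenMP workers inside funct1(). */
    #include <mpi.h>
    #include <stdio.h>

    static void funct1(int rank)
    {
        double sum = 0.0;
        /* Helper threads morph into OpenMP workers for this region. */
        #pragma omp parallel for reduction(+:sum)
        for (int i = 0; i < 1000000; i++)
            sum += (double)i;
        printf("rank %d: sum = %g\n", rank, sum);
    }

    int main(int argc, char **argv)
    {
        int provided, rank;
        /* FUNNELED: only the main thread makes MPI calls. */
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        funct1(rank);                   /* nested OpenMP parallelism */
        MPI_Barrier(MPI_COMM_WORLD);    /* MPI call from Thread0 only */

        MPI_Finalize();
        return 0;
    }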


Previous systems have prepared the way for Sequoia

  • BG/L experience informs Dawn/Sequoia scalability
  • OpenMP and POSIX threads experience on Linux/AIX
  • Integrated codes regularly run at Purple capability
  • Dawn will be used for code development
    − SMP parallelism
    − Python
    − Larger memory per core than BG/L
    − Some critical UQ analysis as well
  • Sequoia will be a Tri-Lab ASC resource
    − Video conferences for coordination

[Photo: DAWN Initial Delivery.]


A diverse team and a new Scalable Application Preparation Project ensure success on Sequoia

  • LC Hotline, user training, and documentation address routine issues
  • The ADEPT team provides expertise in compilers, debuggers, and performance tools
  • Access to IBM experts, including an on-site IBM applications analyst
  • Staff to work closely with the application teams
  • Ongoing ANL/IBM/LLNL BlueGene collaboration
  • Engaging third-party vendors, university research partners, and the open source community


New Petascale Computing Enabling Technologies (PCET) LDRD is addressing key barriers to predictive simulation

[Chart: key barriers by core count, from Purple and BG/L through petascale toward exascale: debugging at 10^3 cores, load balance at 10^4, fault tolerance at 10^5, multicore at 10^6, vector FP units/accelerators? at 10^7, power? at 10^8. PCET creates essential capabilities for exascale core counts.]


PCET strategy mitigates risk to assure immediate impact on application drivers and longer-term success

  • Current capabilities: MPI large-grain parallelism; basic checkpoint/restart; ill-defined load imbalances; debugging < 4096 cores
  • Shorter-term payoff: load balance analysis; cache-oblivious data layouts; checkpoint compression (a toy instance follows this list); behavioral and performance equivalence classes
  • Terascale capabilities: multicore-adapted algorithms; faster checkpoint/restart; understood load imbalances; targeted debugging
  • Petascale capable and exascale prepared: multicore-aware algorithms; application-level fault tolerance; well-balanced application load; automated error analysis
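As a toy instance of the checkpoint-compression idea, a sketch using zlib (chosen only for illustration; it does not represent PCET's actual approach):

    /* Sketch: compress a checkpoint buffer before writing it out.
       Mostly-redundant field data often compresses well, cutting
       checkpoint I/O volume. Build with -lz. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <zlib.h>

    int main(void)
    {
        uLong  n     = 1 << 20;            /* pretend checkpoint: 1 MiB */
        Bytef *state = calloc(n, 1);       /* zeroed field data, best case */

        uLongf zlen = compressBound(n);    /* worst-case output size */
        Bytef *zbuf = malloc(zlen);

        if (compress(zbuf, &zlen, state, n) == Z_OK)
            printf("checkpoint: %lu -> %lu bytes (%.1fx smaller)\n",
                   (unsigned long)n, (unsigned long)zlen,
                   (double)n / (double)zlen);
        /* The zbuf/zlen pair is what would be written to disk. */

        free(zbuf);
        free(state);
        return 0;
    }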


Take-away: Computational science on Sequoia at full scale will be the culmination of many years of hard work

[Diagram: the path from innovative or evolutionary architecture ideas through R&D contracts, flexible contracts with targets as requirements, milestone progress, initial delivery and integration, to computational science R&D, with periodic and rigorous reviews along the way. We’re here with Dawn ID (initial delivery).]