Capacity Planning of Supercomputers: Simulating MPI Applications at Scale
Tom Cornebize, under the supervision of Arnaud Legrand
21 June 2017
Laboratoire d'Informatique de Grenoble, Ensimag - Grenoble INP
Introduction
Top500
• Sunway TaihuLight, China, #1: 93 Pflops, custom five-level hierarchy, 40,950 × 260 cores
• Tianhe-2, China, #2: 34 Pflops, fat tree, 32,000 × 12 cores + 48,000 Xeon Phi
• Piz Daint, Switzerland, #3: 20 Pflops, dragonfly, 5,272 × (8 cores + 1 GPU)
• Stampede, United States, #20: 5 Pflops, fat tree, 6,400 × (8 cores + 1 Xeon Phi)
High Performance LINPACK (HPL)
Benchmark used to establish the Top500.
LU factorization, A = L × U
Complexity: flop(N) = 2/3 × N³ + 2 × N² + O(N)
Algorithm:
allocate the matrix
for k = N to 0 do
    allocate the panel
    various functions (max, swap, …)
    compute the inverse
    broadcast
    update
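A schematic C rendering of this loop may help connect it to the kernels discussed later; this is only a sketch with illustrative stub names, not HPL's actual code (in HPL, the update step is where the BLAS routines dtrsm and dgemm are called).

    /* Schematic sketch of a right-looking blocked LU factorization, in the
     * spirit of the loop above; all helpers are illustrative stubs, not
     * HPL's real functions. The loop advances one block column (of width NB)
     * per iteration, which corresponds to the shrinking trailing matrix of
     * the slide's "for k = N to 0". */
    static void allocate_matrix(int n)                { (void)n; }
    static void allocate_panel(int k, int nb)         { (void)k; (void)nb; }
    static void factorize_panel(int k, int nb)        { (void)k; (void)nb; } /* pivot search (max), row swaps, ... */
    static void compute_inverse(int k, int nb)        { (void)k; (void)nb; } /* inverse of the diagonal block */
    static void broadcast_panel(int k, int nb)        { (void)k; (void)nb; } /* panel sent to the other processes */
    static void update_trailing_matrix(int k, int nb) { (void)k; (void)nb; } /* dtrsm + dgemm */

    void hpl_like_lu(int N, int NB) {
      allocate_matrix(N);
      for (int k = 0; k < N; k += NB) {
        allocate_panel(k, NB);
        factorize_panel(k, NB);
        compute_inverse(k, NB);
        broadcast_panel(k, NB);
        update_trailing_matrix(k, NB);
      }
    }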
Open questions in HPC
• Topology (torus, fat tree, dragonfly, etc.)
• Routing algorithm
• Scheduling (when? where?)
• Workload (job size, behavior)
Keywords: capacity planning, co-design
Simulation may help
Simulation of HPC applications
• Off-line (trace replay): a time-stamped trace of MPI operations is replayed, e.g.
    - P5: MPI_Recv at t=0.872s
    - P3: MPI_Wait at t=0.881s
    - P7: MPI_Send at t=1.287s
    [...]
• On-line (emulation): the application itself runs (processes P0 … P7), with its communications and computations mapped onto the simulated platform
SimGrid supports both approaches.
Objective: simulation of Stampede's execution of HPL
Real execution:
• Matrix of size 3,875,000
• Using 6,006 MPI processes
• About 2 hours
Requirements for emulating Stampede's execution:
• ≥ 3,875,000² × 8 bytes ≈ 120 terabytes of memory
• ≥ 6,006 × 2 hours ≈ 500 days
And these estimates are very optimistic.
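Spelling out the arithmetic behind these two lower bounds:

    \[
      3{,}875{,}000^2 \times 8\,\mathrm{B} \approx 1.2 \times 10^{14}\,\mathrm{B} \approx 120\,\mathrm{TB},
      \qquad
      6{,}006 \times 2\,\mathrm{h} = 12{,}012\,\mathrm{h} \approx 500\,\mathrm{days}.
    \]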
Scalable HPL simulation
Methodology
Several optimizations. For each of them:
• Evaluate the (possible) loss of prediction accuracy
• Evaluate the (possible) gain in performance
Publicly available:
• Laboratory notebook
• Modified HPL
• Scripts
• Modifications to SimGrid (integrated in the main project)
Computation kernel sampling
Problem: dgemm and dtrsm account for ≥ 90% of the simulation time.
Solution: model these two functions and inject their predicted duration instead of executing them.
[Figure: linear regression of dgemm (time vs. M × N × K) and of dtrsm (time vs. M × N²)]
T_dgemm(M, N, K) = M × N × K × 1.706348 × 10⁻¹⁰
T_dtrsm(M, N) = M × N² × 8.624970 × 10⁻¹¹
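A minimal sketch of how such a model can be injected in place of the real BLAS calls. The helper names are illustrative and the coefficients mirror the regressions above; smpi_execute_benched is assumed to be the SimGrid/SMPI call that advances the simulated clock by a given duration without doing real work (treat the exact function name as an assumption and check the installed SimGrid's API).

    /* Sketch: replace the real kernel call by an injected, modeled duration.
     * Assumes SimGrid/SMPI and a function such as smpi_execute_benched()
     * that advances the simulated time (name to be checked against the
     * installed SimGrid); coefficients come from the regressions above. */
    #include <smpi/smpi.h>

    #define DGEMM_COEFF 1.706348e-10   /* seconds per M*N*K unit */
    #define DTRSM_COEFF 8.624970e-11   /* seconds per M*N*N unit */

    /* Called instead of cblas_dgemm(...): no computation, only simulated time. */
    static void injected_dgemm(int M, int N, int K) {
      smpi_execute_benched((double)M * (double)N * (double)K * DGEMM_COEFF);
    }

    /* Called instead of cblas_dtrsm(...). */
    static void injected_dtrsm(int M, int N) {
      smpi_execute_benched((double)M * (double)N * (double)N * DTRSM_COEFF);
    }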
Computation pruning
Problem: 68% of the simulation time is spent in HPL itself.
Culprits:
• Initialization and verification functions
• Other BLAS and HPL functions
Solution: just skip them.
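One simple way to skip such phases is a compile-time guard when building HPL for simulation; the flag name SMPI_OPTIMIZATION and the helper functions below are hypothetical placeholders meant to illustrate the idea, not the thesis's actual patch.

    /* Illustrative sketch only: skip phases that are costly at scale but do
     * not influence the predicted execution time (matrix generation,
     * residual verification). SMPI_OPTIMIZATION and the helper names are
     * hypothetical placeholders. Build with -DSMPI_OPTIMIZATION to prune. */
    #include <stdio.h>

    static void generate_random_matrix(void) { puts("generating matrix"); }
    static void factorize(void)              { puts("factorizing (durations injected)"); }
    static void verify_solution(void)        { puts("checking residual"); }

    int main(void) {
    #ifndef SMPI_OPTIMIZATION
      generate_random_matrix();   /* skipped when simulating */
    #endif
      factorize();                /* always kept */
    #ifndef SMPI_OPTIMIZATION
      verify_solution();          /* skipped when simulating */
    #endif
      return 0;
    }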
Reducing the memory consumption
Problem: memory consumption is still too large.
Solution: use SMPI_SHARED_MALLOC.
[Figure: the buffer's virtual pages are all mapped onto the same physical pages]
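A sketch of how a large, content-irrelevant buffer can be allocated this way; SMPI_SHARED_MALLOC and SMPI_SHARED_FREE are provided by SimGrid's smpi.h (the program is meant to be compiled with smpicc and run with smpirun), and the matrix order used here is only illustrative.

    /* Sketch: allocate the (content-irrelevant) matrix storage through
     * SimGrid's shared malloc so that its virtual pages can be backed by a
     * small amount of physical memory. Compile with smpicc, run with smpirun. */
    #include <mpi.h>
    #include <smpi/smpi.h>

    int main(int argc, char *argv[]) {
      MPI_Init(&argc, &argv);

      size_t n = 100000;                               /* illustrative matrix order */
      double *A = SMPI_SHARED_MALLOC(n * n * sizeof(double));
      /* A's contents are meaningless (pages are shared), which is fine here:
       * only the time of the operations on A is being simulated. */

      SMPI_SHARED_FREE(A);
      MPI_Finalize();
      return 0;
    }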
Reducing the memory consumption (#154)
Problem: the panel buffers. They must remain contiguous, yet they mix matrix parts (which can be shared) with indices (which must not be shared).
Solution: SMPI_PARTIAL_SHARED_MALLOC, which allows an arbitrary number of shared and private blocks within a single allocation.
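A sketch of a panel-like allocation where the matrix payload is shared while a small header of indices stays private. The signature used here, SMPI_PARTIAL_SHARED_MALLOC(size, shared_block_offsets, nb_shared_blocks) with [start, end) byte offsets for each shared block, reflects the SimGrid interface as I understand it; treat it as an assumption and check smpi.h for the exact definition.

    /* Sketch: one contiguous buffer, a private header (indices) followed by a
     * shared payload (matrix parts). Offsets give [start, end) of each shared
     * block; the interface details should be checked against SimGrid's smpi.h. */
    #include <mpi.h>
    #include <smpi/smpi.h>

    #define INDEX_BYTES  4096u              /* private: pivot indices, preserved */
    #define MATRIX_BYTES (1024u * 1024u)    /* shared: matrix payload, content-irrelevant */

    int main(int argc, char *argv[]) {
      MPI_Init(&argc, &argv);

      size_t total = INDEX_BYTES + MATRIX_BYTES;
      size_t shared_offsets[] = { INDEX_BYTES, total };   /* one shared block */
      char *panel = SMPI_PARTIAL_SHARED_MALLOC(total, shared_offsets, 1);

      int *indices  = (int *)panel;                       /* private part: values kept */
      double *parts = (double *)(panel + INDEX_BYTES);    /* shared part: values not kept */
      indices[0] = 42;
      (void)parts;

      SMPI_SHARED_FREE(panel);   /* assumed to also free partially shared buffers */
      MPI_Finalize();
      return 0;
    }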