

  1. CSE 262 Lecture 13: Communication overlap (continued) and heterogeneous processing

  2. Announcements • Final presentations: Friday, March 13th, 10:00 AM to 1:00 PM (note time change) • Room 3217, CSE Building (EBU3B)

  3. An alternative way to hide communication • Reformulate MPI code into a data-driven form: decouple scheduling and communication handling from the application; automatically overlap communication with computation [Diagram: an SPMD MPI code (Irecv, Send, Wait, Compute) becomes a task dependency graph that the runtime system schedules dynamically using worker threads and communication handlers] A sketch of the SPMD MPI pattern being transformed follows below.
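To make the SPMD side of that diagram concrete, here is a minimal, hypothetical sketch of the Irecv/Send/Wait/Compute exchange pattern that the data-driven transformation targets; the function name and buffers are illustrative and not taken from the course code.

#include <mpi.h>

/* Sketch of a typical SPMD MPI exchange: post receives, issue sends,
 * wait, then compute.  The runtime described on this slide instead turns
 * each piece into a task and fires the computation when its inputs arrive. */
void exchange_then_compute(double *left_halo, double *right_halo,
                           double *left_edge, double *right_edge,
                           int n, int left, int right, MPI_Comm comm)
{
    MPI_Request reqs[4];

    /* Irecv i, Irecv j: post receives for both neighbors. */
    MPI_Irecv(left_halo,  n, MPI_DOUBLE, left,  0, comm, &reqs[0]);
    MPI_Irecv(right_halo, n, MPI_DOUBLE, right, 1, comm, &reqs[1]);

    /* Send i, Send j: send edge data to both neighbors. */
    MPI_Isend(left_edge,  n, MPI_DOUBLE, left,  1, comm, &reqs[2]);
    MPI_Isend(right_edge, n, MPI_DOUBLE, right, 0, comm, &reqs[3]);

    /* Wait: the process blocks here until all messages complete ... */
    MPI_Waitall(4, reqs, MPI_STATUSES_IGNORE);

    /* ... and only then does Comp run on the freshly received halos. */
    /* compute(left_halo, right_halo, ...); */
}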

  4. Bamboo Programming Model • Olap regions are task switching points: data availability is checked at entry; only one olap region may be active at a time; when a task is ready, some olap region's input conditions have been satisfied • Send blocks: hold send calls only; enable the olap region • Receive blocks: hold receive and/or send calls; receive calls are inputs to the olap region; send calls are outputs of the olap region • Activities in send blocks must be independent of those in receive blocks • MPI_Wait/MPI_Waitall can reside anywhere within the olap region [Diagram: program phases φ1 … φN map to overlap regions OLAP 1 … OLAP N] The annotation skeleton from the slide:
#pragma bamboo olap
{
  #pragma bamboo send
  { ... }
  #pragma bamboo receive
  { ... }
}
... computation ...
#pragma bamboo olap
{ ... }
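As a concrete illustration of the model, here is a hypothetical sketch of these pragmas wrapped around an ordinary 1D halo exchange. Only the pragma forms and the block semantics come from the slide; the function, buffers, and neighbor ranks are assumptions made for the example.

#include <mpi.h>

void halo_exchange(double *left_edge, double *right_edge,
                   double *left_halo, double *right_halo,
                   int n, int left, int right, MPI_Comm comm)
{
    MPI_Request req[4];

#pragma bamboo olap
    {
#pragma bamboo send
        {
            /* Send block: holds send calls only; enables the olap region. */
            MPI_Isend(right_edge, n, MPI_DOUBLE, right, 0, comm, &req[0]);
            MPI_Isend(left_edge,  n, MPI_DOUBLE, left,  1, comm, &req[1]);
        }
#pragma bamboo receive
        {
            /* Receive block: these receives are the inputs to the olap region. */
            MPI_Irecv(left_halo,  n, MPI_DOUBLE, left,  0, comm, &req[2]);
            MPI_Irecv(right_halo, n, MPI_DOUBLE, right, 1, comm, &req[3]);
        }
        /* Per the slide, MPI_Wait/MPI_Waitall may sit anywhere in the region. */
        MPI_Waitall(4, req, MPI_STATUSES_IGNORE);
    }
    /* Computation on the freshly received halos follows the olap region. */
}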

  5. Results • Stampede at TACC: 102,400 cores; dual-socket Sandy Bridge processors; K20 GPUs • Cray XE6 at NERSC (Hopper): 153,216 cores; dual-socket 12-core Magny-Cours; 4 NUMA nodes per Hopper node, each with 6 cores; 3D toroidal network • Cray XC30 at NERSC (Edison): 133,824 cores; dual-socket 12-core Ivy Bridge; Dragonfly network

  6. Stencil application performance (Hopper) • Solve the 3D Laplace equation, Δu = 0 with Dirichlet boundary conditions u = f on ∂Ω, using a 7-point stencil (N = 3072³) • Added 4 Bamboo pragmas to a 419-line MPI code (a sketch of the 7-point update follows below) [Chart: TFLOPS/s (0-40) vs. core count (12288, 24576, 49152, 98304) for MPI-basic, MPI-olap, Bamboo-basic, and MPI-nocomm]
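For reference, a minimal sketch of the 7-point Jacobi update for the Laplace problem on an nx x ny x nz local block with a one-cell ghost layer; this is illustrative rather than the benchmark code, and the indexing macro and array names are assumptions.

/* Interior update: for Laplace's equation the new value is the average of
 * the six face neighbors; ghost cells (indices 0 and n+1) hold halo data. */
#define IDX(i, j, k) ((((i) * (ny + 2)) + (j)) * (nz + 2) + (k))

void jacobi7(const double *u, double *unew, int nx, int ny, int nz)
{
    for (int i = 1; i <= nx; i++)
        for (int j = 1; j <= ny; j++)
            for (int k = 1; k <= nz; k++)
                unew[IDX(i, j, k)] =
                    (u[IDX(i - 1, j, k)] + u[IDX(i + 1, j, k)] +
                     u[IDX(i, j - 1, k)] + u[IDX(i, j + 1, k)] +
                     u[IDX(i, j, k - 1)] + u[IDX(i, j, k + 1)]) / 6.0;
}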

  7. 2D Cannon, weak scaling study (Edison) • Communication cost: 11% to 39% • Bamboo improves on MPI-basic by 9% to 37% • Bamboo outperforms MPI-olap at scale (a skeleton of 2D Cannon follows below) [Chart: TFLOPS/s (0-1400) vs. core count (4096, 16384, 65536) for MPI-basic, MPI-olap, Bamboo, and MPI-nocomm; matrix sizes N0/4^(2/3), N0/4^(1/3), and N = N0 = 196608 at the successive core counts]
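For context, here is the standard 2D Cannon skeleton underlying this benchmark; this is an illustrative sketch rather than the measured source, and it assumes a periodic q x q Cartesian communicator with nb x nb row-major local blocks.

#include <mpi.h>

/* Illustrative skeleton of 2D Cannon (not the measured source).  Assumes a
 * periodic q x q Cartesian communicator 'grid' and nb x nb row-major local
 * blocks A, B, C, with C zero-initialized by the caller. */
static void local_multiply(const double *A, const double *B, double *C, int nb)
{
    for (int i = 0; i < nb; i++)
        for (int k = 0; k < nb; k++)
            for (int j = 0; j < nb; j++)
                C[i * nb + j] += A[i * nb + k] * B[k * nb + j];
}

void cannon_2d(double *A, double *B, double *C, int nb, MPI_Comm grid)
{
    int dims[2], periods[2], coords[2], src, dst;
    MPI_Cart_get(grid, 2, dims, periods, coords);
    int q = dims[0];

    /* Initial skew: A(i,j) shifts left by i columns, B(i,j) up by j rows. */
    MPI_Cart_shift(grid, 1, -coords[0], &src, &dst);
    MPI_Sendrecv_replace(A, nb * nb, MPI_DOUBLE, dst, 0, src, 0, grid, MPI_STATUS_IGNORE);
    MPI_Cart_shift(grid, 0, -coords[1], &src, &dst);
    MPI_Sendrecv_replace(B, nb * nb, MPI_DOUBLE, dst, 0, src, 0, grid, MPI_STATUS_IGNORE);

    for (int step = 0; step < q; step++) {
        local_multiply(A, B, C, nb);           /* C += A * B on local blocks */
        /* Shift A one column left and B one row up for the next step. */
        MPI_Cart_shift(grid, 1, -1, &src, &dst);
        MPI_Sendrecv_replace(A, nb * nb, MPI_DOUBLE, dst, 1, src, 1, grid, MPI_STATUS_IGNORE);
        MPI_Cart_shift(grid, 0, -1, &src, &dst);
        MPI_Sendrecv_replace(B, nb * nb, MPI_DOUBLE, dst, 2, src, 2, grid, MPI_STATUS_IGNORE);
    }
}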

  8. Communication-Avoiding Matrix Multiplication (Hopper) • Pathological matrices (Ng³ x Ne) arise in plane-wave basis methods for ab initio molecular dynamics; for Si: Ng = 140, Ne = 2000 • Weak scaling study using OpenMP; 23 pragmas added to a 337-line code [Chart: TFLOPS/s vs. core count (4096, 8192, 16384, 32768) for MPI+OMP, MPI+OMP-olap, Bamboo+OMP, and MPI+OMP-nocomm; matrix sizes N = N0 = 20608, 2^(1/3) N0, 2^(2/3) N0, and 2N0; peak 210.4 TF]

  9. Virtualization Improves Performance [Two charts: Cannon 2.5D (MPI) and Jacobi (MPI+OMP), showing the effect of the virtualization factor; legend entries include c=2, VF=8; c=2, VF=4; c=2, VF=2; c=4, VF=2]

  10. Virtualization Improves Performance [Two charts: Cannon 2D (MPI) and Jacobi (MPI+OMP), showing the effect of the virtualization factor]

  11. High Performance Linpack (HPL) on Stampede • Solve systems of linear equations using LU factorization • Latency-tolerant lookahead: the basic code is complicated • Results (2048 cores on Stampede): Bamboo meets the performance of the highly optimized lookahead version of HPL while using the far simpler non-lookahead version; task prioritization is crucial; Bamboo improves the baseline version of HPL by up to 10% [Chart: TFLOP/s (21.5-28) vs. matrix size (139264-180224) for Olap, Unprioritized Scheduling, and Prioritized Scheduling]

  12. Bamboo on multiple GPUs • MPI+CUDA programming model: the CPU is the host and the GPU works as a device; host-host communication uses MPI and host-device transfers use CUDA; the MPI and CUDA portions are optimized separately • Need a GPU-aware programming model: allow a device to transfer data to another device; the compiler and runtime system handle the data transfer; hide both host-device and host-host communication automatically (a sketch of the MPI+CUDA staging pattern follows below) [Diagram: under MPI+CUDA, MPI0 and MPI1 exchange with send/recv while each stages data to its GPU with cudaMemcpy; under the GPU-aware model, the runtime system (RTS) handles communication between GPU tasks x and y directly]
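As an illustration of what the GPU-aware runtime hides, here is a hypothetical sketch of the manual MPI+CUDA staging pattern (device to host, host to host, host to device); all names are assumptions made for the example, not Bamboo's runtime code.

#include <mpi.h>
#include <cuda_runtime.h>

/* Manual staging: data moves device -> host with cudaMemcpy, host -> host
 * with MPI, then host -> device on the receiving rank. */
void exchange_via_host(double *d_send, double *d_recv, double *h_send,
                       double *h_recv, int n, int peer, MPI_Comm comm)
{
    /* Host-device transfer: stage the outgoing data on the host. */
    cudaMemcpy(h_send, d_send, n * sizeof(double), cudaMemcpyDeviceToHost);

    /* Host-host transfer: ordinary MPI between the two ranks. */
    MPI_Sendrecv(h_send, n, MPI_DOUBLE, peer, 0,
                 h_recv, n, MPI_DOUBLE, peer, 0, comm, MPI_STATUS_IGNORE);

    /* Host-device transfer: move the received data onto this rank's GPU. */
    cudaMemcpy(d_recv, h_recv, n * sizeof(double), cudaMemcpyHostToDevice);
}
/* In the GPU-aware model described on the slide, the runtime system performs
 * and overlaps all three steps on the application's behalf. */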

  13. 3D Jacobi, weak scaling study • Results on Stampede: Bamboo-GPU outperforms MPI-basic; Bamboo-GPU and MPI-olap hide most communication overheads • Bamboo-GPU improves performance by hiding host-host transfers, hiding host-device transfers, and letting tasks residing on the same GPU send only the address of the message [Chart: GFLOP/s (0-1400) vs. GPU count (4, 8, 16, 32) for MPI-basic, MPI-olap, Bamboo, Bamboo-GPU, and MPI-nocomm]

  14. Multigrid, weak scaling (Edison) • A geometric multigrid solver for Helmholtz's equation [Williams et al. '12] • V-cycle: restrict, smooth, solve, interpolate, smooth • Smooth: red-black Gauss-Seidel, DRAM-avoiding with the wave-front method (a sketch of the smoother follows below) • Results: communication cost is 16%-22%; Bamboo improves performance by up to 14%; communication overlap is effective on levels L0 and L1 [Chart: time (secs) vs. cores (2048-32768) for MPI, Bamboo, and MPI-nocomm]
     Per-level breakdown (times in seconds; rightmost columns give Comm/total time at each level):
     Cores   Comm    Compute  pack/unpack  inter-box copy   L0    L1    L2    L3    L4
     2048    0.448   1.725    0.384        0.191            12%   21%   36%   48%   48%
     4096    0.476   1.722    0.353        0.191            12%   24%   37%   56%   50%
     8192    0.570   1.722    0.384        0.191            13%   27%   45%   69%   63%
     16384   0.535   1.726    0.386        0.192            12%   30%   48%   53%   49%
     32768   0.646   1.714    0.376        0.189            17%   28%   44%   63%   58%
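For concreteness, a minimal 2D sketch of one red-black Gauss-Seidel sweep for a Helmholtz-type operator a*u - b*lap(u) = f on a uniform grid with spacing h; the layout, names, and coefficients are assumptions, and the solver on the slide works on 3D boxes with a DRAM-avoiding wave-front variant of this smoother.

/* One colored sweep on an n x n grid stored row-major; 'color' selects the
 * parity of i+j updated this sweep, so a full smooth is color 0 then color 1. */
void smooth_rb(double *u, const double *f, int n,
               double a, double b, double h, int color)
{
    double diag = a + 4.0 * b / (h * h);
    for (int i = 1; i < n - 1; i++)
        for (int j = 1 + ((i + color) & 1); j < n - 1; j += 2) {
            double nbrs = u[(i - 1) * n + j] + u[(i + 1) * n + j] +
                          u[i * n + (j - 1)] + u[i * n + (j + 1)];
            u[i * n + j] = (f[i * n + j] + (b / (h * h)) * nbrs) / diag;
        }
}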

  15. A GPU-aware programming model • MPI+CUDA programming model: the CPU is the host and the GPU works as a device; host-host communication uses MPI and host-device transfers use CUDA; the MPI and CUDA portions are optimized separately • A GPU-aware programming model: allow a device to transfer data to another device; the compiler and runtime system handle the data transfer; we implemented a GPU-aware runtime system; it hides both host-device and host-host communication automatically [Diagram: MPI+CUDA staging via cudaMemcpy and MPI send/recv vs. the GPU-aware model, where the runtime system handles communication between GPU tasks, as on slide 12]

  16. 3D Jacobi, weak scaling (Stampede) • Results: Bamboo-GPU outperforms MPI-basic; Bamboo-GPU and MPI-olap hide most communication overheads • Bamboo-GPU improves performance by hiding host-host transfers, hiding host-device transfers, and letting tasks residing on the same GPU send only the address of the message [Chart: GFLOP/s (0-1400) vs. GPU count (4, 8, 16, 32) for MPI-basic, MPI-olap, Bamboo, Bamboo-GPU, and MPI-nocomm]

  17. Bamboo Design • Core message passing: supports point-to-point routines; requires programmer annotation; employs the Tarragon runtime system [Cicotti 06, 11] • Subcommunicator layer: supports MPI_Comm_split; no annotation required • Collectives: a framework to translate collective routines; Bamboo implements the common collectives; no annotation required • User-defined subprograms: a normal MPI program [Diagram: layered stack, from user-defined subprograms down through collectives and the subcommunicator layer to core message passing]

  18. Bamboo Translator [Pipeline diagram: annotated MPI input enters the EDG front-end, which builds a ROSE AST; the Bamboo middle-end (annotation handler, MPI extractor, analyzer, transformer, optimizer) performs MPI reordering, inlining, outlining, and translating; the ROSE back-end emits Tarragon code]

  19. Bamboo Transformations • Outlining: TaskGraph definition; fill the various Tarragon methods with the input source code blocks • MPI translation: capture MPI calls and generate calls to Tarragon; some MPI calls are removed, e.g. Barrier() and Wait(); conservative static analysis determines task dependencies • Code reordering: reorder certain code to accommodate Tarragon semantics
