

  1. On the Communication Complexity of 3D FFTs and Its Implications for Exascale. Kent Czechowski, Chris McClanahan, Casey Battaglino, Kartik Iyer, P.-K. Yeung, and Richard Vuduc.

  2. Exascale-ability Today: a 3D FFT with N = 4096^3 points takes 12.3 ⨯ 10^12 flops on 1.1 TB of data.

  3. Exascale-ability Tomorrow: a 3D FFT with N = 131,072^3 points takes 0.574 ⨯ 10^18 flops on 36 PB of data.
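
Both slides' figures follow from the standard operation count of 5 N log2 N flops for a complex FFT of N points and 16 bytes per double-precision complex value (standard assumptions, not stated on the slides). A quick check in Python:

      # Reproduce the flop and data counts quoted on the two slides above,
      # assuming 5*N*log2(N) flops per complex FFT and 16 bytes per point.
      from math import log2

      for n in (4096, 131_072):
          N = n ** 3                 # total grid points
          flops = 5 * N * log2(N)    # ~12.3e12 (n=4096), ~0.574e18 (n=131072)
          data = 16 * N              # bytes: ~1.1 TB and ~36 PB respectively
          print(f"n = {n}: {flops:.3e} flops, {data:.3e} bytes")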

  4. Swim lane 1 vs. Swim lane 2: what does each exascale path mean for the 3D FFT?

  5. Performance Model

  6. 3D FFT Decompositions. Problem size N = n ⨯ n ⨯ n.

  7. Pencil Decomposition. Problem size N = n ⨯ n ⨯ n.
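
A minimal sketch of pencil ownership, assuming an n^3 grid on a pr ⨯ pc process grid (the names and layout are illustrative, not from the deck): each process holds a bundle of full-length pencils, so every 1D FFT along the current dimension is node-local, and the transposes re-bundle pencils along the next dimension.

      # Which process owns the x-pencil at (j, k) of an n^3 grid
      # distributed over a pr x pc process grid.
      def pencil_owner(j, k, n, pr, pc):
          return (j // (n // pr), k // (n // pc))

      # Example: an 8^3 grid on a 2 x 2 process grid.
      assert pencil_owner(0, 0, 8, 2, 2) == (0, 0)
      assert pencil_owner(5, 6, 8, 2, 2) == (1, 1)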

  8. Distributed 3D FFT - Performance Model. [Diagram: 1D FFT compute stages along each dimension, separated by transpose (all-to-all) exchange steps.]

  9. Distributed 3D FFT - Performance Model. Each node computes n^2/P 1D FFTs of size n.
      Parameters: nodes P; cache capacity Z; compute throughput C_node; memory BW β_mem.
      Arithmetic computation time: T_flops = 3 × (n^2/P) × 5 n log n / C_node
      Memory access time: T_mem ≈ 3 × (n^2/P) × A × n × max(log_Z n, 1.0) / β_mem


  10. Distributed 3D FFT - Performance Model. Each node computes n^2/P 1D FFTs of size n.
      Memory access time: T_mem ≈ 3 × (n^2/P) × A × n × max(log_Z n, 1.0) / β_mem
      Per-FFT cache-transfer lower bound (Frigo 1999): Θ(1 + (n/L)(1 + log_Z n))
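
The two per-node terms translate directly into code. A sketch, assuming base-2 logarithms and treating A as bytes of memory traffic per grid point (the slide leaves A and the log base unspecified):

      from math import log2

      def t_flops(n, P, C_node):
          # 3 rounds of n^2/P node-local 1D FFTs at 5*n*log2(n) flops each
          return 3 * (n**2 / P) * 5 * n * log2(n) / C_node

      def t_mem(n, P, beta_mem, Z, A=16.0):
          # ~A*n*max(log_Z n, 1) bytes of traffic per 1D FFT; Z is the
          # cache capacity in grid points (complex words)
          log_Z_n = log2(n) / log2(Z)
          return 3 * (n**2 / P) * A * n * max(log_Z_n, 1.0) / beta_mem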

  11. Distributed 3D FFT - Performance Model. Transposes are √P-node all-to-all communications.
      Parameters: nodes P; network BW β_link.
      Network time: T_net ≈ 2 × n^3 / (β_link × P^(2/3))
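
Continuing the sketch above, here is the network term plus a rough end-to-end estimate using the Hopper parameters from the Test Machines slide. The P^(2/3) scaling (bisection of a 3D-torus-like network) and the 16-byte word factor are assumptions:

      def t_net(n, P, beta_link, kappa=2/3):
          # two all-to-all transposes; 16 bytes per double-complex point
          return 2 * 16 * n**3 / (beta_link * P**kappa)

      n, P = 4096, 6392
      print(t_flops(n, P, C_node=50.4e9))              # compute time, s
      print(t_mem(n, P, beta_mem=21.3e9, Z=6e6 / 16))  # memory time, s
      print(t_net(n, P, beta_link=10e9))               # network time, s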

  12. Validation

  13. 3D FFT Software: a distributed 3D FFT framework (p3dfft) layered on an optimized 1D FFT library (FFTW, ESSL, or MKL).

  14. 3D FFT Software: the same stack with CUFFT added as the GPU backend (FFTW, ESSL, MKL, or CUFFT under p3dfft).

  15. Test Machines.
      Hopper: 6,392 nodes; Opteron 6100 CPU; processor peak 50.4 GF/s; 6 cores; memory BW 21.3 GB/s; fast memory 6 MB; link BW 10 GB/s.
      Keeneland: 120 nodes (3 GPUs per node); Tesla M2070 GPU; processor peak 515 GF/s; 448 cores; memory BW 144 GB/s; fast memory 2.7 MB; link BW 2 GB/s.

  16. Artifacts

  17. GPU vs CPU Performance. [Chart: FFT performance on Keeneland, Gflop/s (0-700) vs. problem size (0-3000), GPU and CPU series.]

  18. GPU vs CPU Performance. [Same chart, annotated with a 20% difference between the GPU and CPU curves.]

  19. Artifacts - PCIe Bottleneck. [Diagram: node architecture with two 4-core CPUs and DDR3 DRAM, QPI links to I/O hubs, integrated InfiniBand, and three GPUs attached over PCIe x16.]
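
For a feel of why the PCIe hop is an artifact, a back-of-the-envelope comparison (all figures here are assumptions, not from the deck): staging an n^3 double-complex grid across a PCIe 2.0 x16 link at roughly 8 GB/s takes longer than computing the FFT itself at a sustained ~300 Gflop/s.

      from math import log2

      n = 512
      bytes_total = 16 * n**3                  # double-complex grid
      pcie_s = bytes_total / 8e9               # ~0.27 s over PCIe
      fft_s = 5 * n**3 * log2(n**3) / 300e9    # ~0.06 s of FFT compute
      print(pcie_s, fft_s)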

  20. Projections

  21. Predicting 2020 Technology. [Log-scale plot: component performance relative to 2010 technology, historical 1990-2010 and projected to 2020. Ten-year projected gains: compute 59⨯, cache 32⨯, network BW 22⨯, memory BW 10⨯.]

  22. Technology Extrapolation.
      CPU-based, 2010: processor peak 50.4 GF/s; memory BW 21.3 GB/s; fast memory 6 MB; link BW 10 GB/s; 79,400 processors.
      CPU-based, 2020: processor peak 3 TF/s; memory BW 206 GB/s; fast memory 192 MB; link BW 218 GB/s; 1.3 M processors.
      GPU-based, 2010: processor peak 515 GF/s; memory BW 144 GB/s; fast memory 2.7 MB; link BW 10 GB/s; 6,392 processors.
      GPU-based, 2020: processor peak 30 TF/s; memory BW 1.4 TB/s; fast memory 86.4 MB; link BW 218 GB/s; 135,000 processors.
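
The 2020 column is a doubling-time extrapolation, value_2020 = value_2010 × 2^(10 / doubling_time), with the doubling times taken from the parameter table at the end of the deck. A sketch:

      def extrapolate(value_2010, doubling_years, horizon_years=10):
          # exponential growth: one doubling every `doubling_years`
          return value_2010 * 2 ** (horizon_years / doubling_years)

      print(extrapolate(50.4e9, 1.7))   # CPU peak -> ~3.0e12 flop/s (59x)
      print(extrapolate(10e9, 2.25))    # link BW  -> ~218e9 B/s  (21.8x)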

  23. 3D FFTs at Exascale (2020, n = 21,000). [Bar chart: modeled time (s), split into memory and network communication.]
      GPU: 131k sockets; peak = 3.98 EF/s; bisection = 1.12 PB/s.
      CPU-1, same peak: 1M sockets; peak = 3.98 EF/s; bisection = 5.29 PB/s.
      (Bar segments: 0.19, 0.528, 0.116, and 0.112 seconds.)

  24. Performance vs Balance.
      GPU: memory BW 144 GB/s; cache 2.7 MB; processor 515 GFlop/s; flop/s : byte/s = 3.6.
      CPU: memory BW 21.3 GB/s; cache 6 MB; processor 50.4 GFlop/s; flop/s : byte/s = 2.3.
      GPU : CPU ratios: memory BW 6.7⨯, processor 10.2⨯.
      The GPU offers better performance, but is less balanced.

  25. Two Costs: T_memory + T_network. [Diagram: a grid of processor-memory pairs connected by a network.]


  26. Impact of Machine Balance. [Diagram: a few fat processor-memory nodes vs. many thin ones at the same aggregate peak.]
      Number of processors: P = R_peak / C_node.
      T_mem ≈ O( (n^3 log_Z n / β_mem) × (C_node / R_peak) )
      T_net ≈ O( (n^3 / β_link) × (C_node / R_peak)^κ )
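
The same trade-off in code, keeping only the big-O trends (constants dropped): at fixed machine peak R_peak, fatter nodes (larger C_node) mean fewer nodes, so the memory term grows linearly in C_node and the network term as its κ-th power, unless β_mem and β_link grow to match.

      def t_mem_trend(C_node, R_peak, n, beta_mem, log_Z_n):
          return (n**3 * log_Z_n / beta_mem) * (C_node / R_peak)

      def t_net_trend(C_node, R_peak, n, beta_link, kappa=2/3):
          return (n**3 / beta_link) * (C_node / R_peak) ** kappa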



  27. 3D FFTs at Exascale (2020, n = 21,000). [Bar chart: modeled time (s), memory vs. network, for four configurations.]
      GPU: 131k sockets; peak = 3.98 EF/s; bisection = 1.12 PB/s.
      CPU-1, same peak: 1M sockets; peak = 3.98 EF/s; bisection = 5.29 PB/s.
      CPU-2, same total: 350k sockets; peak = 1.04 EF/s; bisection = 2.16 PB/s.
      CPU-3, same overlap: 295k sockets; peak = 876 PF/s; bisection = 1.93 PB/s.
      (Bar segments: 0.19, 0.528, 0.444, 0.116, 0.307, 0.274, and 0.112 seconds.)

  28. Questions?

  29. Model parameters. 2010 node counts are scaled to reflect a 4 PF/s machine.
      Parameter (symbol)        2010 value                      Doubling time (yr)   10-yr factor   2020 value
      Peak (C)                  CPU 50.4 GF/s / GPU 515 GF/s    1.7                  59.0⨯          CPU 3.0 TF/s / GPU 30 TF/s
      Cores (ρ)                 CPU 6 / GPU 448                 1.87                 40.7⨯          CPU 134 / GPU 18k
      Memory BW (β_mem)         CPU 21.3 GB/s / GPU 144 GB/s    3.0                  9.7⨯           CPU 206 GB/s / GPU 1.4 TB/s
      Fast memory (Z)           CPU 6 MB / GPU 2.7 MB           2.0                  32.0⨯          CPU 192 MB / GPU 86.4 MB
      Line size (L)             CPU 64 B / GPU 128 B            10.2                 2.0⨯           CPU 128 B / GPU 256 B
      Link BW (β_link)          10 GB/s                         2.25                 21.8⨯          218 GB/s
      Machine peak (R_peak)     4 PF/s                          1.0                  1000⨯          4 EF/s
      System memory (E)         635 TB                          1.3                  208⨯           132 PB
      Nodes (P = R_peak / C)    CPU 79,400 / GPU 7,770          2.4                  17.4⨯          CPU 1.3M / GPU 135,000
