LOW-COMMUNICATION FFT WITH FAST MULTIPOLE METHOD
Cris Cecka, Senior Research Scientist. May 11, 2017
May 8-11, 2017 | Silicon Valley

THE FAST FOURIER TRANSFORM
Operation Count: $4N\log_2 N - 6N + 8$
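As a quick arithmetic check of the operation count above, the sketch below evaluates $4N\log_2 N - 6N + 8$ for a few power-of-two sizes. The function name `split_radix_flops` is made up for illustration, and restricting N to powers of two is an assumption of this sketch.

```python
def split_radix_flops(n):
    """Evaluate 4 N log2(N) - 6 N + 8 for a power-of-two N >= 2."""
    if n < 2 or n & (n - 1):
        raise ValueError("n must be a power of two, n >= 2")
    log2n = n.bit_length() - 1          # exact log2 for powers of two
    return 4 * n * log2n - 6 * n + 8

for n in (8, 1024, 2**27):
    print(n, split_radix_flops(n))
```

For N = 8 this gives 4·8·3 − 48 + 8 = 56 operations.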

SPLIT-RADIX FFT Algorithm

SPLIT-RADIX FFT Profile

FMM-FFT Edelman et al. 1999

STRUCTURED DENSE MATRICES AND FMM
• SVD: $A = U D V^*$
• Low-Rank: $K = U \tilde{K}_{r \times r} V^*$
• Hierarchically LR: $K_{IJ} = U_I \tilde{K}_{IJ} V_J^*$
• H-Semi-Separable: $K_{IJ} = U_I U_{\tilde I} \tilde{K}_{\tilde I \tilde J} V_{\tilde J}^* V_J^*$
• $H^2$-Matrix/FMM

FMM-FFT Algorithm
$M_{M,P} = \operatorname{diag}(I_M, C_1, \ldots, C_{P-1})$
$[C_p]_{mn} = \rho_p \left[ \cot\left( \frac{\pi}{M} \left( n - m + \frac{p}{P} \right) \right) + i \right]$
[Brace label on slide: 2D $M \times P$ FFT]

COT FMM
$[C_p]_{mn} = \rho_p \left[ \cot\left( \frac{\pi}{M} \left( n - m + \frac{p}{P} \right) \right) + i \right]$
• One dimensional
• Uniform: integers are source/target
• Periodic
• Distributed
• Size M-by-M
• P of them!
• Interleaved
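To make the "uniform" and "periodic" bullets concrete, here is a minimal sketch that builds one dense $C_p$ from the cotangent formula and checks that its entries depend only on $(n - m) \bmod M$. The transcript does not define $\rho_p$, so the default `rho = 1` is a placeholder assumption, and `cot_matrix` is an illustrative name.

```python
import math

def cot(x):
    return math.cos(x) / math.sin(x)

def cot_matrix(M, P, p, rho=1.0):
    """Dense M-by-M matrix [C_p]_{mn} = rho * (cot(pi/M * (n - m + p/P)) + i)."""
    if not 0 < p < P:
        raise ValueError("need 0 < p < P so the cotangent never hits a pole")
    return [[rho * (cot(math.pi / M * (n - m + p / P)) + 1j)
             for n in range(M)] for m in range(M)]

M, P, p = 8, 4, 1
C = cot_matrix(M, P, p)

# cot has period pi, so shifting both indices by one (with wrap-around)
# leaves every entry unchanged: the matrix is circulant.
circulant = all(abs(C[m][n] - C[(m + 1) % M][(n + 1) % M]) < 1e-9
                for m in range(M) for n in range(M))
print(circulant)
```

A dense build like this costs $O(M^2)$ per matrix and is only useful as a small-size reference; the FMM replaces it with the operators on the following slides.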

FMM OPERATORS
[Diagram: FMM tree with operators S2M, M2M, M2L, L2L, L2T, S2T; labels B=2, L=4, Q, $M/2^L$]
Each operator is an (implicit) matrix.
• S: "Source"
• T: "Target"
• M: "Multipole"
• L: "Local"

PARAMETERS OF THE FMM-FFT
• FFT: $N = MP$
• FMM: $(N, P, M_L, Q, B)$
• Rank: $Q$
• Base level: $B$
• Leaf box size: $M_L$
• Leaf level: $L = \log_2(M / M_L)$
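A small worked example of how these parameters relate, using the values quoted on the later parameter-dependence slides (N = 2^27, P = 256, M_L = 64). The helper name `fmm_fft_levels` and its divisibility checks are assumptions of this sketch, not part of the talk.

```python
def fmm_fft_levels(N, P, M_L):
    """Return (M, L) with N = M * P and L = log2(M / M_L)."""
    if N % P:
        raise ValueError("P must divide N")
    M = N // P
    if M % M_L:
        raise ValueError("M_L must divide M")
    boxes = M // M_L                     # number of leaf boxes, 2^L
    if boxes & (boxes - 1):
        raise ValueError("M / M_L must be a power of two")
    return M, boxes.bit_length() - 1

M, L = fmm_fft_levels(N=2**27, P=256, M_L=64)
print(M, L)  # 524288 13
```

So each of the P FMMs acts on M = 2^19 points, with a tree whose leaf level is L = 13.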

DISTRIBUTED FMM
[Diagram; the label sequence All2All, Gather, Halo 2b, Halo 2b, Halo 1b appears twice]

INTERPOLATIVE FMM
$C_{ij} = \ell_m(t_i^I)\, \ell_q(z_m^{I})\, C(z_q^{\tilde I}, z_r^{\tilde J})\, \ell_r(z_n^{J})\, \ell_n(s_j^J)$
(factors, left to right: L2T, L2L, M2L, M2M, S2M)
$\ell_i(z) = \prod_{0 \le k < Q,\ k \ne i} \frac{z - z_k}{z_i - z_k}$, $z_j = \cos\left( \frac{(2j+1)\pi}{2Q} \right)$
• Same operators across all boxes
• Same operators across all levels
• Almost same operators across all FMMs
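The Lagrange basis on Chebyshev nodes is the building block of every operator on this slide. The following sketch implements $\ell_i(z)$ and the nodes $z_j$ as written, then interpolates a smooth cotangent slice to show why a modest Q suffices away from the singularity. The test function `kernel_slice` and the evaluation point are hypothetical choices made only for this illustration.

```python
import math

def cheb_nodes(Q):
    """Chebyshev nodes z_j = cos((2j + 1) * pi / (2Q)) for j = 0..Q-1."""
    return [math.cos((2 * j + 1) * math.pi / (2 * Q)) for j in range(Q)]

def lagrange(i, z, nodes):
    """ell_i(z) = product over k != i of (z - z_k) / (z_i - z_k)."""
    out = 1.0
    for k, zk in enumerate(nodes):
        if k != i:
            out *= (z - zk) / (nodes[i] - zk)
    return out

def cot(x):
    return math.cos(x) / math.sin(x)

def kernel_slice(s):
    # Hypothetical smooth slice of a cotangent kernel: for s in [-1, 1] the
    # argument stays in [1, 2], away from the poles at 0 and pi.
    return cot(0.5 * (s + 3.0))

Q = 16
nodes = cheb_nodes(Q)
s = 0.3
interp = sum(kernel_slice(zq) * lagrange(q, s, nodes) for q, zq in enumerate(nodes))
err = abs(interp - kernel_slice(s))
unity = sum(lagrange(q, s, nodes) for q in range(Q))   # basis sums to one
print(err, unity)
```

The interpolation error shrinks geometrically with Q for well-separated arguments, which is what lets Q act as an accuracy knob.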

TENSOR REPRESENTATIONS
$A_{ijk\ell}$ := A[i + j*ldA<1> + k*ldA<2> + ℓ*ldA<3>]
• Input: $S_n \equiv S_{pm} \equiv S_{pmb}$
• Output: $T_n \equiv T_{pm} \equiv T_{pmb}$
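The indexing rule above means that $S_n$, $S_{pm}$ and $S_{pmb}$ are three views of one flat buffer. A minimal sketch follows, assuming a packed layout in which p varies fastest (so n = p + P·m) and m is split into a point-in-box index and a box index b; the layout, the toy sizes and the helper name `flat_offset` are assumptions of this example.

```python
def flat_offset(index, ld):
    """Offset of A[i, j, k, ...]: unit stride on the first index and
    leading dimensions ld[0], ld[1], ... on the remaining ones."""
    first, *rest = index
    return first + sum(x * s for x, s in zip(rest, ld))

P, M_L, boxes = 4, 2, 3            # toy sizes; M = M_L * boxes
M = M_L * boxes
S = list(range(P * M))             # S_n for n = 0..N-1, N = M * P

def S_pm(p, m):
    # Two-index view: n = p + P * m
    return S[flat_offset((p, m), (P,))]

def S_pmb(p, m, b):
    # Three-index view: m is now the point within box b
    return S[flat_offset((p, m, b), (P, P * M_L))]

print(S_pm(1, 3), S_pmb(1, 1, 1))
```

Both calls read the same element, since global point m = 3 is local point 1 of box 1 when M_L = 2. No data moves between views; only the leading dimensions change.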

S2M/L2T
$s_m = -1 + \frac{2m+1}{M_L}$
$\mathrm{S2M}_{qm} = \ell_q(s_m)$
$M^{L}_{(p-1)qb} = \mathrm{S2M}_{qm} S_{pmb}$
Computed with single BatchedGEMM

BATCHED MATRIX-MATRIX MULTIPLY
cublas<T>gemmStridedBatched in cuBLAS 8.0
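For readers unfamiliar with strided batching, the loop nest below is a plain-Python reference for the operation it performs: $C_i = \alpha A_i B_i + \beta C_i$ for each batch i, with column-major matrices stored at fixed strides in flat arrays. It is a sketch of the semantics only. The argument order loosely follows the cuBLAS call, but transposes and the handle are omitted, and nothing here is the library's implementation.

```python
def gemm_strided_batched(m, n, k, alpha, A, lda, strideA,
                         B, ldb, strideB, beta, C, ldc, strideC, batch_count):
    """Reference semantics: C_i = alpha * A_i @ B_i + beta * C_i per batch.
    A_i is m x k, B_i is k x n, C_i is m x n, all column-major in flat lists."""
    for i in range(batch_count):
        a0, b0, c0 = i * strideA, i * strideB, i * strideC
        for col in range(n):
            for row in range(m):
                acc = 0
                for x in range(k):
                    acc += A[a0 + row + x * lda] * B[b0 + x + col * ldb]
                c = c0 + row + col * ldc
                C[c] = alpha * acc + beta * C[c]

# One 2x2 matrix A reused for both batches (strideA = 0 in this reference loop),
# multiplied by the identity and by a column swap.
A = [1, 2, 3, 4]                       # columns (1, 2) and (3, 4)
B = [1, 0, 0, 1,   0, 1, 1, 0]
C = [0] * 8
gemm_strided_batched(2, 2, 2, 1, A, 2, 0, B, 2, 4, 0, C, 2, 4, 2)
print(C)  # [1, 2, 3, 4, 3, 4, 1, 2]
```

Reusing one small operator across many batches is the pattern that the FMM operators (same matrix for every box) map onto.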

S2M/L2T
$M_{pqb} = \mathrm{S2M}_{qm} S_{pmb} \implies M_{pq}[b] = S_{pm}[b]\, \mathrm{S2M}^T_{qm}$
$T_{pmb} = \mathrm{L2T}_{mq} L_{pqb} \implies T_{pm}[b] = L_{pq}[b]\, \mathrm{S2M}_{qm}$
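The contraction $M_{pqb} = \mathrm{S2M}_{qm} S_{pmb}$ can be sketched as one small matrix product per box, which is the shape a batched GEMM expects. The example below uses nested lists indexed `[b][p][m]` rather than the flat layout, ignores the $(p-1)$ index shift, and uses made-up source data; all of these are simplifications of this sketch.

```python
import math

def cheb_nodes(Q):
    return [math.cos((2 * j + 1) * math.pi / (2 * Q)) for j in range(Q)]

def lagrange(i, z, nodes):
    out = 1.0
    for k, zk in enumerate(nodes):
        if k != i:
            out *= (z - zk) / (nodes[i] - zk)
    return out

def s2m_matrix(Q, M_L):
    """S2M[q][m] = ell_q(s_m), with s_m = -1 + (2m + 1) / M_L."""
    nodes = cheb_nodes(Q)
    return [[lagrange(q, -1 + (2 * m + 1) / M_L, nodes) for m in range(M_L)]
            for q in range(Q)]

def s2m_apply(S, S2M):
    """For each box b: M[b] = S[b] @ S2M^T, with S[b] of size P x M_L
    and M[b] of size P x Q."""
    return [[[sum(s * w for s, w in zip(row, s2m_row)) for s2m_row in S2M]
             for row in S_b] for S_b in S]

P, M_L, Q, boxes = 3, 8, 4, 2
S = [[[complex(b + 1, p - m) for m in range(M_L)] for p in range(P)]
     for b in range(boxes)]
Mpole = s2m_apply(S, s2m_matrix(Q, M_L))

# The Lagrange basis sums to one, so the total of each box's sources
# equals the total of its Q multipole coefficients.
drift = max(abs(sum(Mpole[b][p]) - sum(S[b][p]))
            for b in range(boxes) for p in range(P))
print(drift)
```

L2T is the transposed operation: the same Q-by-M_L matrix maps Q local coefficients back to M_L target points in each box.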

M2M/L2L
$\mathrm{M2M}^{\pm}_{qk} = \ell_q\left( \frac{z_k \pm 1}{2} \right)$
$M^{\ell}_{pqb} = \mathrm{M2M}_{qk} M^{\ell+1}_{pk(2b)}$
$L^{\ell+1}_{pq(2b)} = \mathrm{L2L}_{qk} L^{\ell}_{pkb} + L^{\ell+1}_{pq(2b)}$
Computed with single BatchedGEMM
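One way to see why $\mathrm{M2M}^{\pm}_{qk} = \ell_q((z_k \pm 1)/2)$ is the right transfer matrix: a parent basis function restricted to a child box is a polynomial of degree below Q, so re-interpolating it on the child's Q nodes is exact. The sketch below checks that identity numerically; `m2m_matrix` and the sample point are illustrative choices, and which sign belongs to which child is not stated in the transcript.

```python
import math

def cheb_nodes(Q):
    return [math.cos((2 * j + 1) * math.pi / (2 * Q)) for j in range(Q)]

def lagrange(i, z, nodes):
    out = 1.0
    for k, zk in enumerate(nodes):
        if k != i:
            out *= (z - zk) / (nodes[i] - zk)
    return out

def m2m_matrix(Q, sign):
    """M2M[q][k] = ell_q((z_k + sign) / 2) for sign = +1 or -1."""
    nodes = cheb_nodes(Q)
    return [[lagrange(q, (zk + sign) / 2, nodes) for zk in nodes]
            for q in range(Q)]

Q = 8
nodes = cheb_nodes(Q)
x = 0.37          # arbitrary point in child coordinates, in [-1, 1]
worst = 0.0
for sign in (-1, +1):
    M2M = m2m_matrix(Q, sign)
    for q in range(Q):
        # Parent basis function evaluated through the child's interpolant ...
        via_child = sum(M2M[q][k] * lagrange(k, x, nodes) for k in range(Q))
        # ... versus evaluated directly at the mapped point.
        direct = lagrange(q, (x + sign) / 2, nodes)
        worst = max(worst, abs(via_child - direct))
print(worst)
```

Because the two Q-by-Q matrices do not depend on the box or the level, they can be shared by every GEMM in the batch.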

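S2T/M2L
$T_{pib} = \mathrm{S2T}_{p(j-i)} S_{pjb}$
$\mathrm{S2T}_{pk} = \begin{cases} \cot\left( \frac{\pi}{N} (p + Pk) \right) & p > 0 \\ \delta_{k0} & p = 0 \end{cases}$
$L^{\ell}_{pib} = \mathrm{M2L}^{\ell}_{pijs} M^{\ell}_{pj(b+s)}$
$\mathrm{M2L}^{\ell}_{pijs} = \cot\left( \frac{\pi}{2^{\ell}} \left( \frac{z_j - z_i}{2} + s \right) + \frac{\pi}{N} (p + 1) \right)$
• Also Level-3 Linear Algebra computations, but no BLAS primitives.
• CUSTOM KERNELS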
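The near-field kernel $\mathrm{S2T}_{pk}$ is the cotangent entry of $C_p$ rewritten with $N = MP$ and $k = n - m$, since $\frac{\pi}{M}(k + \frac{p}{P}) = \frac{\pi}{N}(p + Pk)$. The sketch below checks that identity numerically for small sizes; the scale factor $\rho_p$ and the $+i$ term of $C_p$ are left out, and the function name `s2t` is an illustrative choice.

```python
import math

def cot(x):
    return math.cos(x) / math.sin(x)

def s2t(p, k, N, P):
    """S2T_{pk} = cot(pi/N * (p + P*k)) for p > 0, Kronecker delta_{k0} for p = 0."""
    if p == 0:
        return 1.0 if k == 0 else 0.0
    return cot(math.pi / N * (p + P * k))

M, P = 16, 4
N = M * P
worst = 0.0
for p in range(1, P):
    for k in range(-3, 4):
        # Cotangent part of [C_p]_{mn} with k = n - m
        c_entry = cot(math.pi / M * (k + p / P))
        worst = max(worst, abs(s2t(p, k, N, P) - c_entry) / abs(c_entry))
print(worst)
```

Because the kernel depends only on p and the offset k = j − i, each FMM needs one short table of values rather than a dense block, which is why S2T is a Toeplitz-like product with no direct BLAS equivalent.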

INTERPOLATIVE FMM
Operator   Storage            Compute
S2T        $P(4M_L - 1)$      $3P\,2^L M_L^2$
S2M        $Q M_L$            $2PMQ$
M2M        $2Q^2$             $4(2^L - 2^B)PQ^2$
M2L        $4(L - B)PQ^2$     $3(2^L - 2^B)PQ^2$
L2L        $2Q^2$             $4(2^L - 2^B)PQ^2$
L2T        $Q M_L$            $2PMQ$

ALGORITHM

PROFILE

FMM-FFT PROFILE
[Profile figure; labeled phases: Halo, 2D FFT, S2M, M2M, S2T, L2L, L2T, M2L]

2xK40c FMM-FFT

2xP100 FMM-FFT

8xP100 FMM-FFT

FMM BREAKDOWN
Components
• T=ComplexDouble, A=2xP100
• B-GEMM and S2T dominate
• Small N
  • Latency: use 1 level
• Large N
  • Compute

EFFICIENCY
• >95% BatchedGEMM
• 60% S2T/M2L
• >90% FMM-FFT

PARAMETER DEPENDENCE: $M_L$
Points per box per FMM
• Trade #levels for S2T comp
• Flop count not enough
• Increase the intensity
• Tune performance for $M_L = 64$
• T=Z, A=2xP100, $N = 2^{27}$, P=256, B=3, Q=16
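The trade between tree depth and near-field work can be tabulated directly. The sketch below varies the leaf box size at the slide's settings (N = 2^27, P = 256) and reports the leaf level together with the $3P\,2^L M_L^2$ count from the operator cost slide. Treating that count as the S2T cost (each leaf box interacting with itself and two neighbours) is an assumption of this example, as is the helper name `leaf_tradeoff`.

```python
def leaf_tradeoff(N, P, M_L):
    """Return (L, near_field_count) for leaf box size M_L."""
    M = N // P
    boxes = M // M_L                       # 2^L leaf boxes per FMM
    L = boxes.bit_length() - 1
    # Assumed S2T count: 3 * P * 2^L * M_L^2, which equals 3 * P * M * M_L
    near_field = 3 * P * boxes * M_L ** 2
    return L, near_field

N, P = 2**27, 256
for M_L in (16, 32, 64, 128, 256):
    L, near = leaf_tradeoff(N, P, M_L)
    print(M_L, L, near)
```

Doubling the leaf box size removes one tree level but doubles the near-field count, so the raw flop count alone favours small boxes; the slide's point is that larger boxes raise arithmetic intensity, which can matter more on a GPU.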

PARAMETER DEPENDENCE: P
Number of FMMs
• Flops/Intensity approx constant
• Trade #levels for #FMMs
• Large P good
  • Fill up B-GEMM
  • More square 2D FFT
• T=Z, A=2xP100, $N = 2^{27}$, $M_L = 64$, B=3, Q=16

PARAMETER DEPENDENCE: B
Base Level
• Not very significant
• Scale to 128 GPUs w/o complications
• T=Z, A=2xP100, $N = 2^{27}$, P=256, $M_L = 64$, Q=16

PARAMETER DEPENDENCE: Q
Quadrature Order
• Weak performance dependence
• Accuracy tuning
• T=Z, A=2xP100, $N = 2^{27}$, P=256, $M_L = 64$, B=3

FUTURE
• Integration into CUFFT
• Application to 2D/3D FFTs?
• Convolutions
• NUFFT, Sparse FFT
• Volta predictions and measurements
• Mixed precision (e.g. FP16 far-field) to use Tensor Core?
• Persistent Matrix Batched GEMM (cuBLAS optimization)
• Staged Persistent Matrix Batched GEMM (cooperative groups, RNNs)

CONCLUSION
• FMM-FFT trades 2/3 communication in 1D FFT for P FMMs
• Viable on highest comp:comm architecture available
• Detailed implementation that relies heavily on existing primitives
  • Primitives >95% efficient
  • Two custom dense kernels >60% efficient
  • Entire FMM-FFT >90% efficient
• Tunable accuracy-performance tradeoff
• Compute model accurately predicts performance

May 8-11, 2017 | Silicon Valley THANK YOU
