May 8-11, 2017 | Silicon Valley
LOW-COMMUNICATION FFT WITH FAST MULTIPOLE METHOD
Cris Cecka, Senior Research Scientist
May 11, 2017
THE FAST FOURIER TRANSFORM
Operation count: $4 N \log_2 N - 6 N + 8$
SPLIT-RADIX FFT: Algorithm
SPLIT-RADIX FFT: Profile
FMM-FFT: Edelman et al., 1999
STRUCTURED DENSE MATRICES AND FMM
• SVD: $A = U D V^*$
• Low-Rank: $K = U \tilde{K}_{r \times r} V^*$
• Hierarchically LR: $K_{IJ} = U_I \tilde{K}_{IJ} V_J^*$
• H-Semi-Separable: $K_{IJ} = U_I \tilde{U}_{\tilde I} \tilde{K}_{\tilde I \tilde J} \tilde{V}_{\tilde J}^* V_J^*$
• $\mathcal{H}^2$-Matrix / FMM
FMM-FFT ALGORITHM
$M_{M,P} = \mathrm{diag}(I_M, C_1, \ldots, C_{P-1})$
$[C_p]_{mn} = \rho_p \left[ \cot\!\left( \frac{\pi}{M}\left( n - m + \frac{p}{P} \right) \right) + \imath \right]$
combined with a 2D $M \times P$ FFT
COT FMM
$[C_p]_{mn} = \rho_p \left[ \cot\!\left( \frac{\pi}{M}\left( n - m + \frac{p}{P} \right) \right) + \imath \right]$
• One-dimensional
• Uniform: integers are sources/targets
• Periodic
• Distributed
• Size M-by-M
• P of them!
• Interleaved
FMM OPERATORS
[Figure: FMM tree with operators S2M, M2M, M2L, L2L, L2T, S2T; base level B=2, leaf level L=4, rank Q, leaf box size M/2^L]
• S: "Source"
• T: "Target"
• M: "Multipole"
• L: "Local"
Each operator is an (implicit) matrix.
PARAMETERS OF THE FMM-FFT
• FFT: $N = M P$
• FMM: $(N, P, M_L, Q, B)$
• Rank: $Q$
• Base level: $B$
• Leaf box size: $M_L$
• Leaf level: $L = \log_2(M / M_L)$
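A quick worked example, assuming the configuration used later on the tuning slides ($N = 2^{27}$, $P = 256$, $M_L = 64$): each of the $P = 256$ FMMs spans $M = N/P = 2^{19} = 524{,}288$ points, so the leaf level is $L = \log_2(2^{19}/64) = 13$.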
DISTRIBUTED FMM
[Figure: per-FMM communication pattern: All2All, Gather, and Halo exchanges (1b, 2b)]
INTERPOLATIVE FMM
$C_{IJ,ij} \approx \ell_m(t_{I,i})\, \ell_n(s_{J,j})\, \ell_q(z_m)\, C(z_q, z_r)\, \ell_r(z_n)$ (implicit sums over repeated indices)
$\ell_i(z) = \prod_{\substack{0 \le k < Q \\ k \ne i}} \frac{z - z_k}{z_i - z_k}, \qquad z_j = \cos\!\left( \frac{(2j+1)\pi}{2Q} \right)$
• Same operators across all boxes
• Same operators across all levels
• Almost the same operators across all FMMs
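To make the interpolation concrete, here is a minimal host-side sketch (illustrative, not the talk's code) that builds the Chebyshev nodes, the Lagrange basis, and the leaf operator $S2M_{qm} = \ell_q(s_m)$ used on the S2M/L2T slide below; the function names and the column-major layout are assumptions.

```cpp
#include <cmath>
#include <vector>

// Chebyshev nodes z_j = cos((2j+1) pi / (2Q)) on [-1, 1].
std::vector<double> chebyshev_nodes(int Q) {
  const double pi = std::acos(-1.0);
  std::vector<double> z(Q);
  for (int j = 0; j < Q; ++j)
    z[j] = std::cos((2 * j + 1) * pi / (2 * Q));
  return z;
}

// Lagrange basis l_i(x) = prod_{k != i} (x - z_k) / (z_i - z_k).
double lagrange(const std::vector<double>& z, int i, double x) {
  double li = 1.0;
  for (int k = 0; k < (int)z.size(); ++k)
    if (k != i) li *= (x - z[k]) / (z[i] - z[k]);
  return li;
}

// Leaf operator S2M_{qm} = l_q(s_m) with s_m = -1 + (2m+1)/M_L.
// Stored column-major (Q x M_L, ld = Q); the same matrix serves every leaf box.
std::vector<double> build_S2M(int Q, int M_L) {
  auto z = chebyshev_nodes(Q);
  std::vector<double> S2M(Q * M_L);
  for (int m = 0; m < M_L; ++m)
    for (int q = 0; q < Q; ++q)
      S2M[q + m * Q] = lagrange(z, q, -1.0 + (2.0 * m + 1.0) / M_L);
  return S2M;
}
```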
TENSOR REPRESENTATIONS
$A_{ijk\ell} := A[\, i + j \cdot \mathrm{ldA}\langle 1 \rangle + k \cdot \mathrm{ldA}\langle 2 \rangle + \ell \cdot \mathrm{ldA}\langle 3 \rangle \,]$
• Input: $S_n \equiv S_{pm} \equiv S_{pmb}$
• Output: $T_n \equiv T_{pm} \equiv T_{pmb}$
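A small sketch of this indexing convention (names are illustrative): since the P FMMs are interleaved, p is the fastest-varying index, so a natural layout for $S_{pmb}$ uses stride P for m and stride P*M_L for b.

```cpp
#include <cstddef>

// Flat index for A_{ijkl} := A[i + j*ld1 + k*ld2 + l*ld3] (illustrative names).
inline std::size_t idx4(std::size_t i, std::size_t j, std::size_t k, std::size_t l,
                        std::size_t ld1, std::size_t ld2, std::size_t ld3) {
  return i + j * ld1 + k * ld2 + l * ld3;
}

// Assumed view of the length-N input as S_{pmb}
// (p: FMM index, m: point within leaf box, b: leaf box).
// p is fastest because the P FMMs are interleaved, so ld1 = P, ld2 = P * M_L.
inline std::size_t idxS(std::size_t p, std::size_t m, std::size_t b,
                        std::size_t P, std::size_t M_L) {
  return p + m * P + b * (P * M_L);
}
```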
S2M / L2T
$s_m = -1 + \frac{2m+1}{M_L}, \qquad S2M_{qm} = \ell_q(s_m)$
$M^{L}_{pqb} = S2M_{qm}\, S_{pmb}$
Computed with a single BatchedGEMM.
BATCHED MATRIX-MATRIX MULTIPLY
cublas<T>gemmStridedBatched in cuBLAS 8.0
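A host-side sketch (assumed names and layouts, not the talk's code) of how the leaf S2M contraction maps onto this primitive: for each leaf box b, $M_{pq}[b] = S_{pm}[b]\, S2M^{T}_{mq}$ (next slide), batched over the $2^L$ leaf boxes with fixed strides; a stride of 0 lets every box reuse the same interpolation matrix.

```cpp
#include <cublas_v2.h>
#include <cuComplex.h>

// One strided-batched GEMM computes M_{pqb} = S2M_{qm} * S_{pmb} for all boxes:
//   for each box b:  M[b] (P x Q) = S[b] (P x M_L) * S2M^T (M_L x Q).
void s2m_batched(cublasHandle_t handle,
                 const cuDoubleComplex* d_S,    // S_{pmb}, p fastest, ld = P
                 const cuDoubleComplex* d_S2M,  // Q x M_L, ld = Q (real values
                                                // stored as complex for one Z-GEMM)
                 cuDoubleComplex*       d_M,    // M_{pqb}, p fastest, ld = P
                 int P, int M_L, int Q, int num_boxes) {
  const cuDoubleComplex one  = make_cuDoubleComplex(1.0, 0.0);
  const cuDoubleComplex zero = make_cuDoubleComplex(0.0, 0.0);
  cublasZgemmStridedBatched(
      handle,
      CUBLAS_OP_N, CUBLAS_OP_T,          // A = S[b], op(B) = S2M^T
      P, Q, M_L,                         // C is P x Q, inner dimension M_L
      &one,
      d_S,   P, (long long)P * M_L,      // stride between boxes of S
      d_S2M, Q, 0,                       // stride 0: one shared S2M
      &zero,
      d_M,   P, (long long)P * Q,        // stride between boxes of M
      num_boxes);
}
```

The L2T, M2M, and L2L stages map onto the same primitive with different operand shapes and strides.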
S2M / L2T
$M_{pqb} = S2M_{qm}\, S_{pmb} \;\Longrightarrow\; M_{pq}[b] = S_{pm}[b]\, S2M^{T}_{mq}$
$T_{pmb} = L2T_{mq}\, L_{pqb} \;\Longrightarrow\; T_{pm}[b] = L_{pq}[b]\, S2M_{qm}$
M2M / L2L
$M2M^{\pm}_{qk} = \ell_q\!\left( \frac{z_k \pm 1}{2} \right)$
$M^{\ell}_{pqb} = M2M^{\pm}_{qk}\, M^{\ell+1}_{pk(2b)}$
$L^{\ell+1}_{pq(2b)} \mathrel{+}= L2L_{qk}\, L^{\ell}_{pkb}$
Computed with a single BatchedGEMM.
S2T / M2L
$T_{pib} = S2T_{p(j-i)}\, S_{pjb}, \qquad S2T_{pk} = \begin{cases} \cot\!\left( \frac{\pi}{N}(p + Pk) \right) & p > 0 \\ \delta_{k0} & p = 0 \end{cases}$
$L^{\ell}_{pib} = M2L^{\ell}_{pijs}\, M^{\ell}_{pj(b+s)}, \qquad M2L^{\ell}_{pijs} = \cot\!\left( \frac{\pi}{2^{\ell}}\left( \frac{z_j}{2} - \frac{z_i}{2} + s \right) + \frac{\pi}{N}(p+1) \right)$
• Also Level-3 linear algebra computations, but no BLAS primitives
• Custom kernels
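For flavor, a naive CUDA sketch of the S2T near-field stage under the layouts assumed earlier (one thread per target point, with the Toeplitz values $S2T_{pk}$ precomputed into a per-p table of $4M_L - 1$ entries, matching the storage count on the next slide). This is not the talk's kernel; the real one would tile sources through shared memory to reach the quoted 60% efficiency.

```cuda
#include <cuComplex.h>

// T_{pib} = sum over the 3 neighbor boxes of S2T_{p,k} * S_{p,source}, where
// k = (source offset) - (target offset) and s2t[p][k + 2*M_L - 1] holds the
// precomputed Toeplitz values (for p = 0 the table is just a delta at k = 0).
// Layout assumption: X_{pmb} stored as X[p + P*(m + M_L*b)] (p fastest).
__global__ void s2t_naive(const cuDoubleComplex* __restrict__ S,    // S_{pmb}
                          const cuDoubleComplex* __restrict__ s2t,  // [P][4*M_L-1]
                          cuDoubleComplex* __restrict__ T,          // T_{pib}
                          int P, int M_L, int num_boxes) {
  int p = blockIdx.y * blockDim.y + threadIdx.y;       // which FMM
  int n = blockIdx.x * blockDim.x + threadIdx.x;       // which target point
  if (p >= P || n >= M_L * num_boxes) return;
  int i = n % M_L, b = n / M_L;

  cuDoubleComplex acc = make_cuDoubleComplex(0.0, 0.0);
  for (int db = -1; db <= 1; ++db) {                   // self + two neighbors
    int bs = (b + db + num_boxes) % num_boxes;         // periodic wrap
    for (int m = 0; m < M_L; ++m) {                    // sources in box bs
      int k = m + db * M_L - i;                        // Toeplitz offset
      cuDoubleComplex op  = s2t[p * (4 * M_L - 1) + (k + 2 * M_L - 1)];
      cuDoubleComplex src = S[p + P * (m + M_L * bs)];
      acc = cuCadd(acc, cuCmul(op, src));
    }
  }
  T[p + P * (i + M_L * b)] = acc;
}
```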
INTERPOLATIVE FMM
Operator   Storage              Compute
S2T        $P(4M_L - 1)$        $3 P \, 2^L M_L^2$
S2M        $Q M_L$              $2 P M Q$
M2M        $2 Q^2$              $4 (2^L - 2^B) P Q^2$
M2L        $4 (L - B) P Q^2$    $3 (2^L - 2^B) P Q^2$
L2L        $2 Q^2$              $4 (2^L - 2^B) P Q^2$
L2T        $Q M_L$              $2 P M Q$
ALGORITHM
PROFILE
FMM-FFT PROFILE
[Figure: runtime profile broken down by stage: Halo, 2D FFT, S2M, M2M, S2T, L2L, L2T, M2L]
2xK40c FMM-FFT
2xP100 FMM-FFT
8xP100 FMM-FFT
FMM BREAKDOWN: Components
• T = ComplexDouble, A = 2xP100
• B-GEMM and S2T dominate
• Small N: latency; use 1 level
• Large N: compute
EFFICIENCY
• BatchedGEMM: >95%
• S2T/M2L custom kernels: 60%
• Overall FMM-FFT: >90%
PARAMETER DEPENDENCE: M_L (points per box per FMM)
• Trade #levels for S2T compute
• Flop count alone is not enough: increase the arithmetic intensity
• Tuned performance at M_L = 64
• T = Z, A = 2xP100, N = 2^27, P = 256, B = 3, Q = 16

PARAMETER DEPENDENCE: P (number of FMMs)
• Flops/intensity approximately constant
• Trade #levels for #FMMs
• Large P is good: fills up the B-GEMM and yields a more square 2D FFT
• T = Z, A = 2xP100, N = 2^27, M_L = 64, B = 3, Q = 16

PARAMETER DEPENDENCE: B (base level)
• Not very significant
• Scales to 128 GPUs without complications
• T = Z, A = 2xP100, N = 2^27, P = 256, M_L = 64, Q = 16

PARAMETER DEPENDENCE: Q (quadrature order)
• Weak performance dependence
• Used for accuracy tuning
• T = Z, A = 2xP100, N = 2^27, P = 256, M_L = 64, B = 3
FUTURE
• Integration into CUFFT
• Application to 2D/3D FFTs?
• Convolutions
• NUFFT, sparse FFT
• Volta predictions and measurements
• Mixed precision (e.g. FP16 far field) to use Tensor Cores?
• Persistent-matrix Batched GEMM (cuBLAS optimization)
• Staged persistent-matrix Batched GEMM (cooperative groups, RNNs)
CONCLUSION
• FMM-FFT trades 2/3 of the communication of a 1D FFT for P FMMs
• Viable on the architectures with the highest compute:communication ratio available
• Detailed implementation that relies heavily on existing primitives
• Primitives >95% efficient
• Two custom dense kernels >60% efficient
• Entire FMM-FFT >90% efficient
• Tunable accuracy-performance tradeoff
• Compute model accurately predicts performance
May 8-11, 2017 | Silicon Valley
THANK YOU