Parameter Tuning of a Hybrid Treecode-FMM on GPUs
Rio Yokota, Lorena Barba
Department of Mechanical Engineering, Boston University
Saturday, June 4, 2011
Previous Calculations
DEGIMA cluster, Nagasaki University: GTX295 × 380 = 760 GPUs (two GPUs per card), theoretical peak 700 TFlops, power consumption 100 kW
FMM, turbulence (Yokota & Barba): $N = 3 \times 10^9$, 6 sec, 40 TFlops
Treecode, astrophysics (Nitadori & Hamada): $N = 3 \times 10^9$, 20 sec, 100 TFlops
Treecode & FMM
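Multipole expansion
$$\frac{1}{x_i - x_j} = \frac{1}{x_i - x_*} \cdot \frac{1}{1 - \frac{x_j - x_*}{x_i - x_*}}$$
and the Taylor expansion (valid for $|t| < 1$, i.e. $|x_j - x_*| < |x_i - x_*|$)
$$\frac{1}{1 - t} \approx \sum_{k=0}^{p-1} t^k$$
gives
$$\sum_{j=1}^{N} \frac{1}{x_i - x_j} \approx \sum_{j=1}^{N} \frac{1}{x_i - x_*} \sum_{k=0}^{p-1} \left( \frac{x_j - x_*}{x_i - x_*} \right)^k = \sum_{k=0}^{p-1} (x_i - x_*)^{-k-1} \sum_{j=1}^{N} (x_j - x_*)^k$$
Direct summation, $O(N^2)$:
    do i = 1,N
      ff = 0
      do j = 1,N
        if ( j /= i ) ff = ff+1/( x(i)-x(j) )   ! skip the singular self term
      end do
      f(i) = ff
    end do
Expansion about the center $x_*$ (stored in xs), $O(Np)$; g(k) holds the moment $\sum_j (x_j - x_*)^{k-1}$:
    do k = 1,p
      gg = 0
      do j = 1,N
        gg = gg+( x(j)-xs )**( k-1 )
      end do
      g(k) = gg
    end do
    do i = 1,N
      ff = 0
      do k = 1,p
        ff = ff+( x(i)-xs )**( -k )*g( k )   ! (x_i - x_*)^(-k) pairs with the moment of order k-1
      end do
      f(i) = ff
    end do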
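The two loop structures on this slide can be checked numerically. Below is a minimal Python sketch of the same 1-D example, not the authors' code; the function names, the source cluster in [-0.5, 0.5] around $x_* = 0$, and the targets in [3, 4] are illustrative assumptions chosen so that $|x_j - x_*| < |x_i - x_*|$ holds for every pair.

```python
import random


def direct_sum(targets, sources):
    """Direct O(N*M) evaluation of f(x_i) = sum_j 1/(x_i - x_j)."""
    return [sum(1.0 / (xi - xj) for xj in sources) for xi in targets]


def expansion_sum(targets, sources, xs, p):
    """O((N+M)p) evaluation with a p-term expansion about the center xs."""
    # Moments g[k] = sum_j (x_j - xs)^k for k = 0..p-1, computed once for all targets.
    g = [sum((xj - xs) ** k for xj in sources) for k in range(p)]
    # f(x_i) ~ sum_k (x_i - xs)^(-k-1) * g[k]
    return [sum((xi - xs) ** (-k - 1) * g[k] for k in range(p)) for xi in targets]


def max_rel_error(p, seed=0):
    """Largest relative error of the expansion over all targets (hypothetical setup)."""
    rng = random.Random(seed)
    sources = [rng.uniform(-0.5, 0.5) for _ in range(200)]  # cluster around xs = 0
    targets = [rng.uniform(3.0, 4.0) for _ in range(50)]    # well separated from the cluster
    exact = direct_sum(targets, sources)
    approx = expansion_sum(targets, sources, 0.0, p)
    return max(abs(a - e) / abs(e) for a, e in zip(approx, exact))


for p in (2, 4, 8):
    print(p, max_rel_error(p))
```

Because every source here satisfies $|x_j - x_*| / |x_i - x_*| \le 1/6$ and all terms of the sum share a sign, the relative error is bounded by $(1/6)^p$ and drops quickly as p grows.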
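Error Control
(Figure: treecode vs. FMM)
Complexity: $O(p^3 \theta^{-3})$
Error: $O(\theta^p)$
where $p$ is the order of the expansion and $\theta$ is the multipole acceptance criterion (opening angle).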
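Where the $O(\theta^p)$ error comes from can be seen in the 1-D expansion of the multipole slide. The derivation below is a 1-D illustration, not the 3-D bound, and assumes that the acceptance criterion guarantees $|t| \le \theta < 1$ with $t = (x_j - x_*)/(x_i - x_*)$; the truncation error is the tail of the geometric series.

```latex
\begin{aligned}
\frac{1}{x_i - x_j} - \frac{1}{x_i - x_*}\sum_{k=0}^{p-1} t^k
  &= \frac{1}{x_i - x_*}\sum_{k=p}^{\infty} t^k
   = \frac{1}{x_i - x_*}\,\frac{t^p}{1 - t},
  \qquad t = \frac{x_j - x_*}{x_i - x_*} \\
\left|\frac{1}{x_i - x_j} - \frac{1}{x_i - x_*}\sum_{k=0}^{p-1} t^k\right|
  &\le \frac{1}{|x_i - x_*|}\,\frac{\theta^p}{1 - \theta}
   = O(\theta^p) \quad \text{for } |t| \le \theta < 1
\end{aligned}
```

Raising p or lowering θ both shrink this bound, while the cost model $O(p^3 \theta^{-3})$ grows in both cases, which is the trade-off the parameter tuning has to balance.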
Error Optimization
Optimize p
Optimize θ
Which is better on GPUs?
Stack-based hybrid treewalk
(Figure: target and source cells being pushed onto and popped from a stack)
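A minimal Python sketch of a stack-based dual tree traversal. It assumes a 1-D binary tree instead of a 3-D octree, an acceptance criterion of the form $(r_t + r_s) < \theta\,d$, and a hypothetical hybrid rule that labels a well-separated pair M2P (treecode-style) when the target cell is a leaf and M2L (FMM-style) otherwise. All names (`Cell`, `build_tree`, `treewalk`, `ncrit`) are illustrative; only the interaction lists are produced, no expansions are evaluated.

```python
import random
from dataclasses import dataclass, field


@dataclass
class Cell:
    idx: list                      # indices of the particles in this cell
    center: float
    radius: float
    children: list = field(default_factory=list)

    def is_leaf(self):
        return not self.children


def build_tree(x, idx, ncrit):
    """Bisect the bounding interval until a cell holds at most ncrit particles."""
    lo, hi = min(x[i] for i in idx), max(x[i] for i in idx)
    cell = Cell(idx=idx, center=0.5 * (lo + hi), radius=0.5 * (hi - lo))
    if len(idx) > ncrit:
        left = [i for i in idx if x[i] <= cell.center]
        right = [i for i in idx if x[i] > cell.center]
        if left and right:  # no split possible if all coordinates coincide
            cell.children = [build_tree(x, left, ncrit), build_tree(x, right, ncrit)]
    return cell


def treewalk(root, theta):
    """Dual tree traversal driven by an explicit stack of (target, source) pairs."""
    near, far = [], []
    stack = [(root, root)]
    while stack:
        t, s = stack.pop()
        if t.radius + s.radius < theta * abs(t.center - s.center):
            # Well separated: approximate with an expansion (hypothetical hybrid rule).
            far.append(("M2P" if t.is_leaf() else "M2L", t, s))
        elif t.is_leaf() and s.is_leaf():
            near.append((t, s))  # P2P: direct summation between two leaves
        elif s.is_leaf() or (not t.is_leaf() and t.radius >= s.radius):
            stack.extend((c, s) for c in t.children)  # split the target cell
        else:
            stack.extend((t, c) for c in s.children)  # split the source cell
    return near, far


rng = random.Random(1)
x = [rng.random() for _ in range(300)]
root = build_tree(x, list(range(len(x))), ncrit=8)
near, far = treewalk(root, theta=0.5)
print(len(near), "P2P pairs,", len(far), "far-field pairs")
```

Because each pop either settles a pair or replaces it by pairs that partition one of its cells, every (target, source) particle pair ends up in exactly one list; the stack removes the need for recursion, which is one reason such a walk maps onto simple loops.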
Parameter study
Complexity: $O(p^3 \theta^{-3})$
Error: $O(\theta^p)$
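Under the asymptotic model on this slide with every constant set to 1 (an assumption; real constants differ between kernels and between CPUs and GPUs), the cheapest pair $(p, \theta)$ for a target error $\varepsilon$ follows from a short sweep: for each p the largest admissible opening angle is $\theta = \varepsilon^{1/p}$, and the cost is then $p^3 \varepsilon^{-3/p}$. Minimizing $3\ln p - (3/p)\ln\varepsilon$ over continuous p gives $p = \ln(1/\varepsilon)$ and $\theta = e^{-1} \approx 0.37$. The function name `optimum` is hypothetical.

```python
import math


def optimum(eps, pmax=40):
    """Return (p, theta, cost) minimizing cost = p^3 / theta^3 subject to theta^p <= eps."""
    best = None
    for p in range(1, pmax + 1):
        theta = eps ** (1.0 / p)    # largest theta that meets the error target
        cost = p ** 3 / theta ** 3  # O(p^3 theta^-3) with a unit constant
        if best is None or cost < best[2]:
            best = (p, theta, cost)
    return best


for eps in (1e-3, 1e-6, 1e-9):
    p, theta, cost = optimum(eps)
    print(f"eps={eps:g}: p={p}, theta={theta:.3f}, ln(1/eps)={math.log(1 / eps):.1f}")
```

In this unit-constant model the optimum θ stays near 0.37 and only p grows with the accuracy requirement; measured timings are what supply the real constants and move the optimum.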
Optimum parameters
Parallel calculation
DEGIMA, Nagasaki University: 760 GTX295 GPUs, peak 0.7 PFlops
TSUBAME 2.0, Tokyo Institute of Technology: 4224 M2050 GPUs, peak 2.4 PFlops
Strong Scaling, $N = 10^8$
(Figure: time × N procs [s] vs. N procs from 1 to 512, on DEGIMA and on TSUBAME 2.0, broken down into tree construction, mpisendp2p, mpisendm2l, P2Pkernel, P2Mkernel, M2Mkernel, M2Lkernel, L2Lkernel and L2Pkernel)
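The quantity on the vertical axis, time × N procs, is the total processor-time, so under ideal strong scaling it stays constant as N procs grows. A minimal sketch of turning such data into a parallel efficiency relative to the smallest run; the function name and the sample timings are hypothetical, not values read from the figure.

```python
def strong_scaling_efficiency(nprocs, times):
    """Efficiency E(P) = (P0 * T(P0)) / (P * T(P)), relative to the first (smallest) run."""
    base = nprocs[0] * times[0]  # processor-time of the reference run
    return [base / (p * t) for p, t in zip(nprocs, times)]


# Hypothetical timings in seconds (not measured data).
nprocs = [1, 2, 4, 8]
times = [80.0, 40.0, 21.0, 12.5]
for p, e in zip(nprocs, strong_scaling_efficiency(nprocs, times)):
    print(f"{p:4d} procs: efficiency {e:.2f}")
```

A bar that rises with N procs corresponds to an efficiency below 1, and the stacked breakdown then shows which stage, for example MPI communication or tree construction, accounts for the loss.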
Weak Scaling
(Figure: weak scaling on TSUBAME 2.0, reaching 0.5 PFlops)
Summary & Outlook
1. The stack-based treewalk enables a simple but effective hybridization of treecodes and FMMs.
2. The optimum p and θ are different on CPUs and GPUs, but this difference is small.
3. More tests need to be performed in the higher-accuracy range, and the overall performance must be compared with other treecodes and FMMs.