Tsunami simulation on FPGA/GPU and its analysis based on Statistical Model Checking
Masahiro Fujita
VLSI Design and Education Center (VDEC), University of Tokyo
2 Outline
• What Tsunami simulation means in this talk
• Acceleration with FPGA/GPU
  – Based on stream processing (pipelining) with loop unrolling
  – Based on parallel processing for decomposed regions
• (Formal verification of those implementations)
  – (Equivalence checking between the FPGA/GPU implementation and the original program in C/Fortran)
  – Just show our strategy
• Statistical model checking
  – On software in Fortran
  – Acceleration with FPGA/GPU
3 Motivation
• Based on the values of many geographically distributed earthquake sensors (wired/wireless), compute how a Tsunami wave will propagate
• Goal: realize supercomputer-level performance in Tsunami simulation with FPGA/GPU — compute/predict the Tsunami as fast and as accurately as possible
(Flow: earthquake sensors → generate initial wave from sensor data → propagate the wave by numerically solving partial differential equations)
4 Tsunami simulation
• Tsunami simulation algorithm: find solutions of the fluid dynamics equations
  – Law of conservation of mass
  – Law of conservation of momentum, with and without bottom friction
• Solved with known boundary conditions and bathymetric input of the region
• Here the above is processed by numerically solving sets of partial differential equations with finite difference methods
5 Partial differential equations to be solved
(Slide shows the mass conservation equation and the momentum equations along the X- and Y-axes, without bottom friction; see the next slide for the equations themselves. Symbols:)
• η = vertical displacement of water above still water (wave height)
• D = total water depth = h + η
• g = acceleration due to gravity
• A = horizontal eddy viscosity
• τ = friction along the x or y direction
• M = water flux discharge along the X direction
• N = water flux discharge along the Y direction
Reference: Tsunami Modeling Manual by Prof. Nobuo Shuto
6 Here we use a simplified model: the linear one (valid if the sea depth is large enough)
• Shallow Water Theory (Long Wave Theory)

Mass conservation:
$$\frac{\partial \eta}{\partial t} + \frac{\partial M}{\partial x} + \frac{\partial N}{\partial y} = 0$$

Momentum conservation (X- and Y-axes, with Manning bottom friction):
$$\frac{\partial M}{\partial t} + \frac{\partial}{\partial x}\left(\frac{M^2}{D}\right) + \frac{\partial}{\partial y}\left(\frac{MN}{D}\right) + gD\,\frac{\partial \eta}{\partial x} + \frac{gn^2}{D^{7/3}}\,M\sqrt{M^2+N^2} = 0$$
$$\frac{\partial N}{\partial t} + \frac{\partial}{\partial x}\left(\frac{MN}{D}\right) + \frac{\partial}{\partial y}\left(\frac{N^2}{D}\right) + gD\,\frac{\partial \eta}{\partial y} + \frac{gn^2}{D^{7/3}}\,N\sqrt{M^2+N^2} = 0$$

where $\eta$: wave height, $D$: total water depth, $g$: gravity, $n$: Manning roughness coefficient, $M, N$: flux along $x$, $y$.

• Linear Long Wave Theory: dropping the advection and bottom-friction terms gives
$$\frac{\partial \eta}{\partial t} + \frac{\partial M}{\partial x} + \frac{\partial N}{\partial y} = 0,\qquad \frac{\partial M}{\partial t} + gh\,\frac{\partial \eta}{\partial x} = 0,\qquad \frac{\partial N}{\partial t} + gh\,\frac{\partial \eta}{\partial y} = 0$$
with the total depth approximated by the still water depth $h$.
7 Finite difference methods
• Solution of the mass conservation equation based on the finite difference method:

Z(i,j,t+1) = Z(i,j,t) - (dt/dx) * ( M(i,j,t) - M(i-1,j,t) + N(i,j,t) - N(i,j-1,t) )

which discretizes $\frac{\partial \eta}{\partial t} + \frac{\partial M}{\partial x} + \frac{\partial N}{\partial y} = 0$, where
• i, j = x, y grid coordinates
• Z(i,j,t) = water surface level at time t; Z(i,j,t+1) = water surface level at time t+dt
• H(i,j,t) = still water depth at time t
• dt = temporal step, dx = spatial step
• M(i,j,t), N(i,j,t) = water flux discharge along the x-/y-axis at time t
(In the code the two time levels t and t+dt are apparently stored as array indices 1 and 2, which is where the M(i,j,1) / Z(i,j,2) notation on the original slide comes from.)
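A minimal C sketch of this wave-height update follows. Assumptions: the array names, float type, and boundary handling are illustrative, not taken from the actual TUNAMI-N1 source; the grid dimensions are those given on slide 11.

    #define NX 1040   /* grid size from slide 11; layout is an assumption */
    #define NY 668

    /* One mass-conservation (wave height) update over the inner grid.
       Boundary cells are assumed to be handled by the open boundary step. */
    void update_wave_height(float Znew[NX][NY], const float Z[NX][NY],
                            const float M[NX][NY], const float N[NX][NY],
                            float dt, float dx)
    {
        const float r = dt / dx;
        for (int i = 1; i < NX; i++) {          /* raster scan */
            for (int j = 1; j < NY; j++) {
                /* Z(i,j,t+1) = Z(i,j,t) - (dt/dx) *
                   (M(i,j,t) - M(i-1,j,t) + N(i,j,t) - N(i,j-1,t)) */
                Znew[i][j] = Z[i][j]
                           - r * (M[i][j] - M[i-1][j] + N[i][j] - N[i][j-1]);
            }
        }
    }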
8 Wave Height Computation
(Timeline diagram: mass conservation (wave height update: H, Z) and momentum conservation (M, N update) alternate on a staggered time grid; each full step advances the simulation by 1 second, taking M, N at time T through the H, Z update to M, N at time T+1.)
9 Target Tsunami Simulator
• "TUNAMI N1" program in FORTRAN
  – Developed by Tohoku University
(Dataflow diagram: the standing water depth H and the earthquake source info give the initial wave height Z0; then, iterating with 1-second time steps, the mass conservation computation updates the wave height Z and the momentum conservation computation updates the momenta M, N; 1040 grids, 1 grid = 1 km x 1 km.)
10 C Implementation (base program)
• The mass and momentum functions are computed alternately
  – Each function raster-scans the grid
• Since there is no data dependency between the computations at different grid points, they can be parallelized, as sketched below
(Diagram: mass conservation (Z update) and momentum conservation (M, N update); no dependency within each phase → parallelizable.)
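As a sketch of this alternating structure (hedged: update_Z, update_MN, and apply_open_boundary are hypothetical names standing in for the real routines; the OpenMP pragma only illustrates the grid-level parallelism claimed above, and NX/NY are as defined in the earlier sketch):

    void update_Z(int i, int j);        /* hypothetical Z update          */
    void update_MN(int i, int j);       /* hypothetical M,N update        */
    void apply_open_boundary(void);     /* hypothetical boundary handling */

    void simulate(int steps)            /* steps = 7200 per slide 11 */
    {
        for (int t = 0; t < steps; t++) {
            /* mass conservation: grid points are independent,
               so the raster scan can run in parallel */
            #pragma omp parallel for
            for (int i = 0; i < NX; i++)
                for (int j = 0; j < NY; j++)
                    update_Z(i, j);

            apply_open_boundary();

            /* momentum conservation: likewise independent per grid point */
            #pragma omp parallel for
            for (int i = 0; i < NX; i++)
                for (int j = 0; j < NY; j++)
                    update_MN(i, j);
        }
    }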
11 Speed of the N1 simulation program
• The original Fortran program has been manually converted into C
  – C is 4 times faster than Fortran in our environment
  – The C version is the base simulator
• Size of simulation area
  – Grid width: 1 km
  – Number of grids: 1040 x 668
• Simulated time
  – 1 time step = 1 sec
  – 7,200 steps computed (2 hours)
• Tsunami simulation time on an Intel microprocessor (i7 @ 2.93 GHz, single core) → 78.7 sec
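For scale, these figures imply a baseline single-core throughput of roughly (arithmetic derived from the numbers above, not stated on the slide):
$$1040 \times 668 \times 7200 \approx 5.0 \times 10^{9}\ \text{cell updates},\qquad \frac{5.0 \times 10^{9}}{78.7\ \text{s}} \approx 6.4 \times 10^{7}\ \text{cell updates/s}$$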
12 Profiling of the computation time of the software
(Flowchart: Start → Input → Initial condition → main loop for t = 1..T { Mass Conservation → Open Boundary Condition → Momentum Conservation } → Stop.)
(Chart: "Simulation Cycles and Computation time of TUNAMI" — computation time [sec] versus simulation cycles t from 3600 to 21600, broken down into momentum conservation, mass conservation, open boundary, and initialization.)
13 Co-execution
• C on microprocessor:
  – Read earthquake information from files
  – Compute initial wave
  – Load initial wave to the DRAM memory of the FPGA board
  – Run the FPGA (CPU idle while the FPGA computes)
  – Read data from DRAM
  – Store results into files
• FPGA/GPU:
  – Main loop: Mass Conservation → Open Boundary Condition → Momentum Conservation, repeated
  – Store results to DRAM
A host-side sketch of this flow appears below.
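In this sketch, all fpga_* calls and the file-handling helpers are hypothetical placeholders for a vendor-specific driver API; no real library is implied, and NX/NY are as defined earlier.

    #include <stddef.h>

    /* hypothetical driver interface (assumed, not a real API) */
    void fpga_write_dram(const void *buf, size_t bytes);
    void fpga_run(int num_steps);
    void fpga_read_dram(void *buf, size_t bytes);

    /* hypothetical file I/O helpers */
    void read_earthquake_info(const char *path, float z[NX][NY]);
    void compute_initial_wave(float z[NX][NY]);
    void write_results(const char *path, const float z[NX][NY]);

    int main(void)
    {
        static float z0[NX][NY], result[NX][NY];

        read_earthquake_info("source.dat", z0); /* read sensor/source files  */
        compute_initial_wave(z0);               /* initial wave height Z0    */

        fpga_write_dram(z0, sizeof z0);         /* load wave to board DRAM   */
        fpga_run(7200);                         /* FPGA runs the main loop;  */
                                                /* the CPU is idle meanwhile */
        fpga_read_dram(result, sizeof result);  /* read results back         */

        write_results("out.dat", result);       /* store results into files  */
        return 0;
    }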
14 Pipeline processing for higher throughput
• Latency
  – After receiving an input, how many cycles are required to generate its output
• Throughput
  – How frequently input data can be processed
• Goal: faster throughput = larger number of pipeline stages
(Diagram: without pipelining, iterations 1..N run back-to-back, each paying the full latency; with pipelining, iterations overlap across the pipeline stages and a new iteration starts every initiation interval, which determines the throughput.)
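In standard pipeline arithmetic (implied by the diagram, not written on the slide), with latency $L$, initiation interval $II$, and $N$ iterations:
$$T_{\text{total}} \approx L + (N-1)\cdot II,\qquad \text{so for large } N:\ \frac{T_{\text{total}}}{N} \to II$$
Hence throughput is set by the initiation interval, not the latency: deeper pipelines can shrink $II$ even when they increase $L$.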
15 Typical way to build larger pipelines
• Usually each loop becomes one pipeline
  – Multiple loops should be merged as much as possible
• The number of pipeline stages depends on the length of each iteration
  – Better to have larger loops
• Various loop optimizations have been proposed
  – Formal analysis becomes possible with such transformations

Example (reconstructed from the slide; the transformation fuses the two loop nests):

    /* original */
    for (i = 0; i < N; i++)
      for (j = 0; j < N; j++)
        A[i][j] += u[i] * v[j];          /* S0(i,j)   */
    for (i = 0; i < N; i++)
      for (j = 0; j < N; j++)
        x[i] += A[j][i] * y[j];          /* S1(i',j') */

    /* transformed */
    for (t1 = 0; t1 < N; t1++)
      for (t2 = 0; t2 < N; t2++) {
        A[t2][t1] += u[t2] * v[t1];      /* S0 */
        x[t1] += A[t2][t1] * y[t2];      /* S1 */
      }

(Diagram: dependences between the S0(i,j) and S1(i',j') instances.)
[Pluto 08] U. Bondhugula, et al., "A Practical Automatic Polyhedral Parallelizer and Locality Optimizer," ACM PLDI'08, 2008.
16 Transforming latency-based to throughput-based computation
• Stream-based programming
  – Communication/buffering becomes explicit
  – Easier for formal analysis as well
• Works for both FPGA and GPU
  – And also for many-cores
(Diagram: a microprocessor with instruction and data memories illustrates the "instruction, latency-based" style; a hardware pipeline of processing elements P1 → P2 → P3 with local memories illustrates the "data, throughput-based" style.)
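A minimal C sketch of the explicit communication/buffering this slide refers to: the ring-buffer FIFO is an illustrative stand-in for an on-chip stream channel between two pipeline stages, not the authors' implementation.

    #include <stddef.h>

    #define FIFO_DEPTH 64

    typedef struct {
        float  buf[FIFO_DEPTH];
        size_t head, tail;   /* producer writes head, consumer reads tail */
    } fifo_t;

    /* producer side: returns 0 (stall) when the channel is full */
    static int fifo_push(fifo_t *f, float v)
    {
        size_t next = (f->head + 1) % FIFO_DEPTH;
        if (next == f->tail) return 0;
        f->buf[f->head] = v;
        f->head = next;
        return 1;
    }

    /* consumer side: returns 0 (stall) when the channel is empty */
    static int fifo_pop(fifo_t *f, float *v)
    {
        if (f->tail == f->head) return 0;
        *v = f->buf[f->tail];
        f->tail = (f->tail + 1) % FIFO_DEPTH;
        return 1;
    }

Making the buffering explicit like this is what turns the instruction/latency view into the data/throughput view: each stage sees only its input and output channels.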
17 Introducing streams
• Stream = sequence of data, functions, or combined
  – One stream for each region
  – One stream for each time step
• Formal equivalence checking with manipulation of the transformation matrix
(Diagram: M, N, H, Z data objects flow from time t to t+1; the original program code itself can be viewed as a single stream.)
18 Strategy for GPU implementation
• As shown earlier, the stream is organized per region (see the sketch after this list)
  – Easier and more efficient for GPU
  – But depends on the memory access architecture of the target GPU system
• Essentially, the area where the Tsunami should be simulated is decomposed into a set of small regions
  – Each GPU core is in charge of one region
  – Straightforward parallelism
  – Pipelined computation inside each core
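A sketch of the region decomposition in C (RX, RY, and update_cell are illustrative assumptions; on a GPU each region would map to one core or thread block rather than iterations of a sequential loop):

    #define RX 8                    /* regions along x (assumption) */
    #define RY 4                    /* regions along y (assumption) */

    void update_cell(int i, int j); /* hypothetical per-cell update */

    /* one worker ("core") owns one region and scans it in a
       pipelined fashion */
    void step_region(int rx, int ry)
    {
        int x0 = rx * (NX / RX), x1 = x0 + NX / RX;
        int y0 = ry * (NY / RY), y1 = y0 + NY / RY;
        for (int i = x0; i < x1; i++)
            for (int j = y0; j < y1; j++)
                update_cell(i, j);
    }

    void step_all_regions(void)
    {
        /* regions are independent within a phase; on a GPU this
           double loop becomes a parallel launch, one region per core */
        for (int rx = 0; rx < RX; rx++)
            for (int ry = 0; ry < RY; ry++)
                step_region(rx, ry);
    }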