Analysis of a Parallel 3D MD Application
Russian-German School on High-Performance Computer Systems, 27th June - 6th July 2005, Novosibirsk
Day 8, 6th of July 2005
Institute of Theoretical & Applied Mechanics, Novosibirsk
HLRS (Höchstleistungsrechenzentrum Stuttgart), University of Stuttgart
Overview of the talk
• Overview
• Background of the application
• Numerical scheme
• Grid setup and communication
• Optimization experiments
• Scalability of the code
• Summary
Background of the application
• Computation of the ignition of condensed material and of the transition from the burning to the detonation phase.
• Here: a 3D (mono)crystal of AgN3, to check the applicability of the most general non-stationary continuum mechanics equations.
• Interactions between molecules in the crystal are modelled with the two-body term of the Stillinger-Weber potential:

  U_inter(r_ij) = e_inter [ (b_inter / r_ij)^4 - 1 ] exp( σ_inter / (r_ij - a_inter) ),  for r_ij < a_inter

  with e_inter = 5x10^-21 J, b_inter = 9.7 Å, a_inter = 3.9 Å, and σ_inter = 0.007 Å.
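A minimal sketch of this pair term as a Fortran function; the parameter names and the explicit cut-off branch are illustrative and not taken from the original code:

      double precision function u_pair(rij)
c     Two-body Stillinger-Weber term as given above; beyond the
c     cut-off a_inter the term is zero.
      implicit none
      double precision rij
      double precision e_inter, b_inter, a_inter, sig_inter
      parameter (e_inter   = 5.0d-21)
      parameter (b_inter   = 9.7d0)
      parameter (a_inter   = 3.9d0)
      parameter (sig_inter = 0.007d0)
      if (rij .lt. a_inter) then
         u_pair = e_inter * ((b_inter/rij)**4 - 1.0d0)
     *          * exp(sig_inter / (rij - a_inter))
      else
         u_pair = 0.0d0
      end if
      end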
Numerical scheme
• The evolution of the system over time is described as:
• Position in space:  r_i^(k+1) = r_i^k + (p_i^k / m_i) Δt + (F_i^k / (2 m_i)) Δt^2
• Momentum:           p_i^(k+1) = p_i^k + (F_i^k + F_i^(k+1)) Δt / 2
• with F_i^k being the total force acting on the i-th atom from all other atoms.
• Actually not all other atoms, but rather a short-range interaction with a cut-off radius.
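A minimal sketch of one such update step in Fortran, assuming 3xN arrays r, p, f for positions, momenta and forces, masses m, a time step dt, and a force routine compute_forces that is not shown here; all names are illustrative:

      subroutine md_step(n, r, p, f, m, dt)
c     One update step in momentum form, following the two formulas above.
      implicit none
      integer n, i
      double precision r(3,n), p(3,n), f(3,n), fnew(3,n), m(n), dt
      do i = 1, n
c        position update: r^(k+1) = r^k + (p^k/m) dt + f^k dt^2 / (2m)
         r(:,i) = r(:,i) + p(:,i)/m(i)*dt + f(:,i)/(2.0d0*m(i))*dt**2
      end do
c     forces at the new positions (routine assumed to exist elsewhere)
      call compute_forces(n, r, fnew)
      do i = 1, n
c        momentum update: p^(k+1) = p^k + (f^k + f^(k+1)) dt / 2
         p(:,i) = p(:,i) + 0.5d0*(f(:,i) + fnew(:,i))*dt
      end do
      f = fnew
      end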
Grid setup
• The computational domain is decomposed:
  [figure: slabs along X assigned to MPI procs 0-3, X-Y plane shown]
• i.e. parallelized in the X-direction using MPI.
• As we have short-range interaction, the computational domain is organized in cells:
  [figure: grid of cells]
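A minimal sketch of how an atom could be assigned to its slab and cell, assuming a box length lx in X, number_total processes, and a cell edge of rcut (the cut-off radius); none of these names are taken from the original code:

      subroutine locate_atom(x, y, lx, rcut, number_total,
     *                       iproc, icx, icy)
c     Map an atom position to the owning MPI process (slab along X)
c     and to its cell indices within the slab.
      implicit none
      double precision x, y, lx, rcut
      integer number_total, iproc, icx, icy
c     slab index: equal-width slabs along X
      iproc = min(int(x * number_total / lx), number_total - 1)
c     cell indices: cell edge >= cut-off radius, so all interaction
c     partners are in the same or a neighbouring cell
      icx = int(x / rcut)
      icy = int(y / rcut)
      end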
Grid communication 1/2
• One-cell interaction range → one column of ghost cells.
  [figure: 4x4 grid of cells 1-16 per process, with the boundary columns replicated as ghost cells on the neighbouring processes]
• Actually several pieces of information have to be exchanged:
  – Number of atoms in the ghost cells to be sent off.
  – Position r_i and momentum p_i of each atom in the ghost cells.
• After recalculation of the energy of each cell:
  – Number of atoms that move over the boundary and migrate.
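A minimal sketch of this two-step exchange with the right neighbour Nri (receiving from the left neighbour Nli), first the atom count, then the coordinates; the names n_send, r_send, n_recv, r_recv are illustrative, not the original ones:

      subroutine exchange_right(n_send, r_send, n_recv, r_recv,
     *                          Nri, Nli)
c     Exchange the atom count first, then the 3*n coordinates.
c     MPI_Sendrecv avoids the deadlock risk of matched blocking sends.
      implicit none
      include 'mpif.h'
      integer n_send, n_recv, Nri, Nli, ierr
      integer status(MPI_STATUS_SIZE)
      double precision r_send(*), r_recv(*)
c     step 1: how many atoms are coming?
      call MPI_SENDRECV(n_send, 1, MPI_INTEGER, Nri, 1,
     *                  n_recv, 1, MPI_INTEGER, Nli, 1,
     *                  MPI_COMM_WORLD, status, ierr)
c     step 2: the positions themselves, sized by the counts
      call MPI_SENDRECV(r_send, 3*n_send, MPI_DOUBLE_PRECISION, Nri, 2,
     *                  r_recv, 3*n_recv, MPI_DOUBLE_PRECISION, Nli, 2,
     *                  MPI_COMM_WORLD, status, ierr)
      end

The momenta p_i would be exchanged the same way, or combined with the positions into one message as tried in the optimization experiments below.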
Grid communication 2/2
• Due to migration, the amount of data exchanged changes over time:
Running the code 1/2
• The code (F77) was immediately portable to the new platform, perfect.
• Small buglet on the 64-bit platform:

      program test5dd
      implicit DOUBLE PRECISION (a-h, o-z)
      include 'mpif.h'
      DOUBLE PRECISION, allocatable :: rm(:), pm(:)
      DOUBLE PRECISION, allocatable :: rm_myleft(:), pm_myleft(:)
      ....
c--------------------CLOCKWISE--------------------------------
c-----------------Right Interchange---------------------------
      Nri = number_my+1 - number_total*int((number_my+1)/(number_total))
      call MPI_SEND(rnmyright,1,MPI_DOUBLE_PRECISION,Nri,1,
     *     MPI_COMM_WORLD,ierr)
c------------------------Left Interchange---------------------
      Nli = number_my-1 + number_total*
     *      int((number_total-1-(number_my-1))/(number_total))
      call MPI_RECV(rnnleft,1,MPI_DOUBLE_PRECISION,Nli,1,
     *     MPI_COMM_WORLD,status,ierr)
Running the code 2/2
• Small buglets:
  – The implicit double precision (a-h, o-z) is neat, but MPI defines its opaque types to be integer; the MPI status needs to be declared as:
      integer :: status(MPI_STATUS_SIZE)
  – (known issue) Why did the number of atoms to be sent to the left/right neighbour have to be sent as a double?
  – Calculating the left/right neighbour once may be better:
      Nri = mod(number_my+1, number_total)
      Nli = mod(number_my-1+number_total, number_total)
  – Possible problem with several functions using:
      implicit real*8 (a-h, o-z)
  – The code depends on the eager protocol for delivering messages:
      call MPI_SEND(rnmyright,1,MPI_INTEGER,Nri,1,MPI_COMM_WORLD,ierr)
      call MPI_RECV(rnnleft,1,MPI_INTEGER,Nli,1,MPI_COMM_WORLD,status,ierr)
    Need to either use MPI_Sendrecv or a non-blocking interchange!
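A minimal sketch of the non-blocking variant for this pair (MPI_Sendrecv being the other option, as in the sketch after the grid-communication slide), meant as a fragment inside the existing program where mpif.h is already included; nmyright, nmyleft and req are assumed names:

c     Non-blocking interchange of the atom counts: both transfers are
c     posted first, then completed, so no eager buffering is required.
      integer req(2), statuses(MPI_STATUS_SIZE,2), ierr
      call MPI_IRECV(nmyleft,  1, MPI_INTEGER, Nli, 1,
     *               MPI_COMM_WORLD, req(1), ierr)
      call MPI_ISEND(nmyright, 1, MPI_INTEGER, Nri, 1,
     *               MPI_COMM_WORLD, req(2), ierr)
      call MPI_WAITALL(2, req, statuses, ierr)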
Optimization experiments 1/4
• One issue already discussed:

      call MPI_Gather(Eknode,1,MPI_DOUBLE_PRECISION,Ektot_array,1,
     *     MPI_DOUBLE_PRECISION,0,MPI_COMM_WORLD,ierr)
      call MPI_Gather(Wsurnode,1,MPI_DOUBLE_PRECISION,Wsur_array,1,
     *     MPI_DOUBLE_PRECISION,0,MPI_COMM_WORLD,ierr)
      call MPI_Gather(UNode,1,MPI_DOUBLE_PRECISION,UNode_array,1,
     *     MPI_DOUBLE_PRECISION,0,MPI_COMM_WORLD,ierr)
      if (number_my.eq.0) then
         Ektot=0.
         Wsurtot=0.
         Utot=0.
         do i=1,number_total
            Ektot=Ektot+Ektot_array(i)
            Wsurtot=Wsurtot+Wsur_array(i)
            Utot=Utot+UNode_array(i)
         end do
      end if
Optimization experiments 2/4
• May be replaced by a single call to MPI_Reduce:

      reduce_in_array(1) = EKnode
      reduce_in_array(2) = Wsurnode
      reduce_in_array(3) = UNode
      call MPI_Reduce(reduce_in_array, reduce_out_array, 3,
     *     MPI_DOUBLE_PRECISION, MPI_SUM, 0, MPI_COMM_WORLD, ierr)
      if (number_my.eq.0) then
         EKtot   = reduce_out_array(1)
         Wsurtot = reduce_out_array(2)
         Utot    = reduce_out_array(3)
      end if

• This reduces:
  – 3 collective operations (to one), plus
  – 3 allocations/deallocations whose size depends on number_total.
  – On strider this reduces the MPI time from 11 to 9 seconds (8 procs).
Optimization experiments 3/4
• Several messages are sent to the next neighbour. Combine them into one message using MPI_Pack:

      call MPI_PACK_SIZE(2*nmyright3, MPI_DOUBLE_PRECISION,
     *     MPI_COMM_WORLD, buffer_pack_size, ierr)
      if (buffer_pack_size > send_buffer_size) then
         if (send_buffer_size > 0) then
            deallocate (send_buffer)
         end if
         send_buffer_size = 2 * buffer_pack_size
         allocate (send_buffer(send_buffer_size))
      end if
      buffer_pos = 0
      call MPI_PACK(rm_myright, nmyright3, MPI_DOUBLE_PRECISION,
     *     send_buffer, send_buffer_size, buffer_pos,
     *     MPI_COMM_WORLD, ierr)
      call MPI_PACK(pm_myright, nmyright3, MPI_DOUBLE_PRECISION,
     *     send_buffer, send_buffer_size, buffer_pos,
     *     MPI_COMM_WORLD, ierr)
      call MPI_ISEND(send_buffer, buffer_pos, MPI_PACKED,
     *     Nri, 7, MPI_COMM_WORLD, req(1), ierr)
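The matching receive side is not shown on the slide; a minimal sketch, assuming a buffer recv_buffer of recv_buffer_size bytes and arrays rm_myleft/pm_myleft of nnleft3 elements each (names chosen to mirror the slide, not verified against the original code):

c     Receive the packed positions and momenta from the left neighbour
c     and unpack them into the two separate arrays again.
c     (For MPI_PACKED the receive count is a size in bytes.)
      call MPI_RECV(recv_buffer, recv_buffer_size, MPI_PACKED,
     *     Nli, 7, MPI_COMM_WORLD, status, ierr)
      buffer_pos = 0
      call MPI_UNPACK(recv_buffer, recv_buffer_size, buffer_pos,
     *     rm_myleft, nnleft3, MPI_DOUBLE_PRECISION,
     *     MPI_COMM_WORLD, ierr)
      call MPI_UNPACK(recv_buffer, recv_buffer_size, buffer_pos,
     *     pm_myleft, nnleft3, MPI_DOUBLE_PRECISION,
     *     MPI_COMM_WORLD, ierr)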
Optimization experiments 4/4
• This actually is slower: execution time goes up from 250 s to 280 s.
• Other solution to try: copying into one buffer by hand (simple memcpy).
• Currently testing: using separate non-blocking messages (if we get nodes on strider).
• Other option to find bottlenecks: using Vampir or MPItrace & Paraver.
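A minimal sketch of the hand-copied variant that avoids the MPI_Pack overhead, using the same illustrative array names as above:

c     Copy positions and momenta into one contiguous DOUBLE PRECISION
c     buffer by hand and send it as a single message.
      do i = 1, nmyright3
         send_buffer(i)           = rm_myright(i)
         send_buffer(nmyright3+i) = pm_myright(i)
      end do
      call MPI_ISEND(send_buffer, 2*nmyright3, MPI_DOUBLE_PRECISION,
     *     Nri, 7, MPI_COMM_WORLD, req(1), ierr)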
Scalability on MVS-1000
[Plot: execution time t (s), from roughly 4190 s down to 1580 s, versus inverse processor count 1/P]
• Performance of the 3D code.
• Inverse of processors on the x-axis.
• From 3 to 10 procs.
• Linear scalability.
• 50000 atoms, 5000 iterations.
• On MVS-1000 using Intel P3-800.
• Data by A.V. Utkin.
Scalability on AMD-Opteron
• Measurements done using the original code (buglets fixed):

  Procs   Total time (s)   MPI time (s)
    2        1062.7           11.69
    4         486.4           15.82
    8         248.8            2.71
   12         174.6           20.13
   16         145.6           20.99

• Measurements done with the new version (MPI_Gather fix + non-blocking):
Outlook
• Allow better scalability by hiding communication behind computation:
  – if possible; as yet no way has been found to split the computation so that the ghost cells can be handled separately before the internal domain (see the sketch of the general pattern below).
• Use MPI_Sendrecv in more places.
• Dynamically shift the domain boundaries to get rid of the load imbalance (which is the main reason for looking at the code):
  [figure: 1D domain decomposition with MPI procs 0-3 along X, as on the grid-setup slide]
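For reference, a minimal sketch of the generic overlap pattern, under the assumption (not yet true for this code) that the work can be split into an interior part that needs no ghost data and a boundary part that does; all routine and variable names are illustrative:

c     Generic communication/computation overlap pattern:
c     1. post the ghost-cell exchange with non-blocking calls,
c     2. compute the interior cells that need no remote data,
c     3. wait for the exchange, then compute the boundary cells.
      call MPI_IRECV(ghost_recv, nrecv, MPI_DOUBLE_PRECISION, Nli, 3,
     *     MPI_COMM_WORLD, req(1), ierr)
      call MPI_ISEND(ghost_send, nsend, MPI_DOUBLE_PRECISION, Nri, 3,
     *     MPI_COMM_WORLD, req(2), ierr)
      call compute_interior_cells()
      call MPI_WAITALL(2, req, statuses, ierr)
      call compute_boundary_cells()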
Acknowledgements
Thanks to Dr. Andrey V. Utkin for providing his code and papers, and for explaining the code.