
Single Processor Optimization (II)


  1. Single Processor Optimization (II). Russian-German School on High Performance Computer Systems, June 27th until July 6th, 2005, Novosibirsk. Day 2, June 28th, 2005. HLRS, University of Stuttgart. Slide 1 High Performance Computing Center Stuttgart

  2. Intention
     • different kinds of array declarations are tested
     • overhead of procedure calls
     • overhead of leaving procedures
     • allocation/deallocation times
     • performance implications of declarations for operations on arrays

  3. tested machines and compilers

     machine type                                  Itanium2_efc    NECSX5_f90  Pentium4_ifc  Pentium4_pgi
     frequency in MHz                                      1000           500          2400          2400
     clock tics for setting the next clock tic               38            31            92           112
     tics of calling overhead for one clock call             37            29            88            88
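
The tic counts above are easier to compare across machines once they are converted to time. The following is a minimal sketch, assuming each counter advances once per cycle at the listed frequency (the slides do not state this explicitly); the helper name `tics_to_ns` is made up for illustration.

```c
#include <assert.h>

/* Convert a clock-tic count to nanoseconds.
   Assumption: the counter advances once per cycle at `mhz` MHz. */
double tics_to_ns(double tics, double mhz)
{
    assert(mhz > 0.0);
    /* 1 MHz means one tic per microsecond, i.e. 1000 ns per `mhz` tics */
    return tics * 1000.0 / mhz;
}
```

Under that assumption, the calling overhead of one clock call is 37 ns on the Itanium2 (37 tics at 1000 MHz), 58 ns on the NEC SX-5 (29 tics at 500 MHz), and about 36.7 ns on the Pentium 4 (88 tics at 2400 MHz).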

  4. Compiler
     • Intel IA-32: ifc -O3 -nodps -hlo -tpp6 (or -tpp7)
     • Intel IA-64: efc -O3 -hlo -opt_report -opt_report_levelmax -opt_report_phaseall # -ip # -S
     • Portland Group (there may be better options): pgf90 -fast
     • NEC SX: f90; note that non-overlapping pointers are the default

  5. test environment
     • the simplest procedures should allow for the best optimization
     • a slim instruction body exposes the calling overhead
     • timers are based on hardware counters read by assembler routines
       – calling overhead of 30 to 90 cycles
       – PAPI is too inaccurate
     • programs are portable as long as hardware counters can be provided

  6. how to call the IA-32 counter
     • IA-32 Linux; assembler embedded in C; also works with pgi and on AMD
     • icc -c clock_tic.c

     C source (clock_tic.c):

         unsigned long long int clock_tic_ () {
           unsigned long long int x;
           __asm__ volatile ("rdtsc\n" : "=A" (x));
           return x;
         }

     Fortran caller:

         integer(kind=8) :: int_start
         integer(kind=8) :: int_end
         integer(kind=8),external :: clock_tic

         int_start=clock_tic()
         do ii=1,imax
           a(ii)=b(ii)+c(ii)
         enddo
         int_end=clock_tic()
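
The `"=A"` constraint used on this slide names the EDX:EAX register pair only on 32-bit x86. On x86-64, GCC-style compilers treat it as a single register, so the same source would return a wrong value there. The sketch below is one possible 64-bit-safe variant, not the authors' code: `combine_tsc` and `clock_tic_portable` are hypothetical names, and the inline assembler is compiled only on x86 targets with a GCC-compatible compiler.

```c
#include <assert.h>
#include <stdint.h>

/* Join the two 32-bit halves that rdtsc delivers in EDX (high) and EAX (low). */
uint64_t combine_tsc(uint32_t lo, uint32_t hi)
{
    return ((uint64_t)hi << 32) | lo;
}

#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
/* Hypothetical replacement for clock_tic_: reads both halves explicitly,
   so it behaves the same on IA-32 and x86-64. */
uint64_t clock_tic_portable(void)
{
    uint32_t lo, hi;
    __asm__ volatile ("rdtsc" : "=a" (lo), "=d" (hi));
    return combine_tsc(lo, hi);
}
#endif
```

To call it from Fortran as on the slide, the function would need the name the Fortran compiler expects, typically with a trailing underscore.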

  7. how to call the IA-64 counter
     • IA-64 Linux assembler
     • ecc -c clock_tic.s

     Assembler source (clock_tic.s):

         .text
         .align 16
         // C version
         // long clock_tic()
         .global clock_tic#
         .proc clock_tic#
         clock_tic:
           mov r8 = ar.itc
           br.ret.sptk.many b0
         .endp clock_tic#

         .align 16
         // Fortran version
         // integer*8 clock_tic
         .global clock_tic_#
         .proc clock_tic_#
         clock_tic_:
           mov r8 = ar.itc
           br.ret.sptk.many b0
         .endp clock_tic_#

     Fortran caller:

         integer(kind=8) :: int_start
         integer(kind=8) :: int_end
         integer(kind=8),external :: clock_tic

         int_start=clock_tic()
         do ii=1,imax
           a(ii)=b(ii)+c(ii)
         enddo
         int_end=clock_tic()

  8. how to call the NEC SX counter
     • NEC SX user time counter
     • as -dl clock_tic.s

         global clock_tic_
         clock_tic_:
           stusrcc $s123
           b 0(,$s32)

     • NEC SX wall clock counter
     • as -dl clock_tic_wall.s

         global clock_tic_wall_
         clock_tic_wall_:
           ststm $s123
           b 0(,$s32)

     Fortran caller:

         integer(kind=8) :: int_start
         integer(kind=8) :: int_end
         integer(kind=8),external :: clock_tic

         int_start=clock_tic()
         do ii=1,imax
           a(ii)=b(ii)+c(ii)
         enddo
         int_end=clock_tic()

  9. Part 1: procedure calls and declarations
     • detailed measurements of entering and leaving procedures
     • allocation/deallocation and automatic array timings
     • procedures in a module are tested
       – in the same file
       – and in a different file
     • allocation of a large number of pointers
     • simple recursive procedures

  10. measuring methodology

      measuring environment:

          int_1=clock_tic()
          do nn=1,nmax   ! repetition loop
            int_2=clock_tic()
            call extern_explicit_shape_array(array,ix,iy)
            int_5=clock_tic()
            int_time_23=int_time_23+(int_3-int_2)
            int_time_34=int_time_34+(int_4-int_3)
            int_time_45=int_time_45+(int_5-int_4)
          enddo
          int_6=clock_tic()
          int_time_16=int_6 - int_1

      measured procedure:

          subroutine explicit_shape_array(array,ix,iy)
            integer :: ix,iy
            real(kind=8),dimension(ix,iy) :: array
            int_3=clock_tic()
            array(1,1) = 0.
            int_4=clock_tic()
          end subroutine explicit_shape_array

      calculation of timings:

          time_array(1)=real(int_time_23)/real(nmax) - tics_for_calling_clock
          time_array(2)=real(int_time_34)/real(nmax) - tics_for_calling_clock
          time_array(3)=real(int_time_45)/real(nmax) - tics_for_calling_clock
          time_array(4)=int_time_16 - tics_for_calling_clock
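
The averaging scheme on this slide can be sketched in C as follows. This is a simplified illustration, not the original harness: it times a single interval instead of the three intervals (entering, body, leaving), and the clock is passed in as a function pointer so that a deterministic stand-in can replace the hardware counter. The names `mean_tics`, `fake_clock` and `fake_work` are hypothetical.

```c
#include <assert.h>

typedef unsigned long long tic_t;
typedef tic_t (*clock_fn)(void);

/* Accumulate the tics spent in work() over nmax repetitions and return
   the mean per repetition, minus the cost of one clock read. */
double mean_tics(clock_fn clock_tic, void (*work)(void),
                 int nmax, double tics_for_calling_clock)
{
    tic_t total = 0;
    assert(nmax > 0);
    for (int nn = 0; nn < nmax; nn++) {   /* repetition loop */
        tic_t start = clock_tic();
        work();
        tic_t end = clock_tic();
        total += end - start;
    }
    return (double)total / (double)nmax - tics_for_calling_clock;
}

/* Deterministic stand-ins so the sketch runs without a hardware counter. */
static tic_t fake_now = 0;

/* Every clock read advances the fake counter by 37 tics. */
tic_t fake_clock(void) { fake_now += 37; return fake_now; }

/* The measured work advances the fake counter by 5 tics. */
void fake_work(void) { fake_now += 5; }
```

With the stand-ins, each measured interval contains one clock read (37 tics) plus the work (5 tics), so `mean_tics(fake_clock, fake_work, 1000, 37.0)` yields 5.0, which is why the slide subtracts `tics_for_calling_clock` from every average.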

  11. interpretation of tables
      row 1: machine
      rows 2-4: procedure in the same file (entering, body, leaving)
      rows 5-7: procedure in a different file (entering, body, leaving)
      values are clock tics

                                 Itanium2_efc    NECSX5_f90  Pentium4_ifc  Pentium4_pgi
      same file       entering              7            43             4            12
                      body                  1             5             0             4
                      leaving               3             8             0             4
      different file  entering              7            40             4            12
                      body                  1             5             0             4
                      leaving              43            45            88            96

  12. implicit_procedure

          subroutine implicit_procedure2(j,time)
            integer :: j
            real(kind=8) :: time
            j=max(j,int(time))   ! only to confuse the compiler
          end subroutine implicit_procedure2

      procedure in an external file; total time for one call

                                 Itanium2_efc    NECSX5_f90  Pentium4_ifc  Pentium4_pgi
      call                                 21            42            24            32

  13. explicit_shape_array

          subroutine explicit_shape_array(array,ix,iy)
            integer :: ix,iy
            real(kind=8),dimension(ix,iy) :: array
            int_3=clock_tic()
            array(1,1) = 0.
            int_4=clock_tic()
          end subroutine explicit_shape_array

      short time for leaving the procedure when the procedure is in the same file

                                 Itanium2_efc    NECSX5_f90  Pentium4_ifc  Pentium4_pgi
      same file       entering              7            43             4            12
                      body                  1             5             0             4
                      leaving               3             8             0             4
      different file  entering              7            40             4            12
                      body                  1             5             0             4
                      leaving              43            45            88            96

  14. assumed_shape_array

          subroutine assumed_shape_array(array)
            real(kind=8),dimension(:,:) :: array
            int_3=clock_tic()
            array(1,1) = 0.
            int_4=clock_tic()
          end subroutine assumed_shape_array

      needs much more time for entering and leaving the procedure

                                 Itanium2_efc    NECSX5_f90  Pentium4_ifc  Pentium4_pgi
      same file       entering             11           184             8           745
                      body                 13             7             0             0
                      leaving              83            82           180           228
      different file  entering             10           183             8           752
                      body                 13             7             0             0
                      leaving             123           119           272           360

  15. assumed_shape_array_section 1

          subroutine assumed_shape_array_section(array_1,array_2,digit,ix,iy)
            real(kind=8),dimension(:,:) :: array_1
            real(kind=8),dimension(:,:) :: array_2
            real(kind=8) :: digit
            integer :: ix,iy
            int_3=clock_tic()
            array_1(ix,iy) = digit; array_2(ix,iy) = 2.; digit = array_1(ix,iy) + array_2(ix,iy)
            int_4=clock_tic()
          end subroutine assumed_shape_array_section

      three different cases for actual parameters:

          call assumed_shape_array_section(array_1,array_2,digit,a,b)
          call assumed_shape_array_section(array_1(1:a,1:b),array_2(1:a,1:b),digit,a,b)
          call assumed_shape_array_section(array_1(1:a:2,1:b:2),array_2(1:a:2,1:b:2),digit,a,b)

  16. assumed_shape_array_section 2

          1) call assumed_shape_array_section(array_1,array_2,digit,a,b)
          2) call assumed_shape_array_section(array_1(1:a,1:b),array_2(1:a,1:b),digit,a,b)
          3) call assumed_shape_array_section(array_1(1:a:2,1:b:2),array_2(1:a:2,1:b:2),digit,a,b)

                                 Itanium2_efc    NECSX5_f90  Pentium4_ifc  Pentium4_pgi
      case 1          entering             17           377            24          1494
                      body                 43            27            52            25
                      leaving               7             8             4            96
      case 2          entering             52          1302            85          2146
                      body                 40            29            50            22
                      leaving               7             8             4            96
      case 3          entering             56          1183            84          5997
                      body                 30            27            36            25
                      leaving              51            45            96          4193

      copying when entering the procedure for cases 2 and 3
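
The copying noted for cases 2 and 3 can be pictured as copy-in/copy-out through a contiguous temporary. The C sketch below shows that general technique in one dimension. It is an illustration under the assumption that the compilers pack the section before the call and write it back afterwards; the slides do not show what the compilers actually generate. All function names and the fixed temporary size are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Copy-in: gather every stride-th element into a contiguous buffer. */
void pack_strided(const double *src, size_t stride, size_t count, double *tmp)
{
    assert(stride > 0);
    for (size_t i = 0; i < count; i++)
        tmp[i] = src[i * stride];
}

/* Copy-out: scatter the contiguous buffer back into the strided section. */
void unpack_strided(double *dst, size_t stride, size_t count, const double *tmp)
{
    assert(stride > 0);
    for (size_t i = 0; i < count; i++)
        dst[i * stride] = tmp[i];
}

/* A callee that only understands contiguous storage. */
void scale_contiguous(double *a, size_t n, double factor)
{
    for (size_t i = 0; i < n; i++)
        a[i] *= factor;
}

/* Call the contiguous routine on a strided section via a temporary.
   The two extra loops are the per-call cost that grows with section size. */
void scale_section(double *array, size_t stride, size_t count, double factor)
{
    double tmp[64];               /* fixed size keeps the sketch simple */
    assert(count <= 64);
    pack_strided(array, stride, count, tmp);
    scale_contiguous(tmp, count, factor);
    unpack_strided(array, stride, count, tmp);
}
```

For example, calling `scale_section(a, 2, 3, 10.0)` on a six-element array scales elements 0, 2 and 4 and leaves the others untouched, at the price of touching each selected element three times instead of once.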

  17. deferred_shape_array

          subroutine deferred_shape_array(digit,x,y)
            real(kind=8),allocatable,dimension(:,:) :: array_1
            real(kind=8),allocatable,dimension(:,:) :: array_2
            real(kind=8) :: digit
            integer :: x,y
            int_3=clock_tic()
            allocate (array_1(x,y),array_2(x,y))
            array_1(x,y) = digit; array_2(x,y) = 2. ; digit = array_1(x,y) + array_2(x,y)
            deallocate (array_1,array_2)
            int_4=clock_tic()
          end subroutine deferred_shape_array

      • high times for allocation and deallocation
      • large times for leaving the procedure

                                 Itanium2_efc    NECSX5_f90  Pentium4_ifc  Pentium4_pgi
      same file       entering             15           218             4            16
                      body                682          2405          1036          3218
                      leaving            1642          1707          3611          4485
      different file  entering             15           219             4            16
                      body                682          2391          1051          3297
                      leaving            1719          1962          3709          4577

  18. automatic_arrays

          subroutine automatic_arrays(digit,ix,iy)
            real(kind=8),dimension(ix,iy) :: array_1,array_2
            real(kind=8) :: digit
            integer :: ix,iy
            int_3 = clock_tic()
            array_1(ix,iy) = digit; array_2(ix,iy) = 2.
            digit = array_1(ix,iy) + array_2(ix,iy)
            int_4 = clock_tic()
          end subroutine automatic_arrays

      • NEC SX is quite fast
      • pgi is much worse
      • a much better solution than allocation and deallocation

                                 Itanium2_efc    NECSX5_f90  Pentium4_ifc  Pentium4_pgi
      same file       entering            372           208           500          1369
                      body                 15            26            36            16
                      leaving             284            78           269           732
      different file  entering            372           206           500          1349
                      body                 15            26            36            16
                      leaving             598           172           625          1493
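
A loose C analogy for the two declaration styles compared on slides 17 and 18: heap allocation inside the routine versus storage that exists only for the duration of the call. This is a sketch under the assumption that automatic arrays are placed on the stack, which is compiler dependent and not stated in the slides. It uses C99 variable-length arrays, and the function names `sum_heap` and `sum_stack` are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

/* Heap version: loosely corresponds to allocate/deallocate in the body. */
double sum_heap(double digit, int ix, int iy)
{
    assert(ix > 0 && iy > 0);
    size_t n = (size_t)ix * (size_t)iy;
    double *array_1 = malloc(n * sizeof *array_1);
    double *array_2 = malloc(n * sizeof *array_2);
    if (!array_1 || !array_2) {   /* sketch only: report failure as 0.0 */
        free(array_1);
        free(array_2);
        return 0.0;
    }
    array_1[n - 1] = digit;
    array_2[n - 1] = 2.0;
    double result = array_1[n - 1] + array_2[n - 1];
    free(array_1);
    free(array_2);
    return result;
}

/* Stack version: loosely corresponds to automatic arrays; the storage is
   set up on entry and released on return without explicit calls. */
double sum_stack(double digit, int ix, int iy)
{
    assert(ix > 0 && iy > 0);
    double array_1[iy][ix], array_2[iy][ix];   /* C99 variable-length arrays */
    array_1[iy - 1][ix - 1] = digit;
    array_2[iy - 1][ix - 1] = 2.0;
    return array_1[iy - 1][ix - 1] + array_2[iy - 1][ix - 1];
}
```

Both functions compute the same value. The difference that matters for the measurements is where the cost lands: in the body for explicit allocation, and in entering and leaving for automatic storage.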
