
Evaluation and Optimization of Multicore Performance Bottlenecks in Supercomputing Applications



  1. Evaluation and Optimization of Multicore Performance Bottlenecks in Supercomputing Applications
     Jeff Diamond (1), Martin Burtscher (2), John D. McCalpin (3), Byoung-Do Kim (3), Stephen W. Keckler (1,4), James C. Browne (1)
     (1) University of Texas, (2) Texas State, (3) Texas Advanced Computing Center, (4) NVIDIA

  2. Trends In Supercomputers
     [Figure: chart; the title, axis values, and legend are not recoverable from the extraction.]

  3. [Figure: chart annotated "Is multicore an issue?"; the slide title, axis values, and legend are not recoverable from the extraction.]

  4. The Problem: Multicore Scalability
     [Figure: three speedup plots. Panel titles: Intranode Scaling Performance (x-axis: Active Cores Per Node) and Interchip Scaling Performance, shown twice (x-axis: Total Active Cores, One Core Per Chip). Y-axis: Speedup. Legend entries and tick values are not recoverable.]

  5. The Problem: Multicore Scalability
     [Figure: three speedup plots. Panel titles: Intrachip Scaling Performance (x-axis: Active Cores Per Chip) and Interchip Scaling Performance, shown twice (x-axis: Total Active Cores, One Core Per Chip). Y-axis: Speedup. Legend entries and tick values are not recoverable.]

  6. Optimizations Differ in Multicore
     [Figure: bar charts captioned "Base code vs Multicore Optimized code"; panel titles and axis values are not recoverable.]

  7. Paper Contributions
     - Studies multicore-related bottlenecks
     - Identifies performance measurement challenges unique to multicore systems
     - Presents a systematic approach to multicore performance analysis
     - Demonstrates principles of optimization

  8. Talk Outline
     - Introduction
     - Approach: An HPC Case Study
     - Multicore Measurement Issues
     - Optimization Example
     - Conclusion

  9. Approach: An HPC Case Study
     - Examine a real HPC application
       - Major functions add variety
     - What is a typical HPC application?
       - Many exhibit low arithmetic intensity
     - Typical of explicit / iterative solvers, stencils
     - Finite volume / elements / differences
     - CFD, molecular dynamics, particle simulations, graph search, sparse MM, etc.

  10. Approach: An HPC Case Study
     - Application: HOMME
       - High Order Method Modeling Environment
       - 3-D atmospheric simulation from NCAR
       - Required for NSF acceptance testing
       - Excellent scaling, highly optimized
       - Arithmetic intensity typical of stencil codes
     - Supercomputers:
       - Ranger: 62,976 cores, 579 teraflops
         - 2.3 GHz quad-core AMD Barcelona chips
       - Longhorn: 2,048 cores + 512 GPUs
         - 2.5 GHz quad-core Intel Nehalem-EP chips

  11. Talk Outline
     - Introduction
     - Approach: An HPC Case Study
     - Multicore Measurement Issues
     - Optimization Example
     - Conclusion

  12. Multicore Performance Bottlenecks
     [Diagram labels: private L1/L2 caches per core; single chip with a shared L3 cache; node with shared off-chip bandwidth and shared DRAM page caches; local DRAM on a single DIMM.]

  13. Disturbances Persist Longer
     [Figure: Function Performance Over Time. Y-axis: Execution Time (cycles); x-axis: Major Simulation Timesteps. Tick values are not recoverable.]

  14. Measurement Implications
     [Figure: Function Performance Over Time, with a three-entry legend. Y-axis: Execution Time (cycles); x-axis: Major Simulation Timesteps. Legend entries and tick values are not recoverable.]

  15. Measurements Must Be Lightweight

     | Action | Cycles |
     | --- | --- |
     | Read Counter | 9 |
     | Read Four Counters | 30 |
     | Call Function | 40 |
     | PAPI READ | 400 |
     | System Call | 5,000 |
     | TLB Page Initialization | 25,000 |

     Duration of major HOMME functions:

     | Function Duration | Calls Per Second | % Exec Time |
     | --- | --- | --- |
     | 2,000 cycles or less | 100,000 | 20% |
     | 2,000 to 10,000 cycles | 20,000 | 10% |
     | 10K to 200K cycles | 1,600 | 15% |
     | 200K to 1M cycles | 200 | 15% |
     | 1M to 10M cycles | - | 0% |
     | 10M or more cycles | 4 | 35% |
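The point of the two tables on slide 15 can be made concrete with a little arithmetic. The sketch below is an illustration, not part of the talk: it assumes each timed call is bracketed by two reads of the measurement primitive (one at entry, one at exit) and uses the 2,000-cycle upper bound of the shortest function class as the call duration. The cycle costs are copied from the slide; the program and function names are hypothetical.

```fortran
program measurement_overhead
  implicit none
  ! Cycle costs copied from the slide's first table.
  integer, parameter :: n = 4
  character(len=18), parameter :: names(n) = [character(len=18) :: &
       'Read Counter', 'Read Four Counters', 'PAPI READ', 'System Call']
  real, parameter :: cost(n) = [9.0, 30.0, 400.0, 5000.0]
  ! Upper bound of the shortest function class in the second table.
  real, parameter :: func_cycles = 2000.0
  ! ASSUMPTION (not from the slides): two reads per timed call, entry and exit.
  real, parameter :: reads_per_call = 2.0
  integer :: i

  do i = 1, n
     print '(a18, f8.1, a)', names(i), overhead_percent(cost(i)), ' %'
  end do

contains

  ! Cost of bracketing one call, as a percentage of the call's own duration.
  pure function overhead_percent(read_cost) result(pct)
    real, intent(in) :: read_cost
    real :: pct
    pct = 100.0 * reads_per_call * read_cost / func_cycles
  end function overhead_percent

end program measurement_overhead
```

Under these assumptions a raw counter read adds about 0.9% to a 2,000-cycle call and PAPI READ adds about 40%, while a system call costs five times the call itself. Functions of that size account for 20% of execution time in the second table.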

  16. Multicore Measurement Issues
     - Performance issues in a shared-memory system are:
       - Context sensitive
       - Nondeterministic
       - Highly nonlocal
     - Measurement disturbance is significant
       - Accessing memory or delaying a core
       - Hard to “bracket” measurement effects
       - Disturbances can last billions of cycles
       - Bottlenecks can be “bursty”
     - Conclusion: multiple tools are needed

  17. Talk Outline
     - Introduction
     - Approach: An HPC Case Study
     - Multicore Measurement Issues
     - Optimization Example
     - Conclusion

  18. Multicore Performance Bottlenecks
     [Diagram labels: single chip with L1 and L2 caches and a shared L3 cache; node with shared off-chip bandwidth and shared DRAM page caches; local DRAM on a single DIMM.]

  19. Measurement Approach
     - Find important functions
     - Compare performance counters at min/max core density
     - Identify the key multicore bottleneck:
       - L3 capacity: L3 miss rates increase with density
       - Off-chip BW: BW usage at min density is greater than its share
       - DRAM contention: DRAM page miss rates increase with density
     - For small and medium functions, follow up with lightweight / temporal measurements
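The three tests on slide 19 can be written down as simple comparisons. The following is a minimal sketch under stated assumptions, not the authors' tooling: every numeric value is a made-up placeholder, the 25% growth threshold for "increases with density" is an assumption, and "share" is read here as the chip's off-chip bandwidth divided evenly among its cores.

```fortran
program classify_bottleneck
  implicit none
  ! HYPOTHETICAL counter-derived values for one function; none come from the slides.
  real, parameter :: l3_miss_min   = 0.05, l3_miss_max   = 0.12  ! L3 miss rate
  real, parameter :: page_miss_min = 0.20, page_miss_max = 0.22  ! DRAM page miss rate
  real, parameter :: bw_min_density = 3.0   ! GB/s used by a core running alone on its chip
  real, parameter :: chip_bw        = 10.0  ! GB/s of off-chip bandwidth per chip
  integer, parameter :: cores_per_chip = 4
  ! ASSUMPTION: a rate counts as "increasing with density" if it grows by more than 25%.
  real, parameter :: growth = 1.25

  ! L3 capacity: miss rate rises when more cores share the L3.
  if (l3_miss_max > growth * l3_miss_min) print '(a)', 'L3 capacity'

  ! Off-chip BW: one core alone already uses more than an even per-core share.
  if (bw_min_density > chip_bw / real(cores_per_chip)) print '(a)', 'Off-chip BW'

  ! DRAM contention: page miss rate rises with density.
  if (page_miss_max > growth * page_miss_min) print '(a)', 'DRAM contention'
end program classify_bottleneck
```

Each test is independent, so a function can be flagged for more than one bottleneck.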

  20. Important HOMME Loop

         do k=1,nlev
            do j=1,nv
               do i=1,nv
                  T(i,j,k,n0) = T(i,j,k,n0) + smooth*(T(i,j,k,nm1) &
                       - 2.0D0*T(i,j,k,n0) + T(i,j,k,np1))
                  v(i,j,1,k,n0) = v(i,j,1,k,n0) + smooth*(v(i,j,1,k,nm1) &
                       - 2.0D0*v(i,j,1,k,n0) + v(i,j,1,k,np1))
                  v(i,j,2,k,n0) = v(i,j,2,k,n0) + smooth*(v(i,j,2,k,nm1) &
                       - 2.0D0*v(i,j,2,k,n0) + v(i,j,2,k,np1))
                  div(i,j,k,n0) = div(i,j,k,n0) + smooth*(div(i,j,k,nm1) &
                       - 2.0D0*div(i,j,k,n0) + div(i,j,k,np1))
               end do
            end do
         end do

  21. Apply Microfission (First Line)

  22. Loop Microfission
     - Local, context-free optimization
     - Each array processed independently
       - Add high-level blocking to fit cache
     - Reduces total DRAM banks accessed
       - Statistically reduces DRAM page miss rate
     - Reduces instantaneous working set size
       - Helps with L3 capacity and off-chip BW
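To illustrate "each array processed independently", here is one possible sketch of the idea applied to the HOMME loop from slide 20. It is a reconstruction under assumptions, not the code shown in the talk: the array sizes and time-level indices are illustrative, the helper `smooth_field` is a hypothetical name, and the high-level blocking mentioned on the slide is left out. The program runs the fused loop and the fissioned version on the same data and checks that they agree.

```fortran
program microfission_sketch
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  ! ASSUMPTION: small illustrative sizes; the slides do not give HOMME's dimensions.
  integer, parameter :: nv = 4, nlev = 8
  integer, parameter :: nm1 = 1, n0 = 2, np1 = 3   ! time-level indices (assumed values)
  real(dp), parameter :: smooth = 0.05_dp
  real(dp) :: T(nv,nv,nlev,3), div(nv,nv,nlev,3), v(nv,nv,2,nlev,3)
  real(dp) :: T_ref(nv,nv,nlev,3), div_ref(nv,nv,nlev,3), v_ref(nv,nv,2,nlev,3)
  real(dp) :: err
  integer :: i, j, k, c

  call random_number(T)
  call random_number(div)
  call random_number(v)
  T_ref = T
  div_ref = div
  v_ref = v

  ! Fused form, as on slide 20: every iteration touches all four data streams.
  do k = 1, nlev
     do j = 1, nv
        do i = 1, nv
           T_ref(i,j,k,n0) = T_ref(i,j,k,n0) + smooth*(T_ref(i,j,k,nm1) &
                - 2.0_dp*T_ref(i,j,k,n0) + T_ref(i,j,k,np1))
           do c = 1, 2
              v_ref(i,j,c,k,n0) = v_ref(i,j,c,k,n0) + smooth*(v_ref(i,j,c,k,nm1) &
                   - 2.0_dp*v_ref(i,j,c,k,n0) + v_ref(i,j,c,k,np1))
           end do
           div_ref(i,j,k,n0) = div_ref(i,j,k,n0) + smooth*(div_ref(i,j,k,nm1) &
                - 2.0_dp*div_ref(i,j,k,n0) + div_ref(i,j,k,np1))
        end do
     end do
  end do

  ! Fissioned form: one complete loop nest per array (or array component).
  call smooth_field(T)
  do c = 1, 2
     call smooth_field(v(:,:,c,:,:))
  end do
  call smooth_field(div)

  ! The statements are independent, so both forms should give the same values.
  err = max(maxval(abs(T - T_ref)), maxval(abs(v - v_ref)), maxval(abs(div - div_ref)))
  if (err > 1.0e-12_dp) error stop 'fused and fissioned results differ'
  print '(a)', 'fused and fissioned results match'

contains

  ! Applies the smoothing update to a single field, sweeping only that array.
  subroutine smooth_field(a)
    real(dp), intent(inout) :: a(:,:,:,:)
    integer :: i, j, k
    do k = 1, size(a, 3)
       do j = 1, size(a, 2)
          do i = 1, size(a, 1)
             a(i,j,k,n0) = a(i,j,k,n0) + smooth*(a(i,j,k,nm1) &
                  - 2.0_dp*a(i,j,k,n0) + a(i,j,k,np1))
          end do
       end do
    end do
  end subroutine smooth_field

end program microfission_sketch
```

Fission is legal here because each statement reads and writes only its own array at the same (i,j,k) index. Each sweep then streams through the time levels of one array at a time, which is the working-set reduction the slide describes.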

  23. Microfission Results
     [Figure: bar chart of microfission results; the chart title, legend, and axis values are not recoverable from the extraction.]

  24. Talk Outline
     - Introduction
     - Approach: An HPC Case Study
     - Multicore Measurement Issues
     - Optimization Example
     - Conclusion
