Evaluation and Optimization of Multicore Performance Bottlenecks in Supercomputing Applications
Jeff Diamond (1), Martin Burtscher (2), John D. McCalpin (3), Byoung-Do Kim (3), Stephen W. Keckler (1,4), James C. Browne (1)
(1) University of Texas, (2) Texas State, (3) Texas Advanced Computing Center, (4) NVIDIA

Trends In Supercomputers
[Chart: title, legend, and axis labels not recoverable from the extracted text]

[Chart annotated "Is multicore an issue?"; title, legend, and axis labels not recoverable from the extracted text]

The Problem: Multicore Scalability
[Charts, two slides. Panels: "Intranode Scaling Performance" (speedup vs. active cores per node), "Intrachip Scaling Performance" (speedup vs. active cores per chip), and "Interchip Scaling Performance" (speedup vs. total active cores, one core per chip). Legend entries: Linear Speedup, Weak Scaling, Strong Scaling. Data values not recoverable.]

Optimizations Differ in Multicore
[Chart: base code vs. multicore-optimized code; axis labels and data values not recoverable from the extracted text]

Paper Contributions l Studies multicore related bottlenecks l Identifies performance measurement challenges unique to multicore systems l Presents systematic approach to multicore performance analysis l Demonstrates principles of optimization 7

Talk Outline l Introduction l Approach: An HPC Case Study l Multicore Measurement Issues l Optimization Example l Conclusion 8

Approach: An HPC Case Study l Examine a real HPC application ¡ Major functions add variety l What is a typical HPC application? ¡ Many exhibit low arithmetic intensity l Typical of explicit / iterative solvers, stencils l Finite volume / elements / differences l CFD, Molecular dynamics, particle simulations, graph search, Sparse MM, etc. 9

Approach: An HPC Case Study l Application: HOMME ¡ High Order Method Modeling Environment ¡ 3-D Atmospheric Simulation from NCAR ¡ Required for NSF acceptance testing ¡ Excellent scaling, highly optimized ¡ Arithmetic Intensity typical of stencil codes l Supercomputers: ¡ Ranger – 62,976 cores, 579 Teraflops • 2.3 GHz quad core AMD Barcelona chips ¡ Longhorn – 2,048 cores + 512 GPUs • 2.5 GHz quad core Intel Nehalem-EP chips 10

Multicore Performance Bottlenecks
[Diagram of one node. Single chip: shared L3 cache, private L1/L2 caches per core. Node: shared off-chip BW, shared DRAM page caches. Also labeled: local DRAM, single DIMM.]

Disturbances Persist Longer
[Chart: "Function Performance Over Time"; y-axis: execution time (cycles); x-axis: major simulation timesteps. Data values not recoverable.]

Measurement Implications
[Chart: "Function Performance Over Time", three series; y-axis: execution time (cycles); x-axis: major simulation timesteps. Legend labels and data values not recoverable.]

Measurements Must Be Lightweight

Action                     Cycles
Read Counter                    9
Read Four Counters             30
Call Function                  40
PAPI READ                     400
System Call                 5,000
TLB Page Initialization    25,000

Function Duration         Calls Per Second   % Exec Time
2,000 cycles or less               100,000           20%
2,000 to 10,000 cycles              20,000           10%
10K to 200K cycles                   1,600           15%
200K to 1M cycles                      200           15%
1M to 10M cycles                         -            0%
10M or more cycles                       4           35%

Duration of major HOMME functions
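A rough worked example of why the read cost matters for short functions. This is a sketch, not part of the original slides: it assumes each call is bracketed by one counter read on entry and one on exit, and it uses 2,000 cycles (the upper edge of the shortest duration bucket) as an illustrative function length. The per-read costs come from the Action / Cycles table. Under those assumptions, bracketing with raw counter reads adds 18 cycles (0.9%), while bracketing with PAPI READ adds 800 cycles (40%).

```fortran
program bracket_overhead
  implicit none
  ! Per-read costs in cycles, taken from the "Action / Cycles" table.
  real, parameter :: raw_read  = 9.0     ! Read Counter
  real, parameter :: papi_read = 400.0   ! PAPI READ
  ! Assumption: one read on entry and one read on exit of every call.
  real, parameter :: reads_per_call = 2.0
  ! Illustrative duration: upper edge of the "2,000 cycles or less" bucket.
  real, parameter :: func_cycles = 2000.0

  print '(a,f6.1,a)', 'raw counter overhead: ', &
        overhead_pct(raw_read, func_cycles), ' %'
  print '(a,f6.1,a)', 'PAPI read overhead:   ', &
        overhead_pct(papi_read, func_cycles), ' %'

contains

  ! Added cycles as a percentage of the uninstrumented function duration.
  pure function overhead_pct(read_cost, duration) result(pct)
    real, intent(in) :: read_cost, duration
    real :: pct
    pct = 100.0 * reads_per_call * read_cost / duration
  end function overhead_pct

end program bracket_overhead
```

Shorter functions inside the same bucket would see proportionally larger overhead, which is the motivation for lightweight measurement.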

Multicore Measurement Issues l Performance issues in shared memory system ¡ Context Sensitive ¡ Nondeterministic ¡ Highly non local l Measurement disturbance is significant ¡ Accessing memory or delaying core ¡ Hard to “bracket” measurement effects ¡ Disturbances can last billions of cycles ¡ Bottlenecks can be “bursty” l Conclusion – need multiple tools 16

Multicore Performance Bottlenecks
[Diagram of one node, showing only the shared resources. Single chip: shared L3 cache above the per-core L1 and L2 caches. Node: shared off-chip BW, shared DRAM page caches. Also labeled: local DRAM, single DIMM.]

Measurement Approach l Find important functions l Compare performance counters at min/max core density l Identify key multicore bottleneck: ¡ L3 capacity – L3 miss rates increase with density ¡ Off-chip BW – BW usage at min density greater than share ¡ DRAM contention – DRAM page miss rates increase with density l For small and medium functions, follow up with light weight / temporal measurements 19

Important HOMME Loop

    do k=1,nlev
       do j=1,nv
          do i=1,nv
             T(i,j,k,n0) = T(i,j,k,n0) + smooth*(T(i,j,k,nm1) &
                  - 2.0D0*T(i,j,k,n0) + T(i,j,k,np1))
             v(i,j,1,k,n0) = v(i,j,1,k,n0) + smooth*(v(i,j,1,k,nm1) &
                  - 2.0D0*v(i,j,1,k,n0) + v(i,j,1,k,np1))
             v(i,j,2,k,n0) = v(i,j,2,k,n0) + smooth*(v(i,j,2,k,nm1) &
                  - 2.0D0*v(i,j,2,k,n0) + v(i,j,2,k,np1))
             div(i,j,k,n0) = div(i,j,k,n0) + smooth*(div(i,j,k,nm1) &
                  - 2.0D0*div(i,j,k,n0) + div(i,j,k,np1))
          end do
       end do
    end do

Apply Microfission (First Line)

Loop Microfission l Local, context free optimization l Each array processed independently ¡ Add high-level blocking to fit cache l Reduces total DRAM banks accessed ¡ Statistically reduces DRAM page miss rate l Reduces instantaneous working set size ¡ Helps with L3 capacity and off-chip BW 22

Microfission Results
[Chart: two data series across four categories; legend, axis labels, and data values not recoverable from the extracted text]
