
Application Accelerators - PowerPoint PPT Presentation

Application Accelerators: Deus ex machina? CCGSC, Flat Rock, North Carolina


  1. Application Accelerators: Deus ex machina?
     CCGSC, Flat Rock, North Carolina
     Jeffrey S. Vetter
     Oak Ridge National Laboratory and Georgia Institute of Technology

  2. Highlights
     - Background and motivation
       - Current trends in architectures favor two strategies
         - Homogeneous multicore
         - Application accelerators
     - The correct architecture for an application can provide astounding results
     - Challenges to adopting application accelerators
       - Performance prediction
       - Productive software systems
     - Solutions from Siskiyou
       - Modeling assertions
       - Multi-paradigm procedure call

  3. The Drama
     - Years of prosperity
       - Increasing large-scale parallelism
       - Increasing number of transistors
       - Increasing clock speed
       - Stable programming models and languages
     - Notable constraints force a new utility function for architectures
       - Signaling
       - Power
       - Heat / thermal envelope
       - Packaging
       - Memory, I/O, interconnect latency and bandwidth
       - Instruction level parallelism
       - Market trends favor ‘good enough’ computing (Economist)

  4. Current Approaches to Continue Improving Performance
     - Chip Multiprocessors
       - Homogeneous multicore
       - Intel
       - AMD
       - IBM
     - Application accelerators to augment general purpose multi-cores

  5. Results from Initial Multicores Provide Performance Boost
     [Figures: DGEMM and POP results]

  6. Quad Kilo-core chips are on the way!
     - 4 core chips coming
     - 8 core chips likely
     - ??
     - Rapport
       - Rapport currently offers a 256 core chip
       - Planning 1024 core chip in 2007
       - Kilocore™
       - Targeted at mobile and other consumer applications

  7. Enter Application Accelerators
     - Optional hardware installed to accelerate applications beyond the performance of the general purpose processor

     |                                  | Intel Woodcrest Dual Core | NVIDIA Quadro FX 4500 GPU | NVIDIA GeForce 6600 GPU | IBM Cell Processor      | ClearSpeed Avalon     |
     |----------------------------------|---------------------------|---------------------------|-------------------------|-------------------------|-----------------------|
     | clock frequency                  | 3.0 GHz                   | 470 MHz                   | 350 MHz                 | 3.2 GHz                 | 250 MHz               |
     | type                             | CPU                       | accelerator card          | accelerator card        | CPU                     | accelerator card      |
     | power usage                      | 80 W                      | 110 W                     | 30 W                    | 100 W                   | 20 W                  |
     | speed, single / double precision | ~48 GFLOPS / ~24 GFLOPS   | 180 GFLOPS / NA           | 20 GFLOPS / NA          | 256 GFLOPS / 25 GFLOPS  | 50 GFLOPS / 50 GFLOPS |
     | typical size                     | CPU socket                | PCIe / MXM 1 card         | PCIe / MXM 1 card       | CPU socket              | PCI-X card            |
     | cooling                          | heatsink + fan            | heatsink + fan            | HS-only or HS+fan       | heatsink + fan          | HS-only               |
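As a rough illustration of why power matters in the slide 7 comparison, the sketch below divides each device's peak single-precision rate by its listed power usage. The figures are copied from the slide; the dictionary layout and the helper name `gflops_per_watt` are assumptions made for this example, and peak rates say nothing about sustained application performance.

```python
# Peak single-precision GFLOPS and power usage (W), as listed on slide 7.
# The Woodcrest rate is approximate ("~48 GFLOPS") on the slide.
DEVICES = {
    "Intel Woodcrest Dual Core": (48.0, 80.0),
    "NVIDIA Quadro FX 4500 GPU": (180.0, 110.0),
    "NVIDIA GeForce 6600 GPU": (20.0, 30.0),
    "IBM Cell Processor": (256.0, 100.0),
    "ClearSpeed Avalon": (50.0, 20.0),
}


def gflops_per_watt(devices):
    """Return {device: peak GFLOPS per watt} (hypothetical helper)."""
    return {name: gflops / watts for name, (gflops, watts) in devices.items()}


if __name__ == "__main__":
    ratios = gflops_per_watt(DEVICES)
    # Print devices from most to least power-efficient on paper.
    for name, ratio in sorted(ratios.items(), key=lambda kv: kv[1], reverse=True):
        print(f"{name:28s} {ratio:5.2f} GFLOPS/W")
```

On these peak numbers the Cell and the ClearSpeed card land near 2.5 GFLOPS/W, against 0.6 GFLOPS/W for the Woodcrest.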

  8. For Example … Graphics Cards

  9. For Example … STI Cell

  10. For Example … ClearSpeed

  11. For Example … FPGAs

  12. AMD Torrenza Ecosystem

  13. Architectures that Match Application Requirements can offer Impressive/Astounding Performance Benefits
      - Geo-registration on GPU
        - 700x speedup over commodity processor
      - Numerous FPGA results on integer, logic, flop applications
        - 40x on Smith-Waterman
        - 10x speedup on MD
      - HPCC RandomAccess on Cray X1E
        - 7 GUPS on 512 MSPs
        - 32 GUPS on 64,000 procs

      Molecular Dynamics

      | System          | Seconds |
      |-----------------|---------|
      | Cell PPE        | 0.425   |
      | Cell SPE        |         |
      | MTA2 w/32 procs | ~0.035  |
      | 2.2GHz Opteron  | 0.125   |
      | Cell w/ 8 SPEs  | 0.013   |
      | GPU (7900GT)    | 0.012   |

      [Figure: Video Imagery Geo-registration, 2k x 2k Output. Time (seconds) vs. Input Image Size (pixels), 512x512 to 2048x2048. Series: CPU P4 2.4GHz; GPU GeForce 6600 and GPU QuadroFX 4500, each with and without readback]
      [Figure: Arbitrary Kernel, 32-bit, 4-color, 64x64 Image. Time (sec) vs. Kernel Size, 3 to 25. Series: CPU P4 (debug), CPU P4 (opt), GeForce 6600, QuadroFX 4500]
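To put the molecular dynamics timings on slide 13 on a common scale, the sketch below converts them to speedups. The times are copied from the slide; taking the 2.2GHz Opteron as the baseline and the helper name `speedups` are assumptions made for this example.

```python
# Molecular dynamics times in seconds, as listed on slide 13.
# The MTA2 time is approximate ("~0.035") on the slide.
MD_SECONDS = {
    "2.2GHz Opteron": 0.125,
    "Cell PPE": 0.425,
    "MTA2 w/32 procs": 0.035,
    "Cell w/ 8 SPEs": 0.013,
    "GPU (7900GT)": 0.012,
}


def speedups(times, baseline):
    """Speedup of each system over `baseline` (baseline time / system time)."""
    base = times[baseline]
    return {name: base / t for name, t in times.items()}


if __name__ == "__main__":
    for name, s in speedups(MD_SECONDS, "2.2GHz Opteron").items():
        print(f"{name:16s} {s:6.2f}x")
```

A value below 1 (the Cell PPE on its own) means the system is slower than the chosen baseline.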

  14. Disruptive Technologies and the S-Curve
      - Déjà vu?
        - Floating Point Systems accelerator (1970-80s)
        - Weitek coprocessors (1980s)
      - Some differences …
        - Flops are free
        - Power and thermal envelopes are constraining designs

  15. Significant Hurdles to Adoption for Accelerators (and multicores?)
      - Performance prediction
        - Should my organization purchase an accelerator?
        - What will be the performance improvement on my application workload with the accelerator?
        - Is the accelerator working as we expect?
        - How can I optimize my application for the accelerator?
      - Productive software systems
        - Do I have to rewrite my application for each accelerator?
        - How stable is the performance across systems?

  16. Performance Modeling

  17. Modeling Assertions Introduction
      - We need new application performance modeling techniques for HPC to tackle scale and architectural diversity
        - Performance modeling is quite useful at many stages in the architecture and application development process
      - Existing approaches
        - Manual
          - Application driven
        - Automated
          - Target architecture driven
        - Black box schemes: accurate, but applicability to a range of applications and systems is unknown
      - Goals
        - Aim to combine analytical and empirical schemes
        - A framework for systematic model development: performance engineering of applications
        - Modular
        - Hierarchical
        - Separate application and system variables
        - Based on ‘user’ or ‘code developer’ input; no magical solution
        - Scalable: future application and system configurations

  18. Symbolic Performance Models with MA
      Modeling Assertion (MA) = Empirical data + Symbolic modeling
      - Advantages over traditional modeling techniques
        - Modularity, portability and extensibility
        - Parameterized, symbolic models are evaluated with Matlab and Octave
      - Construct, validate, and project application requirements as a function of input parameters

      [Workflow diagram]
      - Declare important application variables
      - Declare important application operations
      - Annotate code with MA API
      - Validate Modeling Assertions empirically at runtime
      - Incrementally refine model based on error rates by adding and modifying variable and operation declarations
      - Terminate when model is representative & error level is acceptable
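The validate-and-refine loop on slide 18 can be sketched roughly as follows. This is not the MA toolset itself: the function names, the 10% tolerance, the candidate models, and the measurements are all made up for illustration.

```python
def relative_error(predicted, measured):
    """Relative error of a model prediction against a runtime measurement."""
    return abs(predicted - measured) / abs(measured)


def validate(model, measurements, tolerance=0.10):
    """Check a symbolic model against empirical data (hypothetical helper).

    model        : callable mapping an input parameter to a predicted count
    measurements : {input parameter: measured count} from instrumented runs
    tolerance    : acceptable relative error (assumed value, not from the slides)
    Returns (acceptable, worst_error).
    """
    worst = max(relative_error(model(n), m) for n, m in measurements.items())
    return worst <= tolerance, worst


if __name__ == "__main__":
    # Made-up operation counts measured for three problem sizes.
    measured = {100: 20_500, 200: 81_000, 400: 322_000}

    def first_guess(n):   # initial assertion: count grows as n^2
        return n * n

    def refined(n):       # assertion after adding a missing operation declaration
        return 2 * n * n + 5 * n

    print(validate(first_guess, measured))  # error too large: keep refining
    print(validate(refined, measured))      # within tolerance: terminate
```

The loop ends once the worst-case error across the measured configurations falls within the chosen tolerance, which mirrors the "terminate when the error level is acceptable" step.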

  19. MA Framework
      - Source code annotation
        - MA API in C (for Fortran & C applications with MPI)
        - Classes of API calls currently implemented and tested
          - ma(f)_subroutine_start/end
          - ma(f)_loop_start/end
          - ma(f)_flop_start/stop
          - ma(f)_heap/stack_memory
          - ma(f)_mpi_xxxx
          - ma(f)_set/unset_tracing
      - Runtime system: generates trace files
      - Model validation
      - Control flow model

            main () {
              …..
              loop (NAME = conj_loop) (COUNT = niter) {
                loop (NAME = norm_loop) (COUNT = l2npcols) {
                  mpi_irecv (NAME = nrecv) (SIZE = dp * 2);
                  mpi_send (NAME = nsend) (SIZE = dp * 2);

      - Symbolic model

            send = niter*(l2npcols*(dp*2)+l2npcols*(dp)+cgitmax*(l2npcols*(dp*na/num_proc_cols)+dp*na/num_proc_cols+l2npcols*(dp)+l2npcols*(dp))+l2npcols*(dp*na/num_proc_cols)+dp*na/num_proc_cols+l2npcols*(dp))

      - Post-processing toolset (in Java)
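The slides say the symbolic models are evaluated with Matlab and Octave. Purely as an illustration, the sketch below transcribes the `send` expression from slide 19 into Python, term for term. The function name `send_volume`, the default `dp = 8`, and the sample parameter values are assumptions, and the slide does not state the unit of the result.

```python
def send_volume(niter, cgitmax, l2npcols, na, num_proc_cols, dp=8):
    """Evaluate the slide's symbolic `send` model.

    dp is taken to be the size of one double-precision value;
    8 is an assumption, not a value given on the slide.
    """
    seg = dp * na / num_proc_cols  # recurring dp*na/num_proc_cols term
    return niter * (
        l2npcols * (dp * 2)
        + l2npcols * dp
        + cgitmax * (l2npcols * seg + seg + l2npcols * dp + l2npcols * dp)
        + l2npcols * seg
        + seg
        + l2npcols * dp
    )


if __name__ == "__main__":
    # Hypothetical configuration chosen only to exercise the formula.
    for cols in (2, 4, 8):
        l2 = cols.bit_length() - 1  # log2 of a power-of-two column count
        total = send_volume(niter=15, cgitmax=25, l2npcols=l2,
                            na=14000, num_proc_cols=cols)
        print(f"num_proc_cols={cols}: send = {total:.0f}")
```

Keeping application variables (`na`, `niter`, `cgitmax`) apart from system variables (`num_proc_cols`, `l2npcols`) is what lets the same expression be projected to configurations that have not been run yet.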
