Accelerating Exascale: How the End of Moore’s Law Scaling Is Changing the Machines You Use, the Way You Code, and the Algorithms You Use
Steve Oberlin, CTO, Accelerated Computing, NVIDIA
Mar 2, 2014
A Little Time Travel
The Last Single-CPU Supercomputer
Seymour’s Last (Successful) Supercomputer
“Attack of the Killer Micros”
My Last Supercomputer
Future Shock
The Cold Equations
Hitting a Frequency Wall? (G. Bell, History of Supercomputers, LLNL, April 2013)
How To Build A Frequency Wall
• Maxed-out power budget
• Depletion of ILP
• End of voltage scaling
“We’re running out of computer science...” (Justin Rattner, Micro2000 presentation, 1990)
The End of Voltage Scaling
The Good Old Days: leakage was not important, and voltage scaled with feature size.
    L' = L/2, V' = V/2, E' = CV² = E/8, f' = 2f, D' = 1/L² = 4D, P' = P
    Halve L and get 4x the transistors and 8x the capability for the same power.
The New Reality: leakage has limited threshold voltage, largely ending voltage scaling.
    L' = L/2, V' = ~V, E' = CV² = E/2, f' = 2f, D' = 1/L² = 4D, P' = 4P
    Halve L and get 4x the transistors and 8x the capability for 4x the power, or 2x the capability for the same power in ¼ the area.
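A quick check of the arithmetic above, using the slide's own per-switch energy model E = CV² and assuming capacitance scales with feature size (C' = C/2 when L' = L/2):

\[
\text{Dennard:}\quad E' = \tfrac{C}{2}\left(\tfrac{V}{2}\right)^{2} = \tfrac{E}{8}, \qquad P' = D' f' E' = (4D)(2f)\tfrac{E}{8} = P
\]
\[
\text{Post-Dennard:}\quad E' = \tfrac{C}{2}V^{2} = \tfrac{E}{2}, \qquad P' = (4D)(2f)\tfrac{E}{2} = 4P
\]

Here D is transistor density, f is clock frequency, and P is the power of a fixed die area: with voltage no longer scaling, the free 8x capability gain of classic scaling turns into a 4x power increase.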
The End of Historic Scaling (C. Moore, Data Processing in ExaScale-Class Computer Systems, Salishan, April 2011)
Chickens and Plows
Rise of Accelerated Computing
• Adoption of accelerators: the share of HPC customers with accelerators rose to 77%, up from 44% earlier in the 2010-2013 period (Intersect360 Research; IDC HPC End-User MSC Study, 2013).
• GPU-accelerated apps: 113 in 2011, 182 in 2012, 242 in 2013 (Intersect360 HPC User Site Census: Systems, July 2013).
• NVIDIA share of HPC accelerators: NVIDIA GPUs 85%, Intel Phi 11%, others 4%.
Accelerator Perf/Watt
(Chart: normalized SGEMM per watt by GPU generation, 2008-2016: Tesla, Fermi, Kepler, Maxwell, Pascal, rising steeply with each generation.)
GPUs Power World’s 10 Greenest Supercomputers
Green500 Rank | MFLOPS/W | Site
 1 | 4,503.17 | GSIC Center, Tokyo Tech
 2 | 3,631.86 | Cambridge University
 3 | 3,517.84 | University of Tsukuba
 4 | 3,185.91 | Swiss National Supercomputing (CSCS)
 5 | 3,130.95 | ROMEO HPC Center
 6 | 3,068.71 | GSIC Center, Tokyo Tech
 7 | 2,702.16 | University of Arizona
 8 | 2,629.10 | Max-Planck
 9 | 2,629.10 | (Financial Institution)
10 | 2,358.69 | CSIRO
37 | 1,959.90 | Intel Endeavor (top Xeon Phi cluster)
49 | 1,247.57 | Météo France (top CPU cluster)
The Exascale Challenge
The Efficiency Gap
2013: 20 PF at 10 MW, about 2 GFLOPS/W on LINPACK.
2020 goal: 1,000 PF (50x) at 20 MW (2x), i.e. 50 GFLOPS/W on LINPACK (25x).
Technology scaling supplies only 2-4x of that; circuits and architecture must deliver the remaining 6-12x.

Energy efficiency improvements due only to technology scaling:
Year                                      | 2013-14 | 2016  | 2020
Process node                              | 28 nm   | 16 nm | 7 nm
Logic energy scaling factor (0.70x)       | 1       | 0.70  | 0.49
Wire energy scaling factor (0.90x)        | 1       | 0.85  | 0.72
VDD (V)                                   | 0.9     | 0.80  | 0.75
Total power (W, 70% logic / 30% wires)    | 100     | 58    | 38
Efficiency improvement from scaling alone | 1.00    | 1.70  | 2.57
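The total-power row is consistent with a simple weighted model; as a hedged reconstruction (my reading, not stated on the slide), treat the logic and wire factors as capacitance scaling and apply the VDD reduction squared on top:

\[
P_{2016} \approx (0.7 \times 0.70 + 0.3 \times 0.85)\left(\tfrac{0.80}{0.90}\right)^{2} \times 100\,\mathrm{W} \approx 59\,\mathrm{W},
\qquad
P_{2020} \approx (0.7 \times 0.49 + 0.3 \times 0.72)\left(\tfrac{0.75}{0.90}\right)^{2} \times 100\,\mathrm{W} \approx 39\,\mathrm{W}
\]

which reproduces the table's 58 W and 38 W and its 1.70x and 2.57x efficiency factors, i.e. only a 2-3x gain from process technology alone.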
Pascal with HBM Stacked Memory
(Cross-section view: GP100 surrounded by HBM stacks on a passive silicon interposer over the package substrate.)
• 4x bandwidth
• More capacity
• ¼ power per bit
NVLink
(Diagram: Tesla GPU with 1 TB/s stacked HBM, linked by 80 GB/s NVLink to a POWER or ARM CPU with 50-75 GB/s DDR4.)
• 5x PCIe bandwidth
• Move data at CPU memory speed
• 3x lower energy/bit
SP Energy Efficiency @ 28 nm
(Chart: normalized single-precision energy efficiency for Fermi, Kepler, and Maxwell at the same 28 nm process node, rising with each generation.)
Cost of Computation vs. Communications (20 mm die, 28 nm IC)
• 64-bit DP operation: 20 pJ
• 256-bit access to an 8 kB SRAM: 50 pJ
• Moving 256 bits on-chip: 26 pJ for a short hop, 256 pJ across the die
• Efficient off-chip link: 500-1,000 pJ
• DRAM Rd/Wr: 16,000 pJ
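At these ratios a DRAM access costs hundreds of times more energy than the arithmetic it feeds, so the programming payoff comes from staging data in on-chip SRAM and reusing it. A minimal CUDA sketch of that standard pattern (illustrative only: the 3-point stencil, coefficients, and tile size are made up, and the launch is assumed to use 256-thread blocks):

    // Stage a tile of the input into on-chip shared memory so each element
    // fetched from DRAM is reused by several cheap on-chip reads.
    __global__ void stencil3(const float *in, float *out, int n)
    {
        __shared__ float tile[256 + 2];              // blockDim.x == 256, plus halo
        int gid = blockIdx.x * blockDim.x + threadIdx.x;
        int lid = threadIdx.x + 1;                   // offset past the left halo

        if (gid < n) tile[lid] = in[gid];            // one DRAM read per element
        if (threadIdx.x == 0 && gid > 0)                   tile[0] = in[gid - 1];
        if (threadIdx.x == blockDim.x - 1 && gid < n - 1)  tile[lid + 1] = in[gid + 1];
        __syncthreads();

        if (gid > 0 && gid < n - 1)                  // three reads, all from on-chip SRAM
            out[gid] = 0.25f * tile[lid - 1] + 0.5f * tile[lid] + 0.25f * tile[lid + 1];
    }

    // Launch: stencil3<<<(n + 255) / 256, 256>>>(d_in, d_out, n);

Each input element crosses the expensive DRAM interface once and is then consumed three times from shared memory, where the per-access energy is far lower.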
Cost of Computation vs. Communications
Cost of Computation vs. Communications
(Diagram: the same energy hierarchy inside a GPU, with SMs communicating through the on-chip crossbar, XBAR.)
Enhanced On-Chip Signaling (nTECH’13, John Wilson)
Assumptions: 180 W continuous power, 25 x 25 mm die, 6 TB/s bisection bandwidth, data moved an average of 15 mm.
• Standard P&R: energy cost 190 fJ/bit/mm; average signaling power 68 W (38% of GPU power); 25 mm delay ~17.0 ns
• “Custom wire”: energy cost 145 fJ/bit/mm; average signaling power 39 W (22% of GPU power); 25 mm delay ~12.5 ns
(Pie charts: GPU power split between compute + other and global signaling for each case.)
Attack of the Killer Smartphones [What if there were no long wires?]
TEGRA K1: Unify GPU and Tegra Architecture
The GPU architecture roadmap (Tesla → Fermi → Kepler → Maxwell) and the mobile roadmap (Tegra 3 → Tegra 4 → Tegra K1) converge: the TEGRA K1 mobile super chip has 192 CUDA-enabled Kepler cores, the same Kepler architecture that powers GeForce, Quadro, and Tesla products.
JETSON TK1
Development platform for embedded computer vision, robotics, medical.
• 192 Kepler cores, 326 GFLOPS
• 4 ARM A15 cores
• 2 GB DDR3L, 16-256 GB flash
• Gigabit Ethernet
• CUDA enabled
• 5-11 watts
• $192, available now
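A quick way to see what a board like this exposes to CUDA is a device-properties query; a minimal sketch using the standard CUDA runtime API (the printed fields are generic, nothing here is Jetson-specific):

    #include <cstdio>
    #include <cuda_runtime.h>

    // Enumerate CUDA devices and print the properties most relevant to an
    // embedded board: compute capability, SM count, clock, and memory size.
    int main()
    {
        int count = 0;
        if (cudaGetDeviceCount(&count) != cudaSuccess || count == 0) {
            std::printf("no CUDA devices found\n");
            return 1;
        }
        for (int d = 0; d < count; ++d) {
            cudaDeviceProp p;
            cudaGetDeviceProperties(&p, d);
            std::printf("device %d: %s, SM %d.%d, %d multiprocessors, %.0f MHz, %.1f GB DRAM\n",
                        d, p.name, p.major, p.minor, p.multiProcessorCount,
                        p.clockRate / 1000.0, p.totalGlobalMem / 1.0e9);
        }
        return 0;
    }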
Perf/Watt Comparison
K40 + CPU: peak SP 4.2 TFLOPS; SGEMM ~3.8 TFLOPS; memory 12 GB @ 288 GB/s; power 385 W total (235 W GPU + 150 W CPU and memory); ~10 SP GFLOPS/W.
TK1: peak SP 326 GFLOPS; SGEMM ~290 GFLOPS; memory 2 GB @ 14.9 GB/s; power (GPU + CPU) < 11 W working hard, roughly 1/35 of K40 + CPU; ~26 SP GFLOPS/W.
For the same power as K40 + CPU, you could have 10+ TFLOPS SP and 70 GB of DRAM @ 500+ GB/s.
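The perf/watt and same-power figures follow from straightforward ratios (rounded as on the slide):

\[
\frac{3.8\,\mathrm{TFLOPS}}{385\,\mathrm{W}} \approx 10\ \mathrm{GFLOPS/W},
\qquad
\frac{290\,\mathrm{GFLOPS}}{11\,\mathrm{W}} \approx 26\ \mathrm{GFLOPS/W}
\]
\[
\frac{385\,\mathrm{W}}{11\,\mathrm{W}} \approx 35
\;\Rightarrow\;
35 \times 290\,\mathrm{GFLOPS} \approx 10\,\mathrm{TFLOPS},
\quad 35 \times 2\,\mathrm{GB} = 70\,\mathrm{GB},
\quad 35 \times 14.9\,\mathrm{GB/s} \approx 520\,\mathrm{GB/s}
\]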
25x or 1 Exa?
Likely Exascale Node: Three Building Blocks (GPU, CPU, Network)
• GPU: throughput-optimized, runs the parallel code
• CPU: latency-optimized, runs the OS and pointer-chasing serial code
• NIC: connects on the order of 100K nodes
A direct evolution of the 2016 node:
• Programming-model continuity
• Specialized cores: GPU for parallel work, CPU for serial work
• Coherent memory system with stacked, bulk, and NVRAM
• Amortize non-parallel costs: increase the GPU:CPU ratio, smaller CPU
(Node diagram: DRAM stacks, bulk DRAM, and NVRAM behind memory controllers; latency-optimized cores (LOCs) and throughput-optimized cores (TOCs) with L2 caches on a network-on-chip; NIC and links to the system interconnect.)
LINPACK vs. Real Apps (Oreste Villa, Scaling the Power Wall: A Path to Exascale)
Future Programming Systems
A Simple Parallel Program

    forall molecule in set {
      forall neighbor in molecule.neighbors {
        forall force in forces {
          molecule.force =
            reduce_sum(force(molecule, neighbor))
        }
      }
    }
Why Is This Easy?

    forall molecule in set {
      forall neighbor in molecule.neighbors {
        forall force in forces {
          molecule.force =
            reduce_sum(force(molecule, neighbor))
        }
      }
    }

• No machine details
• All parallelism is expressed
• Synchronization is semantic (in the reduction)
We Can Make It Hard

    pid = fork();                   // explicitly managing threads

    lock(struct.lock);              // complicated, error-prone synchronization
    // manipulate struct
    unlock(struct.lock);

    code = send(pid, tag, &msg);    // partition across nodes
Programmers, Tools, and Architecture Need to Play Their Positions
Programmer:

    forall molecule in set {                    // launch a thread array
      forall neighbor in molecule.neighbors {   //
        forall force in forces {                //   doubly nested
          molecule.force =
            reduce_sum(force(molecule, neighbor))
        }
      }
    }

Tools: map foralls in time and space; map molecules across memories; stage data up and down the hierarchy; select mechanisms.
Architecture: exposed storage hierarchy; fast communication, synchronization, and thread mechanisms.
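As one concrete, hand-written illustration of the mapping that the tools and architecture are meant to absorb, here is a minimal CUDA sketch of the forall program above. The data layout, neighbor-list format, and pair_force() function are hypothetical stand-ins (and the slide's loop over multiple force kinds is collapsed into a single pair force), not the talk's actual code:

    // Hypothetical layout: fixed-stride neighbor lists, one accumulated force per molecule.
    struct Molecules {
        float3 *pos;        // position of each molecule
        float3 *force;      // accumulated force (output)
        int    *neighbors;  // n * max_neighbors neighbor indices
        int    *num_nbrs;   // neighbor count per molecule
    };

    // Stand-in for force(molecule, neighbor): a softened inverse-square pair force.
    __device__ float3 pair_force(float3 a, float3 b)
    {
        float dx = b.x - a.x, dy = b.y - a.y, dz = b.z - a.z;
        float r2 = dx * dx + dy * dy + dz * dz + 1e-6f;   // softening avoids divide-by-zero
        float s  = rsqrtf(r2) / r2;                       // ~1/r^3 along (b - a)
        return make_float3(s * dx, s * dy, s * dz);
    }

    // "forall molecule" becomes a thread array (one thread per molecule);
    // the inner foralls become a loop with a per-thread reduce_sum.
    __global__ void compute_forces(Molecules m, int n, int max_neighbors)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;

        float3 sum = make_float3(0.f, 0.f, 0.f);
        for (int k = 0; k < m.num_nbrs[i]; ++k) {
            int j = m.neighbors[i * max_neighbors + k];
            float3 f = pair_force(m.pos[i], m.pos[j]);
            sum.x += f.x; sum.y += f.y; sum.z += f.z;
        }
        m.force[i] = sum;
    }

    // Launch: compute_forces<<<(n + 255) / 256, 256>>>(mols, n, max_neighbors);

Everything below the forall version (thread-array shape, memory layout, staging) is exactly the work the slide argues should move into tools and hardware mechanisms.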
System Functions -> Application Optimizations
• Energy management: power allocation among LOCs and TOCs
• Resilience: failure-tolerant applications by design
Conclusions
Exascale (25x) is Within Reach (Not so sure about Zetta-scale…)
• Requires clever circuits and ruthlessly efficient architecture: Moore’s Law cannot be relied upon
• Need to exploit locality: > 100:1 global vs. local energy cost
• Need to expose massive concurrency: an exaflop at O(GHz) clocks ⇒ O(10 billion)-way parallelism (see the arithmetic below)
• Need to simplify programming and automate mapping: “MPI + X” is only a step in the right direction
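The concurrency bullet is simple arithmetic (a rough estimate assuming roughly one operation completes per parallel lane per cycle, with additional operations in flight to hide latency):

\[
\frac{10^{18}\ \mathrm{FLOP/s}}{\sim 10^{9}\ \mathrm{cycles/s}} = 10^{9}\ \text{operations per cycle}
\;\Rightarrow\; \mathcal{O}(10^{10})\text{-way concurrency once latency hiding is included}
\]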
Questions?